Google: Oh No You Didn’t!

October 5th, 2006

Today the tech community (at least all of my friends) is abuzz with news of Google’s new Code Search mechanism. Now, this is just cool. From this day forward, when I’m struggling with some poorly documented, hard-to-use, or even private API, I should be able to just type it into Google Code Search and see how other people have managed to use it.

But that’s only the useful angle – not enough to really create buzz on the net. The two things people are having fun with today are exploring the answers to these questions:

  1. What does code search know about me?
  2. What private information does code search know about others?

The first is the natural extension of the ego search that many of us commit on a regular basis (or have RSS subscriptions set up to do for us). It’s fun to read about yourself, especially when somebody else is doing the writing. For instance, I learned of several new “thanks to Daniel Jalkut” type comments in source code and readme files. Neat! I like that.

The second is more problematic. Google grabbed a bunch of the world’s “source code” … basically anything it could find with a suitable file extension, and made it easily searchable. What’s wrong with this? A lot of files with source-code extensions actually contain sensitive information, but have been mistakenly left world-readable on some web server. For instance, John Gruber points out the rather stunning example of WordPress database configuration files, including the database login and password information. He directs our attention towards Jason Kottke, who has assembled several other interesting phenomena. I personally am amused by the search “This file contains proprietary and confidential information.”
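If you'd rather find your own leaks before somebody else does, here is a minimal sketch of a local check you could run against a copy of your web root. It only illustrates the idea and has nothing to do with how Google's Code Search works internally. The function name, the extension list, and the patterns are all my own assumptions, chosen for illustration: `DB_PASSWORD` is the constant WordPress configuration files use for the database password, and the "proprietary and confidential" phrase is the banner mentioned above.

```python
import os
import re

# Hypothetical choices for illustration only: extend both lists for your own site.
SOURCE_EXTENSIONS = {".php", ".inc", ".py", ".pl", ".rb", ".conf", ".bak", ".txt"}
SENSITIVE_PATTERNS = [
    re.compile(r"DB_PASSWORD", re.IGNORECASE),
    re.compile(r"proprietary and confidential", re.IGNORECASE),
    re.compile(r"password\s*=", re.IGNORECASE),
]


def find_sensitive_files(root):
    """Return sorted (relative_path, line_number) pairs for suspicious lines under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only look at files whose final extension resembles source or config.
            if os.path.splitext(name)[1].lower() not in SOURCE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(p.search(line) for p in SENSITIVE_PATTERNS):
                            hits.append((os.path.relpath(path, root), number))
            except OSError:
                # Unreadable files are skipped rather than treated as errors.
                continue
    return sorted(hits)
```

Calling `find_sensitive_files` on a directory returns the relative path and line number of each match, so a forgotten backup such as `wp-config.php.bak` would show up with the line holding the password. A match only tells you the file is worth a look; whether it is actually reachable from the web depends on your server configuration.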

Now, the quite reasonable reaction we’re likely to hear from Google is, “This was already public information, we’re just indexing it.”

True! But let’s not dismiss the power of indexing. Google is too big to “just index” anything. They’re the search engine of record. Too big to blunder with technology that endangers the innocent. I imagine that with 8000 employees, at least several hundred of them are smart coders who have been beta testing this service for several weeks or months. The chances of them not noticing these funny holes seem vanishingly small, considering that among my friends they were the first things we observed.

So what should they do? Stand in the way of progress to protect the innocent? I’m sure dealing with problems like this will become less onerous as time goes on and people become more sophisticated about protecting their own privacy, but until that happens, Google has special responsibilities. When they substantially advance the state of information retrieval on a world-wide basis, they should think about how they can soften the negative blows of those advances.

It’s hard to say what Google should have done, but even a well-publicized warning might have helped. For those who have been compromised, I imagine their opinion of Google would be a lot higher if the buzz last week had been about the forthcoming advancement and what it meant for everybody’s privacy.

10 Responses to “Google: Oh No You Didn’t!”

  1. Zac White Says:

    Not to mention they dig into zip, gz, and tar files. People could have purposefully compressed their source code so they could keep it from being indexed…

  2. Erik J. Barzeski Says:

    Google can only index files that are otherwise linked to from somewhere, though, right? How did Google figure out to look at a .bz2 file inside of “http://www.anaisabel.net/” for example?

    Unless all of these folks also had directory listings on, of course.

  3. Pete Clark Says:

    Also worth looking at Krugle (http://www.krugle.com) – does the same thing, somewhat nicer UI (if you like web-2.0-y stuff)

  4. alexr Says:

    It would be better if it kept a cookie with my licensing preferences. e.g. no GPL code — it’s not really free.

  5. Bob Says:

    Also: http://koders.com/

  6. bjkeefe Says:

    I think you’re completely right about this one, Daniel.

    I’m a big fan of open source, and freely available information in general. But there *are* limits. Cliff Stoll put it nicely in “The Cuckoo’s Egg,” when he came up with the analogy of a small town in which few people lock their front doors. He then asks, should we be thankful to someone who comes into this town and starts testing all the doorknobs, and publishes a list of all the ones that turn?

    Being able to Google the world’s code base will be a great resource. But you’re right: Google should not have launched this new technology without more fully considering the ramifications.

  7. Scott Stevenson Says:

    I learned of several new “thanks to Daniel Jalkut” type comments in source code and readme files

    Seriously. I found out that I unknowingly contributed to Cocolicious.

  8. Bob Says:

    This problem is not limited to code search. Run Google searches on “For Internal Use Only” or “Company Confidential” or “Not for Distribution” on the main Google site.

    Lots of people have improperly configured web sites where this stuff is just left out in the open for Google to find. And it’s not limited to HTML pages. Google can search Word docs, PowerPoint files, and PDFs as well.

    Personally, I think it’s up to the owner of proprietary information to make that information secure. Google gives people an easy tool to find out if their information is available when it shouldn’t be. Better that I find the security flaws through Google than for a hacker to find them and for me to be none the wiser.

    Bob

  9. Jack Says:

    Bob makes an interesting point. Along the lines of, “If it’s on the internet, it’s inevitable people will find it. Having Google index it just means you need to be even more cautious.”

    Although I think right now, I tend to side a little bit more with you, Daniel. I think search engines are at least partly responsible for exposing information that wasn’t meant for the world to see.

  10. Bret Says:

    Security thru obscurity is no security at all… but, yeah – a heads-up would have probably been a good idea too… OTOH, how would you know if something private of yours was on there, if you couldn’t access the index?
