Google: How do you do it?

So it’s not a big surprise that an oft-requested feature for Beagle is the ability to index a user’s Gmail messages (like Google Desktop Search). Today we (the Beagle developers) started to investigate just how this is done. While POP3 (and now IMAP) are available, using them means downloading all of a user’s mail, indexing it, and then caching the text so we can display it. Now, my initial investigation into GDS for Linux revealed that it was calling home via POP3S and downloading lots of data. I have assumed that it is simply iterating over all messages (via POP3), downloading them, indexing them, and caching the compressed content somewhere in Google’s custom indexes.
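
To make the assumed "iterate, download, index" loop concrete, here is a minimal sketch of a POP3S crawler using Python's standard `poplib` module. It is only an illustration of the guess above, not what GDS or Beagle actually does; the function names `iter_messages`, `body_text`, and `crawl` are hypothetical.

```python
import poplib
from email import message_from_bytes
from email.message import Message
from typing import Iterator, Tuple


def iter_messages(conn) -> Iterator[Tuple[int, Message]]:
    """Yield (message number, parsed message) for every message in the mailbox.

    `conn` is any object with poplib.POP3-style list() and retr() methods.
    """
    _resp, listings, _octets = conn.list()
    for entry in listings:
        # Each listing looks like b"1 1234": message number, then size in octets.
        number = int(entry.split()[0])
        _resp, lines, _octets = conn.retr(number)
        yield number, message_from_bytes(b"\r\n".join(lines))


def body_text(msg: Message) -> str:
    """Concatenate the text/plain parts of a message, ready for indexing."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label; fall back to UTF-8
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def crawl(host: str, user: str, password: str) -> Iterator[Tuple[int, str]]:
    """Connect over POP3S (port 995 by default) and yield (number, text)."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        for number, msg in iter_messages(conn):
            yield number, body_text(msg)
    finally:
        conn.quit()
```

For Gmail the host would presumably be `pop.gmail.com` with POP access enabled on the account; that is an assumption here, since nothing above confirms which server GDS talks to. Every message body crosses the wire once, which matches the "lots of data" observation.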

Now, I had originally planned on this post being an open plea to anyone and everyone at Google asking them to open up the Gmail access API, but seeing as it’s just plain old ugly POP3 (maybe with a cool extension), we’re stuck biting the bullet and implementing a remote mail access layer.

Anyways, given how incredible Google has been in a million other situations, I thought I would throw out 2 wildly out-of-this-world questions. I wouldn’t expect to get a response, but before I spend the time figuring it all out, I felt like I should at least ask.

  • Are there some special POP extensions available in Gmail? Is there some helper web API? Or does GDS really just have a POP3 crawler?
  • Is your compression/text storage library open source (or documented in some research paper at all)? Beagle has always struggled with how best to handle storing copies of a document’s text so that it can be made available in interfaces. While we do have a new hybrid text cache (text over 4k on the filesystem, anything smaller in a sqlite db, all compressed), we are still nowhere near as small as the GDS indexes. A cursory examination reveals that the GDS indexes are some form of b-tree on disk, but how are you compressing all that text so small? Is there some substitution/reconstruction algorithm? (It seems like that would be wildly expensive, but who knows.)
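
For readers unfamiliar with the hybrid text cache mentioned in the second question, the layout can be sketched roughly as follows. This is a hypothetical Python illustration, not Beagle's actual code: the class name `TextCache`, the table schema, the file naming, and the choice of zlib are all assumptions; only the 4k split between sqlite and the filesystem comes from the description above.

```python
import hashlib
import os
import sqlite3
import zlib
from typing import Optional

THRESHOLD = 4096  # the "4k" cutoff, measured here in bytes of UTF-8 text


class TextCache:
    """Compressed text store: small entries in sqlite, large ones in files."""

    def __init__(self, root: str) -> None:
        os.makedirs(root, exist_ok=True)
        self.root = root
        self.db = sqlite3.connect(os.path.join(root, "cache.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS texts "
            "(uri TEXT PRIMARY KEY, data BLOB, path TEXT)"
        )

    def put(self, uri: str, text: str) -> None:
        raw = text.encode("utf-8")
        packed = zlib.compress(raw, 9)  # assumed codec, highest level
        if len(raw) > THRESHOLD:
            # Large text: write a compressed file, keep only its name in the db.
            name = hashlib.sha1(uri.encode("utf-8")).hexdigest() + ".z"
            with open(os.path.join(self.root, name), "wb") as fh:
                fh.write(packed)
            row = (uri, None, name)
        else:
            # Small text: keep the compressed bytes inline as a BLOB.
            row = (uri, packed, None)
        self.db.execute("INSERT OR REPLACE INTO texts VALUES (?, ?, ?)", row)
        self.db.commit()

    def get(self, uri: str) -> Optional[str]:
        row = self.db.execute(
            "SELECT data, path FROM texts WHERE uri = ?", (uri,)
        ).fetchone()
        if row is None:
            return None
        data, name = row
        if name is not None:
            with open(os.path.join(self.root, name), "rb") as fh:
                data = fh.read()
        return zlib.decompress(data).decode("utf-8")
```

Each entry is compressed on its own, so redundancy shared between documents is never exploited. That may be part of why a per-document scheme like this ends up larger than an index that compresses across documents, though this is speculation rather than anything known about GDS.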

Anyways, it’s a long shot, and it’s pretty far out there, but for the sake of not passing up answers that I can’t seem to find elsewhere on the net, I have asked.

13 Responses to “Google: How do you do it?”


  1. 1 Andy

    I’ve never used GDS, but can’t you just get it to index your gmail while running wireshark to answer question 1?

  2. 2 nick

    I’m not sure if this is related… but I know gmail supports IMAP http://en.wikipedia.org/wiki/Imap
    The IMAP protocol has some features beyond POP3 including server side search support.

  3. 3 Kevin Kubasik

    That’s how I figured out they were using POP3S. Since it’s encrypted, I can’t see whether there is a non-standard POP3 extension in use. Either way, we would probably want to support indexing a regular POP3 account as well, but I’m mostly just curious.

  4. 4 Kevin Kubasik

    @nick: There is some support. However, the problem is that IMAP tends to be a much more complex protocol, the Gmail implementation isn’t 100% complete, and right now there are some performance problems on that front.

    Also, since the indexing is really a one-time sequential skim, POP really meets most of our needs. Really, I would think that most users (especially now with IMAP) could just register their Gmail accounts in Thunderbird, Evolution or KMail to get them indexed.

  5. 5 glandium

    Surely, GDS uses some open source library for SSL encryption, so you could replace it with your own that would log the data sent to be encrypted.

    Anyways, my guess is that GDS gets its indexes directly from Gmail, via a proprietary Google POP extension. Why download a lot of data and index it yourself when powerful servers have already done it for you? All GDS has to do is use the same indexing algorithms as Google, which is not impossible. At least, that’s how I’d do it.

  6. 6 Walther

    Take a look at http://libgmail.sourceforge.net
    It is a Python library for Gmail. I think it is HTTP-based, but I didn’t look at how it works exactly. It seems to be pretty efficient for searches.

  7. 7 grakic

    Can’t you just use HTTP to get search results from Gmail? IMHO, search results are useless if the user is offline, so I don’t see a major drawback here.

  8. 8 Kevin Kubasik

    If we were just querying Google, then that would be an easy solution. However, many people like having search results locally, and it’s a marketed feature of GDS, unless I’m misunderstanding what people want/are looking for.

  9. 9 Tobu

    Re index size. In Beagle, do you compress words by replacing them with a reference to an index entry (assuming a minimum length and a minimum number of occurrences)? I suspect it would cut down size. You would need index entries to be permanent then (or possibly refcounted).

  10. 10 noname

    I’m not sure, but maybe you’ll find this useful: http://tokyocabinet.sourceforge.net/

  11. 11 Kevin Kubasik

    @Tobu: We don’t do this at the moment. I’ve thought about doing it, but it seems like it would cost far too much at retrieval time. I might look into a proof-of-concept implementation to test this, but it’s a lot of work if it ends up being too expensive =/

    @noname: yeah, I checked out some dbm implementations, I couldn’t figure out if we were gonna get compressed text content, it seems like most of them just dump the content on disk, much like our old system which took up far too much space… anyone know anything specific wrt this?

  12. 12 Tobu

    I’ve been reading these lectures:
    http://www.ee.technion.ac.il/courses/049011/spring05/index_files/Page337.html

    The second one touches on indexes. No huge insights there (delta compression maybe?), but it did clarify some concepts for me.

  13. 13 RyanTheRobot

    I seriously wonder how google does half the things that it does… and how their product quality is constantly one of the best in the industry. They are such an awesome company…
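
The word-substitution idea from comments 9 and 11 can be sketched in a few lines. This is a hypothetical Python illustration of the general technique (a shared word table, with frequent words replaced by integer references), not something Beagle or GDS is known to do; the function names and the thresholds are assumptions.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Union

Token = Union[str, int]  # literal text, or an index into the word table


def build_table(texts: Iterable[str], min_len: int = 4,
                min_count: int = 2) -> List[str]:
    """Collect words that are long and frequent enough to be worth a reference."""
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"\w+", text)
        if len(word) >= min_len
    )
    return sorted(word for word, n in counts.items() if n >= min_count)


def encode(text: str, index: Dict[str, int]) -> List[Token]:
    """Replace table words with their integer reference; keep the rest literal."""
    # The pattern splits the text into alternating word and non-word runs,
    # so joining the pieces back together reproduces the input exactly.
    return [index.get(piece, piece) for piece in re.findall(r"\w+|\W+", text)]


def decode(tokens: List[Token], table: List[str]) -> str:
    """Rebuild the original text: one table lookup per referenced word."""
    return "".join(table[t] if isinstance(t, int) else t for t in tokens)


docs = ["beagle indexes mail", "beagle indexes files too"]
table = build_table(docs)
index = {word: i for i, word in enumerate(table)}
encoded = [encode(doc, index) for doc in docs]
assert [decode(tokens, table) for tokens in encoded] == docs
```

The retrieval cost raised in comment 11 shows up in `decode`: every referenced word needs a table lookup, and the table entries must stay stable (or be refcounted, as comment 9 suggests) for as long as any stored text refers to them.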
