It was many a long year ago when a brave man first suggested that beagle buckle down and offer comprehensive and complete indexing of archives. Not just plain text files, or certain subsets: beagle was to inspect every file, inside an archive or out, with the same degree of scrutiny. This was suggested over a year ago, and just this week, thanks to dBera’s hard work, it finally got into CVS. It's not done, it's not polished, but it's there, and it's functional.
The biggest accomplishment by far was the addition of graceful and complete child indexables support in the file system backend. While at first this seems like a very technical detail, and not of much interest, it actually enables beagle to do a lot of things that it couldn’t before. For example, if you happen to have some old mail files lying around with an embedded attachment, beagle will read and index those correctly, almost as if they came from a mail backend. (There's still work to do if we want that as a use case, but it's funny that it happened.) This also means beagle can handle those pesky gzipped man/monodoc pages (not quite yet, but soon; I still have to add single-entry support).
Despite the rampant awesomeness that is just radiating out of beagle at the moment, the Archive support still has a long way to go. I filed a bug about it, but wanted to go into more depth here.
1) We have an open connection to the archive for the entire time it takes us to index it. This could run into hours for large archives, and if we were to crash, we just killed someone's backup....
2) GUI side of things. Not a whole lot to say here, other than we should probably try to do something intelligent with all the results as well (maybe collapse them into one result that can be expanded, like we talked about doing with the WebHistory). Other than that, nothing too crucial. It would be nice if we added a context menu option to extract the file in question, but that's definitely not on the short list.
3) File locking. It's rare and sporadic, as well as near impossible to reproduce, but when you get a couple thousand temporary files going asynchronously, it was bound to happen. Beagle recovers smoothly, but it's probably indicative of a greater instability that goes hand in hand with that many async actions.
4) Mid-index shutdowns don't purge temporary files. That left me with a 300 MB buildup when I was testing.
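To make item 4 a bit more concrete, here is a rough sketch of one way to deal with leftovers: keep all archive temporaries under one dedicated directory and wipe it on startup, since a crash or hard shutdown never gets the chance to clean up on the way out. This is Python rather than beagle's own code, and the names `purge_stale_temp_files`, `archive-tmp`, and `demo` are made up for illustration; nothing here is beagle's actual API.

```python
import shutil
import tempfile
from pathlib import Path


def purge_stale_temp_files(temp_root: Path) -> int:
    """Delete everything left under temp_root; return the bytes reclaimed.

    Assumes temp_root is used only for archive temporaries, so anything
    found there at startup is left over from an interrupted run.
    """
    reclaimed = 0
    if temp_root.exists():
        for path in temp_root.rglob("*"):
            if path.is_file():
                reclaimed += path.stat().st_size
        shutil.rmtree(temp_root)
    # Recreate the directory so the indexer can start writing right away.
    temp_root.mkdir(parents=True)
    return reclaimed


def demo():
    """Simulate an interrupted run that left two temp files behind."""
    base = Path(tempfile.mkdtemp())
    temp_root = base / "archive-tmp"  # hypothetical dedicated temp directory
    temp_root.mkdir()
    (temp_root / "entry-1.tmp").write_bytes(b"x" * 1024)
    (temp_root / "entry-2.tmp").write_bytes(b"y" * 2048)

    reclaimed = purge_stale_temp_files(temp_root)
    remaining = [p.name for p in temp_root.iterdir()]
    shutil.rmtree(base)
    return reclaimed, remaining
```

Purging at startup does not replace cleaning up during a normal shutdown, but it does put a ceiling on how much junk a bad exit can leave behind.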
Which leaves us with our work cut out for us, but I already started making the ArchiveFilter a little more sane. I have this patch, which takes care of most of the crazy temporary file madness and keeps the ChildIndexables cleaning up after themselves. While I was trying to figure out how to handle single-entry archives within our current setup, I realized that it's probably just easier to create a FilterCompressed filter or the like, which handles individually compressed files (like manpages). I dunno, I’m looking at both options, but considering it's pretty common to have beagle index someone's /usr/share/doc folder, the manpage parser should probably err on the lighter and faster side, since the full ArchiveFilter is a bit of a memory hog (comes with the territory). Another goal of mine is to stop us from reading all the child indexables into one array, and make it an enumerator-style system instead.
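For the curious, the enumerator idea looks roughly like this. It is a Python sketch using a tar archive as a stand-in, not the real ArchiveFilter, and `iter_child_files` and `demo` are hypothetical names. Each child is extracted to a temporary file only when the consumer asks for it, and that file is removed as soon as the consumer moves on, so at most one temporary exists at a time and nothing is collected into an array.

```python
import io
import os
import shutil
import tarfile
import tempfile
from typing import Iterator, Tuple


def iter_child_files(archive_path: str) -> Iterator[Tuple[str, str]]:
    """Lazily yield (entry_name, temp_path) for each regular file entry."""
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            source = archive.extractfile(member)
            fd, temp_path = tempfile.mkstemp(suffix=".child")
            try:
                with os.fdopen(fd, "wb") as out:
                    shutil.copyfileobj(source, out)  # stream, do not slurp
                yield member.name, temp_path
            finally:
                # Runs when the consumer advances or abandons the iteration.
                if os.path.exists(temp_path):
                    os.remove(temp_path)


def demo():
    """Build a two-entry archive, walk it, and report any leftover temps."""
    workdir = tempfile.mkdtemp()
    archive_path = os.path.join(workdir, "sample.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        for name, body in (("a.txt", b"alpha"), ("b.txt", b"beta")):
            info = tarfile.TarInfo(name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))

    seen, temp_paths = [], []
    for name, temp_path in iter_child_files(archive_path):
        with open(temp_path, "rb") as handle:
            seen.append((name, handle.read()))
        temp_paths.append(temp_path)

    leftovers = [p for p in temp_paths if os.path.exists(p)]
    shutil.rmtree(workdir)
    return seen, leftovers
```

The trade-off is that the archive stays open for as long as the consumer keeps iterating, so this helps with memory and temp file buildup but does nothing about the long-lived archive handle from item 1.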
In short, there are a million things happening with beagle. I just chose to blog this one today; maybe I’ll mention some of the other cool stuff tomorrow.