Monitoring Engine Update
Posted on Saturday, May 08, 2010 by Ian Drake
I'm in the process of testing changes to our monitoring engine, specifically the NotifyWire search engine. You see, when monitoring Craigslist for your searches, NotifyWire doesn't actually run each individual search on Craigslist. Why? Well, two reasons. First, it would require a lot more hits against Craigslist's servers, and second, it would actually be slower (there's a long story there).
Instead, NotifyWire just gets all the posts and runs your searches using our own search engine. This search engine was designed to mimic the engine used by Craigslist, so the results should match. I thought it was all working OK, but I have had some reports recently about missing alerts, which I take very seriously.
I've probably spent 40-50 hours this week working on a test harness and fixing issues. The test harness takes real searches from NotifyWire users and runs each one against Craigslist. I then run the results Craigslist returned through the NotifyWire search engine, expecting each one to evaluate to a positive match. However, on the first try I calculated a 10% mismatch. I also created a test for false positives that generated a 2% mismatch. Obviously this was unacceptable, so I got to work.
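To make the measurement concrete, here is a minimal sketch in Python of how a mismatch rate like this could be computed. It is not the actual NotifyWire harness: the names `mismatch_rate` and `naive_match` are hypothetical, the sample posts are made up, and the matcher is a deliberately simple stand-in that requires every query word to appear as a whole word.

```python
from typing import Callable, Iterable


def mismatch_rate(
    posts: Iterable[str],
    query: str,
    local_match: Callable[[str, str], bool],
    expect_match: bool,
) -> float:
    """Fraction of posts where the local engine disagrees with the expected outcome.

    With expect_match=True the posts are ones Craigslist returned, so every
    disagreement is a false negative. With expect_match=False the posts are
    ones Craigslist did not return, so every disagreement is a false positive.
    """
    posts = list(posts)
    if not posts:
        return 0.0
    misses = sum(1 for post in posts if local_match(query, post) != expect_match)
    return misses / len(posts)


def naive_match(query: str, post: str) -> bool:
    """Hypothetical stand-in matcher: every query word must appear in the post."""
    words = post.lower().split()
    return all(term in words for term in query.lower().split())


# Made-up posts standing in for what Craigslist returned for "red bike".
returned = ["Red bike for sale", "bike, red, barely used", "RED mountain bike"]
false_negative_rate = mismatch_rate(returned, "red bike", naive_match, expect_match=True)
```

In this toy data the second post is missed because the naive matcher does not strip punctuation, which is the kind of discrepancy a harness like this is meant to surface.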
The results were so far off that I'm now wondering if Craigslist had changed their search engine, because it was matching results in a much more flexible manner and some logical conditions that were broken now seemed to be fixed. For instance, you couldn't group negative terms together in parentheses before, but that works now. Here's another interesting tidbit - Craigslist displays ads as they were written, but behind the scenes indexes an automatically spell-checked version of the ad. I found multiple instances where Craigslist returned results that didn't have a required term properly spelled, so I'm thinking they don't do spell checking on the fly for each search, but instead index the spell-checked text with all HTML tags removed.
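If the index really is built from tag-free text, a local engine would want to strip HTML the same way before matching. Here is a minimal sketch using only Python's standard library. The helper name `strip_tags` is my own, and this is a guess at the general idea, not a description of what Craigslist or NotifyWire actually does.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space so words separated only by a tag
    # (for example "one<br>two") do not run together, then collapse whitespace.
    return " ".join(" ".join(parser.parts).split())


plain = strip_tags("<p>Red <b>bike</b> &amp; helmet</p>")
```

One trade-off of joining with a space is that a word split by inline markup, such as `bi<b>ke</b>`, comes out as two words. Whether that matters depends on how the ads are usually written.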
I tried to run an open source spell checker against the ads before searching them, but the results were mixed. Out of ten cases found, the spell checker only got the right word once. For the amount of processing required, it wasn't worth it, so I pulled that out. In the end, the NotifyWire search engine now has a <0.1% false negative rate and a 0% false positive rate. The remaining false negatives (should have matched but didn't) are due to spelling and the fact that one user's job search was based on the word "compensation", which appears at the very bottom of a job ad, outside of the user-provided content. Craigslist only provides the user-generated content in their RSS, so the text is never found, but Craigslist includes it in their own search index.
I'm really happy with the results of re-writing the NotifyWire search engine. I expect to release the new version on Sunday. Also in this new version will be navigation hotkeys, something that's been missing for far too long and really makes the NotifyWire application even more fun to use.
