Tokenisation

Taking a string and separating it into tokens is one of those smaller problems in search that seems simple at first - split on spaces - but is quickly overwhelmed by edge cases. Setting aside other languages, some of which don’t put spaces between words at all, the exceptions tend to fall into two categories: punctuation and normalisation.
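As a rough illustration of those two categories, here is a minimal Python sketch that compares plain splitting on spaces with a tokeniser that also strips punctuation and normalises case. The function names and the regular expression are assumptions made for this example, not the approach taken in the full post, and the sketch only targets English-like text.

```python
import re
import unicodedata


def naive_tokenise(text):
    """Split on whitespace only; punctuation stays attached to the words."""
    return text.split()


def tokenise(text):
    """Hypothetical tokeniser handling basic punctuation and normalisation."""
    # Normalisation: fold Unicode compatibility forms, then lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Punctuation: keep runs of word characters, allowing internal
    # apostrophes (straight or curly) so contractions stay in one piece.
    return re.findall(r"\w+(?:['’]\w+)*", text)


if __name__ == "__main__":
    sample = "Don't panic, it's PHP!"
    print(naive_tokenise(sample))  # ["Don't", 'panic,', "it's", 'PHP!']
    print(tokenise(sample))        # ["don't", 'panic', "it's", 'php']
```

Even this version makes choices a real search index might not want: hyphenated words and URLs are split apart, and lowercasing throws away the difference between a name and a common word.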


Block Based External Sort

Memory isn’t something that we have to worry about very much in PHP, as memory management is handled for us by the Zend engine. However, when it does become an issue it becomes a very big one - most PHP scripts are limited in how much memory they can consume. While this makes a lot of sense for web processes, and is in general not a problem, it can make life difficult when you have a lot of data to deal with.


How To Use Your Business Cards

I got some new business cards from work the other day, and they came in the box direct from the printer, which, along with the usual ad for the printer itself, included an instruction manual for the cards. Admittedly much of the advice involves giving out as many business cards as possible, something the printer might be expected to encourage, but there were a couple of tips I wouldn’t have guessed. Some choice examples:


Simple Search: Boolean Retrieval

If you asked most people how a search engine worked, their answer would likely be a far cry from the acres of servers and vast collections that Google queries millions of times a day. That said, the intuitive view of a search engine is in many ways just a series of incremental steps away from Mountain View.


Westwood

I am a fan of the truly odd Radio 1 DJ and Pimp My Ride UK presenter Tim Westwood, and this has only been enhanced by the wonderful glimpse of his life you get from his Twitter. I’ve collected some of my favourite moments:
