Mostly for my own benefit, since I keep needing this snippet of code, I thought I’d put a Xapian More Like This/Find Similar example in a very brief blog post.
The code is stolen out of my own Habari MultiSearch plugin:
The code is very simple once you get beyond the clunkiness of the Xapian API. We create a relevance set (RSet) containing the document we want to find similar ones to (that’s the one with id = $search_id), and from that ask for an expand set (ESet) of its most important terms (Xapian does the heavy lifting here). We then build a new search query out of those terms and run it as a regular query, remembering to discard our original document from the search results. Not the slickest solution, but it works!
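To make the flow concrete without needing Xapian installed, here is a toy sketch of the same three steps in plain Python. It is not the plugin code and not Xapian's API: the `top_terms` and `find_similar` helpers, the TF-IDF scoring and the tiny `DOCS` corpus are all made up for illustration, with `top_terms` standing in for whatever Xapian does when it builds the eset.

```python
import math
from collections import Counter


def top_terms(docs, doc_id, max_terms=5):
    """Pick the most distinctive terms of one document.

    Hypothetical stand-in for the eset step: a plain TF-IDF ranking,
    with ties broken alphabetically so the result is deterministic.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for words in docs.values() for t in set(words))
    tf = Counter(docs[doc_id])
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


def find_similar(docs, doc_id, max_terms=5):
    """OR the top terms together, search, and drop the source document."""
    terms = set(top_terms(docs, doc_id, max_terms))
    hits = []
    for other_id, words in docs.items():
        if other_id == doc_id:
            continue  # discard the document we started from
        overlap = len(terms & set(words))
        if overlap:  # OR semantics: any shared term is a match
            hits.append((other_id, overlap))
    # More shared terms first, then by id for a stable order.
    hits.sort(key=lambda h: (-h[1], h[0]))
    return [doc for doc, _ in hits]


# Made-up corpus, keyed by document id.
DOCS = {
    1: "xapian search index query terms".split(),
    2: "xapian query parser terms".split(),
    3: "cooking pasta with tomato sauce".split(),
    4: "search index tuning".split(),
}

similar_to_first = find_similar(DOCS, 1)
```

Here `similar_to_first` holds the ids of the documents sharing terms with document 1, best match first, and never includes document 1 itself. Note that every selected term counts equally in the match, which is exactly the limitation discussed next.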
The basic idea is pretty much the same in most MLT implementations. What we lose, because of the way we add the terms to the query, is any notion of weight: that term N is more important to the document than term N+1. Some implementations let you control whether those weights have any effect. In Solr, for example, mlt.boost includes or discards the weighting depending on whether it is on or off.
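A small sketch of what dropping the weights can cost, again in plain Python rather than Xapian or Solr. The `score` and `rank` helpers, the `use_weights` switch and the numbers are all invented for the example; the switch is only loosely analogous to turning mlt.boost on or off and does not reproduce Solr's actual scoring.

```python
def score(doc_words, weighted_terms, use_weights):
    """Sum the contribution of each query term present in the document.

    With use_weights=False every matching term counts as 1.0, which
    mirrors a flat OR query built from the expanded terms.
    """
    words = set(doc_words)
    return sum(
        (weight if use_weights else 1.0)
        for term, weight in weighted_terms
        if term in words
    )


def rank(docs, weighted_terms, use_weights):
    """Return ids of matching documents, highest score first."""
    scored = [
        (doc_id, score(words, weighted_terms, use_weights))
        for doc_id, words in docs.items()
    ]
    scored = [(doc_id, s) for doc_id, s in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in scored]


# Hypothetical expanded terms: "xapian" matters far more than the rest.
TERMS = [("xapian", 5.0), ("query", 1.0), ("index", 1.0)]

# Hypothetical candidates: "a" matches two minor terms, "b" only the key one.
CANDIDATES = {
    "a": ["query", "index", "tuning"],
    "b": ["xapian", "bindings"],
}

flat_order = rank(CANDIDATES, TERMS, use_weights=False)
boosted_order = rank(CANDIDATES, TERMS, use_weights=True)
```

Without weights, "a" wins because it matches two terms; with weights, "b" wins because it matches the one term that matters most. That reordering is the information the flat query throws away.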