Stemming in Zend Search Lucene
One of the hardest parts of building a search engine, whether it is a small search for a single website or something as large as a web search engine, is tuning it to return as many relevant results for a user's query as possible. The best way to do this with Lucene and Zend_Search_Lucene is to use analyzers correctly, both when indexing and when querying an index.
Analyzers prepare and normalize text to be indexed, and they are also used to parse queries. For example, unless you want your search engine to be case sensitive, your analyzer should have a filter that lowercases everything as it goes into the index; the same analyzer must then be applied to the query so that its terms are lowercased too and match the terms in the index. That way a search for "query" will match "Query", "qUery", "query", and so on, and a search for "Query" will match the same.
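To make the point about using the same normalization on both sides concrete, here is a minimal sketch in plain PHP. It does not use Zend_Search_Lucene at all; the function names (normalizeTerms, buildIndex, searchIndex) and the sample documents are made up for illustration, and a real analyzer does considerably more than this.

```php
<?php
// Hypothetical helper: lowercase the text and split it into word terms.
// This stands in for an analyzer with a lowercase filter.
function normalizeTerms(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Build a toy inverted index: term => list of document ids.
function buildIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $id => $text) {
        foreach (array_unique(normalizeTerms($text)) as $term) {
            $index[$term][] = $id;
        }
    }
    return $index;
}

// Run the query through the SAME normalization before looking terms up.
function searchIndex(array $index, string $query): array
{
    $matches = [];
    foreach (normalizeTerms($query) as $term) {
        foreach ($index[$term] ?? [] as $id) {
            $matches[$id] = true;
        }
    }
    return array_keys($matches);
}

$index = buildIndex([
    'a.html' => 'Parsing a Query',
    'b.html' => 'Nothing relevant here',
]);

// Any casing of the term finds the same document.
var_dump(searchIndex($index, 'qUery') === searchIndex($index, 'query'));
```

If the query skipped normalizeTerms, a search for "Query" would look up the key "Query", which never exists in the index because everything was lowercased on the way in.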
Now we come to the subject of stemming. Stemming is the act of reducing related words to the same "stem". For example, "nationalize", "nationalization" and "nations" should all stem to "nation". Why is this important? Say you are searching for "ipods": without stemming, your search engine will match only "ipods" and miss the singular "ipod". With stemming, both forms reduce to the same term, so you get more of the relevant results. A stemmer doesn't actually have to return a term that is spelled correctly. For example, "ponies" and "pony" both stem to "poni". The important thing is that all related words stem to the same thing, spelled right or not.
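To give a feel for what a stemmer does, here is a rough sketch in plain PHP that implements only two small rules from the Porter algorithm: step 1a (plural endings) and step 1c (a final "y" becomes "i"). It is not the full Porter Stemmer and not the filter described in this post; the function names are invented for this example, and the lowercasing is added here for convenience rather than being part of the algorithm.

```php
<?php
// Porter's definition: a consonant is any letter other than a, e, i, o, u,
// and other than a "y" that is preceded by a consonant.
function isPorterConsonant(string $word, int $i): bool
{
    $ch = $word[$i];
    if (strpos('aeiou', $ch) !== false) {
        return false;
    }
    if ($ch === 'y') {
        // "y" counts as a consonant at the start of a word or after a vowel.
        return $i === 0 || !isPorterConsonant($word, $i - 1);
    }
    return true;
}

function containsPorterVowel(string $stem): bool
{
    for ($i = 0, $n = strlen($stem); $i < $n; $i++) {
        if (!isPorterConsonant($stem, $i)) {
            return true;
        }
    }
    return false;
}

// Toy stemmer: ONLY steps 1a and 1c of the Porter algorithm.
function toyStem(string $word): string
{
    $word = strtolower($word); // convenience; normally the analyzer does this

    // Step 1a: plural endings.
    if (substr($word, -4) === 'sses') {
        $word = substr($word, 0, -2);   // caresses -> caress
    } elseif (substr($word, -3) === 'ies') {
        $word = substr($word, 0, -2);   // ponies -> poni
    } elseif (substr($word, -2) === 'ss') {
        // caress stays caress
    } elseif (substr($word, -1) === 's') {
        $word = substr($word, 0, -1);   // ipods -> ipod
    }

    // Step 1c: final "y" becomes "i" if the part before it contains a vowel.
    if (substr($word, -1) === 'y' && containsPorterVowel(substr($word, 0, -1))) {
        $word = substr($word, 0, -1) . 'i';   // pony -> poni
    }

    return $word;
}

foreach (['Ipods', 'ipod', 'ponies', 'pony', 'nations'] as $w) {
    echo $w, ' => ', toyStem($w), PHP_EOL;
}
```

Words such as "nationalization" need the later steps of the algorithm, which this sketch leaves out on purpose.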
Stemming algorithms differ from language to language. An English stemmer will NOT work for German or any other language. The most widely used stemming algorithm for English is the Porter Stemmer, created by Martin Porter. You can read about the Porter Stemmer here. I have ported the Porter Stemmer from the Java Lucene project for use with Zend Search Lucene. You can download it from my downloads section or directly here. This is pretty much a direct port, so if you find some optimizations, feel free to make them, and send me patches so I can update the package.
And now some super simple example code of the stemming filter in action:
<?php
class Search
{
    private $_index;

    public function __construct()
    {
        $this->_index = Zend_Search_Lucene::open('/path/to/index');

        // Case-insensitive analyzer plus the stemming filter, set as the
        // default so it is used for both indexing and querying.
        $analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive();
        $analyzer->addFilter(new CodeAngel_PorterStemmerFilter());
        Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
    }

    public function find($query)
    {
        return $this->_index->find($query);
    }

    public function addDocument($link, $content)
    {
        $doc = new Zend_Search_Lucene_Document();
        // 'link' is stored as-is; 'contents' is analyzed and indexed but not stored.
        $doc->addField(Zend_Search_Lucene_Field::Keyword('link', $link));
        $doc->addField(Zend_Search_Lucene_Field::Unstored('contents', $content));
        $this->_index->addDocument($doc);
        $this->_index->commit();
    }
}
?>
Using this class is as easy as 1-2-3:
<?php
$searcher = new Search();

// index a document
$searcher->addDocument('some.html', 'Hello, how about some ipods and some nationalization');

// search
$hits = $searcher->find('Ipod');
?>