CodeAngel.org
Faith in Knowledge.
Stemming in Zend Search Lucene
One of the most difficult parts of building a search engine, whether it is a small search for a single website or something as large as a web search engine, is tuning it to return as many relevant results as possible for a user's query. With Lucene and Zend_Search_Lucene, the best way to do this is to use analyzers correctly, both when indexing and when querying an index.
Analyzers prepare and normalize text to be indexed, and they are also used to parse queries. For example, unless you want your search engine to be case sensitive, your analyzer needs a filter that lowercases everything as it goes into the index. That same analyzer must then be applied to the query, so that its terms are lowercased too and can match the terms in the index. That way a search for "query" will match "Query", "qUery", "query", and so on, and a search for "Query" will match the same.
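To make the "same analyzer on both sides" idea concrete, here is a minimal sketch in plain Python. It is not the Zend_Search_Lucene API; the names `analyze`, `build_index`, and `search` are made up for illustration, and the "index" is just a dictionary mapping terms to document ids.

```python
import re
from collections import defaultdict

def analyze(text):
    """Split text into word tokens and lowercase each one."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(documents):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Run the query through the same analyzer, then return the ids
    of documents that contain every query term."""
    terms = analyze(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample documents, used only for this illustration.
documents = {
    1: "Parsing a Query in Lucene",
    2: "A qUery about ponies",
    3: "Nothing relevant here",
}
index = build_index(documents)
```

Because both the documents and the query pass through `analyze`, `search(index, "query")` and `search(index, "Query")` return the same set of documents. If the query skipped the analyzer, "Query" would never match the lowercased terms stored in the index.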
Now we come to the subject of stemming. Stemming is the process of reducing related words to the same "stem". For example, "nationalize", "nationalization", and "nations" should all stem to "nation". Why is this important? Say you are searching for "iPods". Without stemming, your search engine will match only "ipods" and not the singular "ipod". With stemming, both forms reduce to the same term, so you get more relevant results. As with lowercasing, the stemmer has to be applied both when indexing and when querying. A stemmer doesn't actually have to return a correctly spelled word. For example, "ponies" and "pony" both stem to "poni". The important thing is that all related words stem to the same term, spelled right or not.
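The examples above can be reproduced with a very small suffix-stripping function. This is a toy sketch in Python, not a real stemming algorithm and not anything shipped with Zend_Search_Lucene; the rule table and the `MIN_STEM_LENGTH` cutoff are assumptions chosen only so the words in this post behave as described. A real stemmer has far more rules and conditions.

```python
# Toy suffix rules, tried in order with the longest suffixes first.
# These are illustrative assumptions, not the rules of any real stemmer.
SUFFIX_RULES = [
    ("alization", ""),
    ("alize", ""),
    ("ies", "i"),
    ("y", "i"),
    ("s", ""),
]

# Do not strip a suffix if fewer than this many characters would remain.
MIN_STEM_LENGTH = 3

def stem(word):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        base = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(base) >= MIN_STEM_LENGTH:
            return base + replacement
    return word

for w in ["nationalize", "nationalization", "nations", "Ipods", "ipod", "ponies", "pony"]:
    print(w, "->", stem(w))
```

Note that "poni" is not a word, and that does not matter: a search for "pony" is stemmed to "poni" before it is looked up, and so was every "ponies" that went into the index.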