CodeAngel.org
Faith in Knowledge.
Stemming in Zend Search Lucene
One of the most difficult parts of making a search engine, whether it is a small search for a single website or something as large as a web search engine, is tuning it to return as many relevant results for a user's query as possible. The best way to tune this with Lucene and Zend_Search_Lucene is the correct use of analyzers, both when indexing and when querying an index.
Analyzers prepare and normalize text to be indexed and are also used to parse queries. For example, unless you want your search engine to be case sensitive, you would want your analyzer to have a filter that lowercases everything as it goes into the index. That same analyzer needs to be applied to the query, lowercasing its terms so that they match the terms in the index. That way a search for "query" will match "Query", "qUery", "query", and so on, and a search for "Query" will match the same.
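To make the "same analyzer on both sides" idea concrete, here is a minimal sketch in Python. It is not the Zend_Search_Lucene API; the names `analyze`, `build_index`, and `search` are made up for illustration, and the index is just an in-memory dictionary.

```python
import re
from collections import defaultdict


def analyze(text):
    """Hypothetical analyzer: split text into word tokens and lowercase each one."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(documents):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Run the query through the same analyzer, then return the documents
    that contain every query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Sample documents, invented for this example.
documents = {
    1: "Query parsing in Lucene",
    2: "A qUery about analyzers",
    3: "An unrelated post",
}
index = build_index(documents)
```

Because `search` lowercases the query with the same `analyze` function used at indexing time, `search(index, "Query")` and `search(index, "query")` return the same documents. If the query skipped the analyzer, "Query" would never match the lowercased terms stored in the index.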
Now we come to the subject of stemming. Stemming is the act of reducing related words to the same "stem". For example, "nationalize", "nationalization", and "nations" should all stem to "nation". Why is this important? Say you are searching for "iPods". If you aren't making use of stemming, your search engine will match only "ipods", but not the singular "ipod". By making use of stemming you will get more relevant results. Stemming doesn't actually have to return a term that is spelled correctly. For example, "ponies" and "pony" both stem to "poni". The important thing is that all related words stem to the same thing, spelled right or not.
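The effect is easy to see with a toy example. The sketch below is a deliberately simplified suffix stripper written in Python, not the Porter algorithm and not anything shipped with Zend_Search_Lucene. The function name `toy_stem` and its handful of rules are assumptions chosen only to cover plurals like the ones mentioned above.

```python
def toy_stem(word):
    """Toy stemmer: handles a few plural and "y" endings only.

    A real stemmer such as Porter's applies many more rules in several steps.
    """
    word = word.lower()
    if word.endswith("ies"):
        # ponies -> poni
        return word[:-3] + "i"
    if word.endswith("ss"):
        # Leave words like "glass" alone.
        return word
    if word.endswith("s") and len(word) > 3:
        # ipods -> ipod
        word = word[:-1]
    if word.endswith("y") and len(word) > 2 and any(c in "aeiou" for c in word[:-1]):
        # pony -> poni, so it lands on the same stem as "ponies"
        return word[:-1] + "i"
    return word


def matches(query, indexed_word):
    """Two words match when they reduce to the same stem."""
    return toy_stem(query) == toy_stem(indexed_word)
```

With this in place, `matches("iPods", "ipod")` and `matches("ponies", "pony")` are both true, even though "poni" is not a real word. As with lowercasing, the stemmer has to be applied both to the text being indexed and to the query terms.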
Codeangel 0.5
Well, it's been a long time since I've done an update, mostly because my interests have drifted to video game making, AI, and guitars. I'll probably have posts relating to those soon. For now, though, I've finally finished a big update!
Codeangel 0.5 has the following new features:
- Trackback and Pingback support (both receiving and sending).
- Feeds are reimplemented. Also notice that the syndication symbol under each article links to that article's comment feed.
- Reimplementation of the search engine. Articles are searched with better granularity, and comments are now indexed as well. I also wrote a Porter Stemmer, so searches should return more relevant and more complete results. I'll post more about this Porter Stemmer and release it soon.
You probably won't notice much in my next planned release: splitting the admin and the main part of the site into modules. I'm also going to make a better article manager, and then add published/unpublished article support. After I get done with that, I will implement skin support using Zend_Layout. The big thing you'll notice then is a new look for codeangel.org!