Java Answers

  Question Asked By: Fahimah Khan   on Jun 20 In Java Category.

  
Question Answered By: Cadencia Bernard   on Jun 20

As you know, most people use Lucene to search through HTML pages or documents, so the "content" is important. But the standard analyzer cannot do stemming or lemmatization for Persian words.
As Arash mentioned, I implemented that Lucene project for the IR course at Sharif Univ. for one of the students there.
My implementation just removes common words (like "Beh", "Taa", "Va", ...), but it cannot reduce words to their root form (e.g. removing the "Haa" suffix from the plural form of a word, ...).
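To illustrate what "just removes common words" means, here is a minimal sketch in plain Java. It is not the original project's code and does not use Lucene's analyzer API; the class name `StopWordFilter`, the method `tokenize`, and the transliterated three-word stop list are assumptions made for the example (a real list would be in Persian script and much longer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    // Hypothetical stop-word list, transliterated as in the post above.
    private static final Set<String> STOP_WORDS = Set.of("beh", "taa", "va");

    // Lowercases the text, splits it on whitespace, and drops stop words.
    // No stemming is done: "ketaabhaa" stays "ketaabhaa".
    static List<String> tokenize(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [ketaab, ketaabhaa]: "va" is removed, the plural is left untouched.
        System.out.println(tokenize("Ketaab va Ketaabhaa"));
    }
}
```

Note that the singular and the plural survive as two different terms, which is exactly the limitation described above.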
As far as I remember, none of the students implemented better stemming for that project. Decent stemming is not easy, and you may need a customized Persian dictionary.
However, indexing without stemming may be sufficient for some needs. In that case, for example, if you search for the Persian word "Ketaab", documents that contain only "Ketaabhaa" will not appear in your search results.
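The "Ketaab" / "Ketaabhaa" case can be reproduced with a small plain-Java sketch. This is only an illustration under stated assumptions: exact term matching stands in for an index lookup, and `stripPlural` is a hypothetical, deliberately naive rule that drops a trailing "haa". It is not a real Persian stemmer and will wrongly shorten any word that merely happens to end in "haa", which is why a dictionary may be needed for decent results.

```java
import java.util.Locale;

public class PluralMatchDemo {
    // Naive, hypothetical rule: drop a trailing "haa" if something remains.
    static String stripPlural(String term) {
        if (term.length() > 3 && term.endsWith("haa")) {
            return term.substring(0, term.length() - 3);
        }
        return term;
    }

    // Lowercases a term and optionally applies the naive plural rule.
    static String normalize(String term, boolean stem) {
        String lower = term.toLowerCase(Locale.ROOT);
        return stem ? stripPlural(lower) : lower;
    }

    // True if any whitespace-separated token of the document equals the
    // query after normalization (a stand-in for an exact term lookup).
    static boolean matches(String query, String document, boolean stem) {
        String q = normalize(query, stem);
        for (String token : document.split("\\s+")) {
            if (!token.isEmpty() && normalize(token, stem).equals(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("Ketaab", "Ketaabhaa", false)); // false
        System.out.println(matches("Ketaab", "Ketaabhaa", true));  // true
    }
}
```

Whatever rule is used, it has to be applied both when indexing and when querying, otherwise the terms will not line up.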
