Using lucene in Farsi

  Asked By: Fahimah    Date: Jun 20    Category: Java    Views: 1712
  

I am planning to use Lucene as a search engine for Farsi content. It
seems that Lucene does not support Farsi out of the box and needs a Farsi
language analyzer to do so. I could not find such an analyzer
anywhere on the web.
Any ideas or experience in this regard?


 

13 Answers Found

 
Answer #1    Answered By: Flynn Jones     Answered On: Jun 20

I couldn't find a non-commercial Persian analyzer either, and I never had enough time to think about developing one. Somebody has developed an Arabic version, though, which might inspire you to develop a Persian one. You can find the Arabic Lucene analyzer here.

 
Answer #2    Answered By: Kanchan Ap     Answered On: Jun 20

You can use the AraMorph package for Arabic analysis, which also supports Farsi.

 
Answer #3    Answered By: Haya Yoshida     Answered On: Jun 20

Yes, there is one.
It was once given as homework to students at Sharif University, and some of them have put their solutions on the web.
If you search for it, it may still be available.

 
Answer #4    Answered By: Geneva Morris     Answered On: Jun 20

I could not find any such project. People have only mentioned that such a
project exists.
If you have access to an implementation, please let me know.

 
Answer #5    Answered By: Anuja Shah     Answered On: Jun 20

We used this analyzer at the Iran Rayaneh company, and it worked.
I don't work at Iran Rayaneh any more, but you can go there and ask for the open source Lucene analyzer they use.
It looks like there aren't any active Sharif students on this list.

 
Answer #6    Answered By: Muntasir Bashara     Answered On: Jun 20

The analyzer is secondary. With the standard analyzer you can still index UTF content; however, your search results won't be accurate if you want to index "content". On the other hand, if you need to index "data", the standard analyzer still works well for Farsi. I wrote the code for this three years ago, but I couldn't find it.
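To illustrate why language-neutral tokenization is enough for exact-match "data" fields, here is a minimal sketch in plain Java. It is not Lucene's StandardAnalyzer or its tokenizer; the class name `UnicodeTokenizerSketch` and the sample phrase are assumptions made up for this example. It only shows that splitting on non-letter characters already yields usable tokens for Persian script.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Lucene: a language-neutral tokenizer.
public class UnicodeTokenizerSketch {

    // Splits text into lowercase tokens at any code point that is not a letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative phrase "ketaab va daftar" in Persian script, written as Unicode escapes.
        String text = "\u06A9\u062A\u0627\u0628 \u0648 \u062F\u0641\u062A\u0631";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

A sketch like this treats every token as an opaque string, so it can only support exact matches; it knows nothing about Persian stop words or word forms.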

 
Answer #7    Answered By: Cadencia Bernard     Answered On: Jun 20

As you know, most people use Lucene for searching through HTML pages or documents, so the "content" is important. But the standard analyzer cannot do stemming or lemmatization for Persian words.
As Arash mentioned, I implemented that Lucene project for the IR course at Sharif Univ. for one of the students there.
My implementation just removes common words (like "Beh", "Taa", "Va", ...), but cannot reduce words to their root form (e.g. removing the "Haa" from the plural form of words, ...).
As I remember, none of the students implemented better stemming for that project. Decent stemming is not easy, and you may need a customized Persian dictionary.
However, indexing without this stemming may be sufficient for some needs. Without it, for example, if you search for the Persian word "Ketaab", the documents that contain only "Ketaabhaa" will not be in your search results.
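The "Ketaab" versus "Ketaabhaa" case can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: it uses the transliterated forms from the answer above in place of Persian script, the class and method names are invented for the example, and the suffix rule is deliberately naive. It is not the stemmer from the university project and not a Lucene API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: exact matching versus a naive plural-stripping rule.
public class PluralSuffixSketch {

    // Naive assumption: a trailing "haa" is always the plural marker.
    public static String stripPlural(String token) {
        String suffix = "haa";
        if (token.length() > suffix.length() && token.endsWith(suffix)) {
            return token.substring(0, token.length() - suffix.length());
        }
        return token;
    }

    // Exact-match lookup, which is what an index without stemming provides.
    public static boolean matchesExact(List<String> docTokens, String query) {
        return docTokens.contains(query);
    }

    // Lookup after applying the same naive rule to both the query and the document tokens.
    public static boolean matchesStemmed(List<String> docTokens, String query) {
        String stemmedQuery = stripPlural(query);
        for (String token : docTokens) {
            if (stripPlural(token).equals(stemmedQuery)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("ketaabhaa");
        System.out.println(matchesExact(doc, "ketaab"));   // the plural-only document is missed
        System.out.println(matchesStemmed(doc, "ketaab")); // the naive rule finds it
    }
}
```

A rule this simple would also strip "haa" from any word that merely happens to end in those letters, which is one reason decent stemming may need a dictionary.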

 
Answer #8    Answered By: Patty Freeman     Answered On: Jun 20

He has written some quality code for me before.
Although we haven't used his code from that project, his solution looks like a good one.
So I strongly recommend using his work!

 
Answer #9    Answered By: Johnathan Nelson     Answered On: Jun 20

By the way, I had little time to do anything about stemming back then, so the main change was replacing the English stop word set with Persian stop words. You can find and replace this set in the StandardAnalyzer class, but that is just a hasty solution (it changes the code of Lucene itself). You could find a better way to do this (e.g. by passing the stop words as a parameter to the StandardAnalyzer constructor).
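The design idea of passing the stop words in, rather than editing the library source, can be sketched without Lucene at all. The class below is a hypothetical stand-in written for this example; Lucene's actual constructor signatures depend on the version, so check the documentation for the release you use. The transliterated stop words come from the earlier answer, and "daftar" is just an illustrative extra word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not a Lucene class: a filter whose stop words are injected.
public class StopWordFilterSketch {

    private final Set<String> stopWords;

    // The stop word set is supplied by the caller instead of being hard-coded.
    public StopWordFilterSketch(Set<String> stopWords) {
        this.stopWords = new HashSet<>(stopWords);
    }

    // Returns the tokens that are not in the configured stop word set, in order.
    public List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Transliterated Persian stop words standing in for Persian script.
        Set<String> persian = new HashSet<>(Arrays.asList("beh", "taa", "va"));
        Set<String> english = new HashSet<>(Arrays.asList("the", "and", "to"));

        StopWordFilterSketch persianFilter = new StopWordFilterSketch(persian);
        StopWordFilterSketch englishFilter = new StopWordFilterSketch(english);

        System.out.println(persianFilter.filter(Arrays.asList("ketaab", "va", "daftar")));
        System.out.println(englishFilter.filter(Arrays.asList("the", "book", "and", "pen")));
    }
}
```

Because the set is a constructor argument, the same code serves English and Persian documents, and nothing ties you to a patched copy of one library version.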

 
Answer #10    Answered By: Horia Ahmed     Answered On: Jun 20

A decent Farsi analyzer for Lucene is definitely a very useful thing. It's essential for making any web site developed in Java Farsi-compatible. Unfortunately, the spirit of open source is not prominent in Iran at all. We are a secretive society and we love hiding our code! ;-) Why is there no site for sharing tricks for making this or that framework or toolkit Farsi-compatible? Why is this analyzer code not open source? Why do universities hide the results of their projects? It's the 21st century, for god's sake! I think it's fair to say that such activities are more important than some obscure RoboCup tournament!!

We need to encourage code/knowledge sharing.

I would like to see some people volunteer to set up a Persian wiki or something similar for sharing Java code and knowledge. This may need some small funds too, which are definitely obtainable from Sharif, Mokhaberat, or some other source. On that wiki there could be links to projects hosted elsewhere, such as SourceForge or java.net. I'm not advocating creating a persianforge.net, though! I know java.net is censored, but setting up such an alternative is way too much work.

 
Answer #11    Answered By: Sophie Williamson     Answered On: Jun 20

That is a very good idea.
The java.net wiki is very limited.
Recently the JavaFX team created a wiki on the Wikia site for their own project.
I have used it and it is very useful.
I am willing to create and maintain a page for Farsi localization issues,
with links to Farsi localization projects like:
calendar

 
Answer #12    Answered By: Hattie Howard     Answered On: Jun 20

I agree with you that the spirit of open source is not prominent in Iran. In fact, here, OPEN is forbidden in every area!

But in this case, that university project was one of the three projects of the Information Retrieval course that semester, and as I said, no one developed a decent analyzer for it. My code for that analyzer has no comments, and it took me a long time to remember what I had done to the Lucene source code. Moreover, changing the Lucene source code ties you to that version, and with those changes it will no longer work for English documents.
I would not even use that code myself if I wanted to write a Persian analyzer; I would just use the idea that I mentioned in that email. So it is not worth publishing.

By the way, you can hardly expect good results from university projects assigned to B.Sc. students, because altogether they are worth only about 4 points of one course, alongside the other courses taken that semester. So if students share their time evenly, they can spend only a little time on them.

However, you should consider that the university is a place for research and science and is different from industry, so you may not see any sensible or valuable industrial outcome from universities. But they do train talented students, and many professionals in our industry graduated from these universities.

Achieving top rank in Robocup Tournaments is not easy and is a real honor. It shows that there are talented students in Iran. So major companies should show more interest in recruiting these students, to encourage them to research and work in industry and stay in Iran.

I know some students who work on projects that will be beneficial for industry; they will soon find a sponsor company and sell the results of their work. We should expect those companies to make the results open source, because they have the funds to do so and the tools to benefit from making the product open source.

I hope that your Persian wiki idea goes on.

 
Answer #13    Answered By: Adanalie Garcia     Answered On: Jun 20

As you correctly mentioned, it is more an economic and cultural problem than a technological one. I think there are already plenty of places where you can share your code.

Consider www.farsiweb.info as a case study. They were hosted at the Sharif Computing Center: very nice and clever guys with open source oriented minds. But what could they share?

A Jalali API, which has not been updated for years, a codepage DLL for Windows, and maybe one or two other minor things. Then they moved on to a commercial project.

I remember Arash was also trying to start an open source effort for a new portal project.

It is not a big deal to create a site for such a purpose, neither in terms of financial resources nor maintenance. But the problem is who will contribute.

 