Java Forum


Internationalization in Java

  Asked By: Tracey    Date: Jan 17    Category: Java    Views: 681

How can I parse a PDF file that contains Arabic text (as an image) and convert it into a text file?

The PDF file contains both English and Arabic. I was able to convert the English part into a text file, but I am not able to parse the Arabic part.

I have tried using the PDFBox Java library.
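
To confirm that the Arabic part is really missing from the extracted text (and not just displayed badly), one option is to count the letters in the output by Unicode script. The sketch below uses only the Java standard library. The class name, method name, and sample string are made up for illustration; the sample stands in for whatever text the PDF library returned.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

class ScriptCounter {

    /** Counts the letters in the text, grouped by Unicode script (LATIN, ARABIC, ...). */
    public static Map<UnicodeScript, Integer> countLetterScripts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)   // skip digits, spaces, punctuation
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample: an English word, an Arabic word, and a number.
        String extracted = "Invoice \u0641\u0627\u062A\u0648\u0631\u0629 2024";
        Map<UnicodeScript, Integer> counts = countLetterScripts(extracted);
        System.out.println("Latin letters:  " + counts.getOrDefault(UnicodeScript.LATIN, 0));
        System.out.println("Arabic letters: " + counts.getOrDefault(UnicodeScript.ARABIC, 0));
    }
}
```

If the Arabic count is zero for a page that visibly contains Arabic, the extractor found no Arabic text at all, which is what you would expect when that text is stored as an image.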



5 Answers Found

Answer #1    Answered By: Verner Fischer     Answered On: Jan 17

I have tried a lot of software for converting PDFs containing Persian text to plain text, but none of it works.
By the way, if you search Google for PDF files that contain a specific Persian word, it does not return accurate results either. However, I'm not sure about Arabic. So I guess this is an open challenge, at least for Persian.

Answer #2    Answered By: Luz Hayes     Answered On: Jan 17

I have no idea how to take care of the conversion you're looking for, but in general what you are trying to do is localization, not internationalization. I mention this only for the sake of the email title.

Answer #3    Answered By: Vidos Fischer     Answered On: Jan 17

Search Google for OCR programs that support Arabic and accept PDF as an input format.

Answer #4    Answered By: Hoor Khan     Answered On: Jan 17

There are two types of PDFs: application-generated and scanned.
OCR software is for the second type: it recognizes the text contained in an image and converts it to text.
But lots of PDF files (at least academic ones) are application-generated, and these are much easier to convert to text than scanned images.
As for Persian, though, I have found no accurate software to do this. Acrobat Reader Professional itself has a "Save As" (doc, rtf, txt) feature, but it does not work accurately for Persian. Even if you choose "Select Text" and copy the text, the characters are ruined when you paste it into Notepad.
Arabic characters are similar to Persian ones, and I guess that if they knew how to extract Arabic, they would add Persian support.


I have heard that OCR software for Arabic and Persian is expensive and inaccurate.
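
The posts above do not say why copied Persian or Arabic text comes out ruined. One possible cause, and this is an assumption rather than a diagnosis, is that the PDF stores each letter as a positional glyph from the Unicode Arabic Presentation Forms blocks instead of the base letter. If that is the case, NFKC normalization from the Java standard library maps those glyphs back to base letters. The class and method names below are hypothetical, and the sample is the word "salam" written as presentation-form glyphs.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class ArabicTextCleanup {

    /** True for code points in Arabic Presentation Forms-A or Forms-B. */
    public static boolean isPresentationForm(int cp) {
        return (cp >= 0xFB50 && cp <= 0xFDFF) || (cp >= 0xFE70 && cp <= 0xFEFC);
    }

    /** True if the text contains at least one presentation-form glyph. */
    public static boolean hasPresentationForms(String text) {
        return text.codePoints().anyMatch(ArabicTextCleanup::isPresentationForm);
    }

    /** Applies NFKC, which replaces presentation forms with their base letters. */
    public static String toBaseLetters(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /** Formats the text as a list of code points, to avoid console encoding issues. */
    static String hex(String text) {
        return text.codePoints()
                   .mapToObj(cp -> String.format("U+%04X", cp))
                   .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical extracted text: seen (initial), lam-alef ligature (final), meem (isolated).
        String extracted = "\uFEB3\uFEFC\uFEE1";
        if (hasPresentationForms(extracted)) {
            System.out.println("before: " + hex(extracted));
            System.out.println("after:  " + hex(toBaseLetters(extracted)));
        }
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example, Latin ligatures), and it does not fix text that was extracted in visual, right-to-left-reversed order. It does nothing for scanned pages either; those still need OCR.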

Answer #5    Answered By: Hugo Williams     Answered On: Jan 17

If you read the original post again, you may notice that the poster said the PDF contains the Arabic as an image.
