
Tamil has almost all of the issues mentioned above (see Ray Smith, Daria Antonova, and Dar-Shyang Lee, “Adapting the Tesseract open source OCR engine for multilingual OCR,” ACM, 2009). The Tesseract training documentation notes that “Tesseract 3.0 can handle any Unicode characters (coded with UTF-8), but there are limits as to the range of languages that it will be successful with”; that “Tesseract needs to know about different shapes of the same character by having different fonts separated explicitly”; and that “Any language that has different punctuation and numbers is going to be disadvantaged by some of the hard-coded algorithms that assume ASCII punctuation and digits.” Efforts have been made to modify the engine and its training system so that they can deal with other languages and UTF-8 characters.
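To make the quoted remarks about UTF-8 and ASCII-only digit handling concrete, here is a minimal Python sketch. It is not Tesseract code; the helper names (`utf8_len`, `is_ascii_digit`, `is_unicode_digit`) are hypothetical and exist only for illustration. It shows that a single Tamil syllable is made of several code points and several UTF-8 bytes, and that a check written for ASCII digits does not recognize a Tamil digit.

```python
import string
import unicodedata

# Hypothetical helpers for illustration only; none of this is Tesseract's API.

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def is_ascii_digit(ch: str) -> bool:
    """The kind of check an ASCII-only algorithm would perform."""
    return len(ch) == 1 and ch in string.digits

def is_unicode_digit(ch: str) -> bool:
    """Accept any character in the Unicode category 'Nd' (decimal digit)."""
    return len(ch) == 1 and unicodedata.category(ch) == "Nd"

# Tamil syllable "ki": letter KA (U+0B95) followed by vowel sign I (U+0BBF).
syllable = "\u0b95\u0bbf"
# Tamil digit one (U+0BE7).
tamil_one = "\u0be7"

if __name__ == "__main__":
    print(len(syllable), "code points,", utf8_len(syllable), "UTF-8 bytes")
    print("ASCII digit check:", is_ascii_digit(tamil_one))
    print("Unicode digit check:", is_unicode_digit(tamil_one))
```

The syllable renders as one visual unit but is stored as two code points, which is one reason a recognizer built around single Latin letters needs adapting. The digit check illustrates, under the same assumptions, how code that tests only for `0` to `9` would treat Tamil numerals as non-digits.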

Tesseract was originally designed to recognize English text only: “The new page layout analysis for Tesseract was designed from the beginning to be language-independent, but the rest of the engine was developed for English, without a great deal of thought as to how it might work for other languages.” The training document for Tesseract makes a similar point.

Work on the project, and would love to contribute to a project in Apache. I have a good amount of time (around 9 months) to
I have for now only gone through the documents and not yet put my hands into the code or actual working of the engine.

Can you give us some clue as to what you think could be improved about the current Tamil recognition? Changes of configuration variables, or ambiguity rules (the unicharambigs file), don't need

Paul,

Thank you, Paul and Nick, for your inputs. I have put across the efforts taken by me in the mail.

“imagery for doing training is not available. So basically you would have to start all over.”

Nick White, starting all over in what sense? Is it that the training process has to be started from the beginning?
