I've done some preliminary testing the way we had discussed earlier. I used the wiki article on Christianity (set1.txt in the tools repo). This is what I did:
1. Compiled and installed libvarnam/master
2. Learned the word corpus found at savannah.org
3. Transliterated the "manglish" text in set1.txt and compared each result with the correct Malayalam word.
4. Repeated steps 1 to 3 for libvarnam/varnam_stemmer and logged the accuracy to a text file.
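To make the accuracy figure in the steps above concrete, here is a rough sketch of how it could be computed. This is not the script I actually ran. The `transliterate` argument is a hypothetical callable standing in for whatever wraps the varnam build under test, and the toy lookup table is only there so the snippet runs on its own.

```python
from typing import Callable, Iterable, Tuple


def transliteration_accuracy(
    pairs: Iterable[Tuple[str, str]],
    transliterate: Callable[[str], str],
) -> float:
    """Return the fraction of (manglish, expected) pairs transliterated exactly."""
    total = 0
    correct = 0
    for source, expected in pairs:
        total += 1
        if transliterate(source) == expected:
            correct += 1
    # An empty test set has no meaningful accuracy; report 0.0 instead of dividing by zero.
    return correct / total if total else 0.0


# Toy test set of (manglish, expected Malayalam) pairs, for illustration only.
pairs = [
    ("malayalam", "മലയാളം"),
    ("varnam", "വർണം"),
    ("pustakam", "പുസ്തകം"),
]

# Hypothetical stand-in for the real engine: it only "knows" the first two words.
toy_table = dict(pairs[:2])


def toy_transliterate(word: str) -> str:
    # Unknown words are returned unchanged, so they count as misses.
    return toy_table.get(word, word)


accuracy = transliteration_accuracy(pairs, toy_transliterate)
print(f"accuracy: {accuracy:.2%}")
```

Running the same function once per branch (master and varnam_stemmer) on the same pairs would give two numbers that can be compared directly.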
set1.txt contained a total of 248 words. To my disappointment, transliteration accuracy did not improve at all. I redid the experiment with the confidence of the stemmed words set to 1 (it was 0 before), but still did not see an improvement.
The test set contained complex agglutinated words and this might have been the reason that the accuracy did not improve. I will try again with a test set containing more common words.
Let me propose another approach to testing: start from a clean database, make varnam learn a corpus on a particular subject, and then transliterate a different article on the same subject. For example, feed a wiki article under the category "medicine" to varnam first, and then feed it another article from the same category. Some words are bound to overlap. I'll be doing this as well :)
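Before running that experiment, it may help to check how much two articles really overlap. Below is a minimal sketch, assuming plain whitespace tokenization (real articles would need punctuation handling and probably Unicode normalization). The function names and the two sample sentences are made up for illustration.

```python
def vocabulary(text: str) -> set:
    """Split on whitespace and return the set of distinct words."""
    return set(text.split())


def overlap_ratio(training_text: str, test_text: str) -> float:
    """Fraction of distinct test-article words that also occur in the training article."""
    train_words = vocabulary(training_text)
    test_words = vocabulary(test_text)
    if not test_words:
        return 0.0
    return len(test_words & train_words) / len(test_words)


# Made-up stand-ins for two articles from the same category.
training_article = "the heart pumps blood through the body"
test_article = "blood carries oxygen through the body"

ratio = overlap_ratio(training_article, test_article)
print(f"{ratio:.0%} of the test vocabulary was seen during learning")
```

A low ratio would suggest the second article is not a good test of what varnam learned from the first one.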