
Re: [Varnamproject-discuss] Measuring improvement

From: Kevin Martin
Subject: Re: [Varnamproject-discuss] Measuring improvement
Date: Thu, 14 Aug 2014 00:07:58 +0530

Please note that all 408 words were unique. I am attaching the results with this mail. without_stem.txt is essentially just the input words, since without the stemmer no words apart from those in the input corpus are added to the database.

On Wed, Aug 13, 2014 at 11:54 PM, Kevin Martin <address@hidden> wrote:
Here are some results I obtained. I used 408 words from 0.txt in the word corpus to test the improvement. Feeding the 408 words to the stemmer resulted in more than 700 words. However, some of them were intermediate stemmer outputs with no meaning, so I discarded all the meaningless words, leaving a total of 668 words in the exported text file. That is, 260 words are new (668 − 408). Thus, for this test case, the stemmer improves suggestions (learning) by about 63% (260/408).
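The metric above can be sketched in a few lines. This is a minimal illustration (not varnam code), assuming one word per line in the exported files; the word sets here are placeholders sized to match the counts in this test.

```python
# Minimal sketch of the improvement metric: percentage of genuinely new
# words in the cleaned export, relative to the input corpus.
def improvement(original_words, cleaned_words):
    new = cleaned_words - original_words
    return 100.0 * len(new) / len(original_words)

# Placeholder sets matching this test: 408 input words, 668 after cleaning
# the export, so 260 new words.
original = {f"w{i}" for i in range(408)}
cleaned = original | {f"new{i}" for i in range(260)}
print(int(improvement(original, cleaned)))  # -> 63
```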

A cause for concern is the amount of noise the stemmer generates. That said, the extra meaningless words (substrings of meaningful words) generated by the stemmer would end up in the database anyway once the user actually types them. For example:

കാലമായ് 1
കാലമ് 1
കാലം 1

കാലമായ് is the original word; കാലമ് is the meaningless intermediate form generated by the stemmer. Almost every word with a suffix generates noise like this. I have set the confidence level of words learned from the stemmer to zero, so I believe this 'noise' should not interfere with the accuracy of suggestions while the user is typing.
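The effect of the confidence-zero rule can be sketched as follows. This is a hypothetical illustration (not varnam's actual ranking code): if suggestions are ordered by confidence, a stemmer-generated intermediate form at confidence zero sorts below everything the user has actually typed.

```python
# Hypothetical suggestion table: word -> confidence. The intermediate
# stemmer output കാലമ് is stored with confidence 0, per the patch.
suggestions = {
    "കാലമായ്": 1,  # original word, learned from the input
    "കാലമ്": 0,    # meaningless intermediate form from the stemmer
    "കാലം": 1,     # valid stemmed form
}

# Rank by descending confidence: confidence-0 noise stays at the bottom.
ranked = sorted(suggestions, key=lambda w: -suggestions[w])
print(ranked[-1])  # -> കാലമ്
```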

I will try to do similar testing with some Wikipedia articles.

On Wed, Aug 13, 2014 at 10:31 PM, Kevin Martin <address@hidden> wrote:

I was testing the accuracy of transliteration before and after applying the stem patch. With a very small paragraph, after learning only the words in 0.txt, there is an improvement of only one word. But that would be testing transliteration, right? Wouldn't it be more meaningful to feed the entire word corpus into varnam, then export the suggestions database and compare it with the original word corpus? The exported corpus should be larger than the original one.
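The comparison proposed above could be scripted roughly like this. A sketch only: the file names are assumptions, and it presumes the exported database can be dumped as text with one word (optionally followed by a count) per line.

```python
# Load a set of words from a dump file: one word per line, optionally
# followed by a count, blank lines ignored.
def load_words(path):
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

# Usage (paths are assumptions):
# original = load_words("corpus.txt")
# exported = load_words("exported_suggestions.txt")
# print(len(exported - original), "new words learned")
```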

For accurate metrics, I can perhaps do the same for a corpus of 1000 words and see how many new meaningful words are added to the corpus. What do you think?

Attachment: without_stem.txt
Description: Text document

Attachment: with_stem_cleaned.txt
Description: Text document

Attachment: with_stem_unclean.txt
Description: Text document
