Re: [Varnamproject-discuss] Frequency calculation

On Sat, Apr 26, 2014 at 12:43 AM, Kevin Martin <address@hidden> wrote:

I should not have sent the previous email before looking at the code. I have now seen the SQL statement that updates confidence. A similar issue is discussed at [1], where the suggested solution is to use bigint instead of int. Another idea just occurred to me, though it is probably too costly:

Maintain two columns, confidence1 and confidence2. confidence2 is incremented by 1 every time confidence1 reaches a multiple of 5. For example, if confidence1 of a word is 9 and the word is encountered again, confidence1 becomes 10; this fires a trigger that increments confidence2 from 1 to 2. When confidence1 reaches a really large value, we can simply reset it without causing any problems, since confidence2 stays intact. This is not a permanent solution, but it lowers the overflow risk by a factor of 'n', where n is the step between increments of confidence2 (5 in this example).
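To make the two-column idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `words`, the helper functions, and the `STEP` constant are hypothetical and are not taken from the varnam schema; only the column names confidence1 and confidence2 come from the description above. The `NEW.confidence1 > 0` guard is an added assumption so that resetting confidence1 to 0 does not itself bump confidence2.

```python
import sqlite3

STEP = 5  # confidence2 advances once per STEP increments of confidence1


def make_db():
    """Create an in-memory database with a hypothetical `words` table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(f"""
        CREATE TABLE words (
            word        TEXT PRIMARY KEY,
            confidence1 INTEGER NOT NULL DEFAULT 0,
            confidence2 INTEGER NOT NULL DEFAULT 0
        );
        -- Fires only when confidence1 is updated to a positive multiple of STEP.
        CREATE TRIGGER bump_confidence2
        AFTER UPDATE OF confidence1 ON words
        WHEN NEW.confidence1 > 0 AND NEW.confidence1 % {STEP} = 0
        BEGIN
            UPDATE words SET confidence2 = confidence2 + 1
            WHERE word = NEW.word;
        END;
    """)
    return conn


def learn(conn, word):
    """Record one more occurrence of `word`."""
    conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
    conn.execute(
        "UPDATE words SET confidence1 = confidence1 + 1 WHERE word = ?", (word,)
    )


def reset_confidence1(conn, word):
    """Reset the fine-grained counter; confidence2 is left untouched."""
    conn.execute("UPDATE words SET confidence1 = 0 WHERE word = ?", (word,))


def read(conn, word):
    """Return (confidence1, confidence2) for `word`."""
    return conn.execute(
        "SELECT confidence1, confidence2 FROM words WHERE word = ?", (word,)
    ).fetchone()


if __name__ == "__main__":
    conn = make_db()
    for _ in range(10):
        learn(conn, "foo")
    print(read(conn, "foo"))
    reset_confidence1(conn, "foo")
    print(read(conn, "foo"))
```

One consequence of this design: if confidence1 is reset at a value that is not a multiple of STEP, the occurrences counted since the last multiple are lost, so the reset is best done right after a trigger fires.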

On Thu, Apr 24, 2014 at 11:30 PM, Kevin Martin <address@hidden> wrote:

I want to get more familiar with the code base and was hoping to work on this issue:

https://savannah.nongnu.org/bugs/?40401

A simple but inefficient solution would be to use float instead of int and increment the frequency by 0.001 instead of 1. I guess that would make the whole program slower, since working with floats tends to have more overhead.

I believe we are only interested in relative frequencies here. We could set a frequency threshold of, say, 1000: if the frequency of the top word exceeds that of the word with the second-highest frequency (or third, or whatever) by 1000 or more, we apply a normalization function. Rarely used words would then be reset to a frequency of 0 (or 1), and the frequencies of the other words would be adjusted to scale. It is sort of like a percentile system, except that it keeps resetting.
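A rough sketch of the threshold idea in Python follows. The normalization function itself is not specified above, so this assumes a simple proportional rescale with integer division; the names `needs_normalization`, `normalize`, and the `SCALE_TO` ceiling are hypothetical, and only the threshold of 1000 comes from the text.

```python
THRESHOLD = 1000  # gap that triggers normalization (value suggested above)
SCALE_TO = 100    # hypothetical ceiling for the top word after rescaling


def needs_normalization(freqs, rank=2, threshold=THRESHOLD):
    """True if the top frequency leads the rank-th highest by >= threshold."""
    ordered = sorted(freqs.values(), reverse=True)
    if len(ordered) < rank:
        return False
    return ordered[0] - ordered[rank - 1] >= threshold


def normalize(freqs, scale_to=SCALE_TO):
    """Rescale so the top word maps to scale_to; rare words floor to 0."""
    top = max(freqs.values(), default=0)
    if top == 0:
        return dict(freqs)
    # Integer division keeps the values as ints and preserves relative order.
    return {word: f * scale_to // top for word, f in freqs.items()}


def maybe_normalize(freqs):
    """Apply normalize() only when the threshold condition holds."""
    return normalize(freqs) if needs_normalization(freqs) else dict(freqs)


if __name__ == "__main__":
    freqs = {"a": 5000, "b": 3000, "c": 40}
    print(maybe_normalize(freqs))  # {'a': 100, 'b': 60, 'c': 0}
```

With these assumed constants the gap between "a" and "b" is 2000, so the rescale runs, and the rarely used word "c" drops to 0 as described. Words with equal scaled values become indistinguishable, which is the price of keeping the counters small.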

From:	Kevin Martin
Subject:	Re: [Varnamproject-discuss] Frequency calculation
Date:	Sat, 26 Apr 2014 00:44:09 +0530