I'm hoping to improve varnam's learning capabilities as a GSoC project. I've gone through the source code and I have a few doubts; I'd like to check whether my line of thinking is right. Please have a look:
1) Token: A token is an indivisible word; it is the basic building block. 'tokens' is an object (instance? I mean the non-OOP equivalent of an object) of the type varray. Does 'tokens' contain all the possible patterns of a token? For example, would മലയാളം, മലയാളത്തിന്റെ, മലയാളത്തിൽ, and മലയാള all go under the same varray instance 'tokens'? And each word (e.g. മലയാളം) would, I suppose, occupy a slot at tokens->memory. Am I right in this regard?
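To make sure I'm reading the structure right, here is a minimal sketch of how I imagine varray behaves: a growable array of opaque pointers, where each learned word form occupies one slot. The field names (memory, allocated, index) and function names are my assumptions from reading the code, not the actual libvarnam definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of a varray-like growable array. Names are my guesses,
 * not the real libvarnam API. */
typedef struct {
    void **memory;    /* slots holding opaque pointers            */
    size_t allocated; /* number of slots currently allocated      */
    int index;        /* index of the last used slot, -1 if empty */
} varray_sketch;

varray_sketch *varray_sketch_init(void) {
    varray_sketch *a = malloc(sizeof *a);
    a->allocated = 4;
    a->memory = malloc(a->allocated * sizeof *a->memory);
    a->index = -1;
    return a;
}

void varray_sketch_push(varray_sketch *a, void *item) {
    /* grow the backing store when all slots are used */
    if ((size_t)(a->index + 1) == a->allocated) {
        a->allocated *= 2;
        a->memory = realloc(a->memory, a->allocated * sizeof *a->memory);
    }
    a->memory[++a->index] = item;
}

void *varray_sketch_get(const varray_sketch *a, int i) {
    /* out-of-range reads return NULL instead of crashing */
    return (i >= 0 && i <= a->index) ? a->memory[i] : NULL;
}
```

Under this reading, pushing each word form (മലയാളം, മലയാളത്തിന്റെ, ...) places it in its own slot of tokens->memory, which is what I was asking about above.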
2) I see the data type 'v_' used frequently. However, I could not find its definition; I must have missed it. Running Ctrl+F on a few source files did not turn up a definition, so I thought I would simply ask here. I would be really grateful if you could tell me where it is defined and why (i.e. what it does).
3) I read up on the Porter stemmer algorithm. The ideas page says "something like a porter stemmer implementation but integrated into the varnam framework so that new language support can be added easily". I really doubt whether implementing a Porter stemmer would make adding new language support any easier. The English stemmer is an improved version of the original Porter stemmer, and a stemming algorithm is specific to a particular language, since it deals with the suffixes that occur in that language. We need a Malayalam stemmer, and if we want to add support for, say, Telugu one day, we would need a Telugu stemmer. We can, of course, write one generic stemmer and add test cases and suffix-condition checks for each new language, so that stemming can be done through the same function call.
4) The ideas page says "Today, when a word is learned, varnam takes all the possible prefixes into account". Prefixes? Shouldn't it be suffixes?
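Just so we are talking about the same thing, here is the distinction I have in mind, using a plain ASCII word (byte-wise slicing like this would split multibyte Malayalam codepoints, so it is for illustration only): prefixes of "learning" are "l", "le", ..., "learnin", while suffixes are "g", "ng", ..., "earning". The helper functions are hypothetical, not varnam code.

```c
#include <assert.h>
#include <string.h>

/* Copy the prefix of length n (the first n bytes) into out.
 * Illustrative helper, not part of varnam. */
void nth_prefix(const char *word, size_t n, char *out) {
    memcpy(out, word, n);
    out[n] = '\0';
}

/* The suffix of length n is just a pointer into the word's tail. */
const char *nth_suffix(const char *word, size_t n) {
    return word + strlen(word) - n;
}
```

My question is which of these families the learning code actually enumerates when a word is learned.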