varnamproject-discuss

[Varnamproject-discuss] Improving Varnam Learning


From: Kevin Martin
Subject: [Varnamproject-discuss] Improving Varnam Learning
Date: Fri, 28 Feb 2014 12:43:43 +0530

I would like to improve varnam's learning capabilities as a GSoC project. I have gone through the source code and have a few questions, and I want to check whether my understanding is right. Please have a look:

1) Token: As I understand it, a token is an indivisible word and the basic building block. 'tokens' is an instance (the non-OOP equivalent of an object) of the type varray. Does 'tokens' contain all the possible patterns of a token? For example, would മലയാളം, മലയാളത്തിന്റെ, മലയാളത്തിൽ and മലയാള all go under the same varray instance 'tokens'? And would each word (e.g. മലയാളം) occupy a slot in tokens->memory? Am I right in this regard?

2) I see 'v_' used frequently as a data type, but I could not find its definition. Searching a few source files did not turn it up, so I must have missed it. I would be grateful if you could tell me where it is defined and what it does.

3) I read up on the Porter stemmer algorithm. The ideas page says "something like a porter stemmer implementation but integrated into the varnam framework so that new language support can be added easily". I doubt that implementing a Porter stemmer would make adding new language support any easier. The English stemmer is an improved version of the original Porter stemmer, and a stemming algorithm is specific to a particular language because it deals with the suffixes that occur in that language. We need a Malayalam stemmer, and if we want to add support for, say, Telugu one day, we would need a Telugu stemmer. We could of course write one stemmer and add the suffix rules and test cases for each new language, so that stemming is always done through the same function call.
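
The "one stemmer, per-language suffix rules" idea in point 3 could be sketched roughly as follows. This is only an illustration in Python, not varnam code: the names SUFFIX_RULES and stem are hypothetical, the English entries follow step 1a of the original Porter algorithm, and the two Malayalam entries are just the inflections of മലയാളം mentioned in point 1, not a vetted rule set.

```python
# Hypothetical per-language suffix tables: (suffix, replacement) pairs,
# ordered so that longer suffixes are tried before shorter ones.
SUFFIX_RULES = {
    "en": [
        # Step 1a of the original Porter stemmer.
        ("sses", "ss"),
        ("ies", "i"),
        ("ss", "ss"),
        ("s", ""),
    ],
    "ml": [
        # Illustrative only: these hold for nouns ending in an anusvara.
        ("ത്തിന്റെ", "ം"),  # genitive, e.g. മലയാളത്തിന്റെ -> മലയാളം
        ("ത്തിൽ", "ം"),     # locative, e.g. മലയാളത്തിൽ -> മലയാളം
    ],
}


def stem(word, lang):
    """Apply the first matching suffix rule for `lang`.

    Words in a language without rules, or matching no rule,
    are returned unchanged.
    """
    for suffix, replacement in SUFFIX_RULES.get(lang, []):
        # Require something to remain in front of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    print(stem("caresses", "en"))
    print(stem("മലയാളത്തിൽ", "ml"))
```

With this shape, adding a language means adding a table rather than new code, but the rules in each table still have to be written by someone who knows that language's suffixes.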


4) The ideas page says "Today, when a word is learned, varnam takes all the possible prefixes into account". Prefixes? Shouldn't it be suffixes?
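
To make the distinction in point 4 concrete, here is a minimal Python sketch of the two readings. It assumes that "prefixes" means the leading pieces of a word that has already been split into tokens. The split of മലയാളം used below is hypothetical, and the functions prefixes and suffixes are illustrative names, not varnam functions.

```python
def prefixes(tokens):
    """All leading pieces of a tokenized word, shortest first."""
    return ["".join(tokens[:i]) for i in range(1, len(tokens) + 1)]


def suffixes(tokens):
    """All trailing pieces of a tokenized word, longest first."""
    return ["".join(tokens[i:]) for i in range(len(tokens))]


if __name__ == "__main__":
    # Hypothetical token split of മലയാളം, for illustration only.
    tokens = ["മ", "ല", "യാ", "ളം"]
    print(prefixes(tokens))
    print(suffixes(tokens))
```

Under this reading, learning prefixes stores the beginnings of a word, whereas a stemmer works on its endings, so the two are not interchangeable.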

Let me try to draft a Malayalam stemmer. I will post what I come up with here.

regards,

Kevin Martin Jose
