
Re: [Varnamproject-discuss] Improving Varnam Learning

From: Navaneeth K N
Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
Date: Mon, 10 Mar 2014 10:25:01 +0530
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:24.0) Gecko/20100101 Thunderbird/24.3.0


Hello Kevin,

Here are the answers to your other questions.

On 3/9/14 1:56 PM, Navaneeth K N wrote:
> Hello Kevin,
> Thanks for the stemming rules. I didn't get time to review it
> completely, but it looks good so far.
> On 3/9/14 12:10 PM, Kevin Martin wrote:
>> A doubt regarding the vst file. It is an sqlite3 database file right? I
>> could not open it with 'sqlite3 ml.vst'. And the scheme file will be
>> compiled into a .vst file. 
> yes. It is a SQLite file. You should be able to open it with the sqlite
> utility. A scheme file can be compiled into VST file using `varnamc`.
>       varnamc --compile schemes/your_scheme_file
> For the following points, I will send you a detailed email later today.
>> I would like to try out my stemming rules (check
>> attachments). Here's how I assume I should proceed :
>> 1. Write the stemming rules into the scheme file.
>> 2. Compile the scheme file. For this, stem(pattern) should match a
>> corresponding function right? Where should I specify that function? Which
>> file specifies how the scheme file should be compiled?

You are right about the approach. This is how scheme file compilation works:

* `varnamc` is the orchestrator of compilation. It is a Ruby script which
defines a set of functions, like `vowels`, `consonants`, etc.

* `varnamc` includes the specified scheme file in the current script.
Because of this, the scheme file can access all the functions that
`varnamc` defines.

* Once the scheme file is completely included, the relevant functions in
`varnamc` will have been called, and it will have a clear picture of
which symbols to compile.

* `varnamc` calls the `varnam_create_token()` function to persist each
token into the VST file.

So in this case, `stem` will be a Ruby function that takes two
arguments. This can be defined in the `varnamc` file so that it is
available to the scheme file.
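To make that concrete, here is a minimal sketch (the names and the global are my assumptions, not the actual `varnamc` internals) of how such a `stem` function could simply record rules while the scheme file is being included, leaving persistence to the compiler:

```ruby
# Collected stemming rules; varnamc would later persist these into the
# VST file during compilation.
$stemming_rules = []

# The function the scheme file calls. It only records the rule here.
def stem(match, replacement)
  $stemming_rules << [match, replacement]
end

# A scheme file included into this script could then contain lines like:
stem "ude", ""
stem "kal", ""

puts $stemming_rules.inspect  # => [["ude", ""], ["kal", ""]]
```

Because the scheme file is evaluated inside the `varnamc` script, no extra plumbing is needed for the scheme file to see this function.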

Define a new API function in `libvarnam` (api.h), something like:

        int varnam_add_stemming_rule (varnam *handle, const char *match,
                                      const char *replacement);

This will persist the rule into the VST file. You can create a table,
something like `stemming_rules`, and add all the rules there.

Now add a function in the learning module which takes a word to stem,
reads the above table to get the stemming rules, applies them to the
word, and returns the stemmed word. `varnam_learn` has to be modified to
use the new stemming function and learn the base word.
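As a sketch of what that stemming function could do (the rules below are hypothetical stand-ins, written in Latin transliteration for readability), it could apply the first matching suffix rule and repeat until no rule applies:

```ruby
# Hypothetical ordered suffix rules, as they might be read from the
# stemming_rules table: [suffix, replacement].
RULES = [["ude", ""], ["kal", ""], ["am", ""], ["i", ""]]

# Apply the first rule whose suffix matches, repeating until the word
# stops changing.
def stem(word)
  loop do
    rule = RULES.find { |suffix, _| word.end_with?(suffix) }
    break if rule.nil?
    stemmed = word[0...(word.length - rule[0].length)] + rule[1]
    break if stemmed == word  # guard against non-shrinking rules
    word = stemmed
  end
  word
end

puts stem("malayalam")         # => "malayal"
puts stem("thozhilalikalude")  # => "thozhilal"
```

Note how "thozhilalikalude" passes through "thozhilalikal" and "thozhilali" before settling on "thozhilal", which is why the rule ordering matters.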

Make a set of words in a text file and apply stemming to each of them to
find out how your algorithm performs and how much accuracy it gives you.
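A tiny evaluation harness for that exercise could look like the following; the stemmer, its two rules, and the word pairs are made-up placeholders:

```ruby
# Stand-in stemmer with two hypothetical rules, just for the harness.
RULES = [["ude", ""], ["am", ""]]

def stem(word)
  rule = RULES.find { |suffix, _| word.end_with?(suffix) }
  rule ? word[0...(word.length - rule[0].length)] + rule[1] : word
end

# In practice these pairs would come from a text file of words and
# their hand-verified base forms.
expected = { "malayalam" => "malayal", "veedukalude" => "veedukal" }

correct  = expected.count { |word, base| stem(word) == base }
accuracy = 100.0 * correct / expected.size
puts format("accuracy: %.1f%%", accuracy)  # => accuracy: 100.0%
```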

>> 3. For testing, I'd like to input a word, have it transliterated (using
>> varnam_transliterate), and THEN stemmed. This stemmed word is displayed and
>> then passed to varnam_train so that that particular pattern is always matched
>> to that word. And what is the difference between varnam_learn and
>> varnam_train?

`varnam_learn` learns a word and figures out all the different
possibilities of writing it.

For example:

        varnam_learn ("കെവിൻ") = ["kevin", "kewin"]

`varnam_train()` is used when varnam can't figure out a proper pattern
for typing a word. Consider the word "ഇംഗ്ലീഷ്". The most common way to
write this is "english". Now if you ask varnam to learn "ഇംഗ്ലീഷ്":

        varnam_learn ("ഇംഗ്ലീഷ്") = ["imgleesh", "imglish"]

It hasn't learned the pattern "english". In this case, you train it by
calling `varnam_train ("english", "ഇംഗ്ലീഷ്")`.
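A rough way to picture the learn side (the token table here is a made-up fragment, not the real VST contents): learning expands every way each symbol can be typed into the set of known patterns, which is why a pattern like "english" can never fall out of it and needs training instead:

```ruby
# Hypothetical pattern table: each symbol with the ways it can be typed.
TOKENS = { "കെ" => ["ke"], "വി" => ["vi", "wi"], "ൻ" => ["n"] }

# The cartesian product of the per-symbol patterns gives every typing.
def possible_patterns(symbols)
  symbols.map { |s| TOKENS.fetch(s) }
         .reduce { |acc, pats| acc.product(pats).map(&:join) }
end

puts possible_patterns(["കെ", "വി", "ൻ"]).inspect  # => ["kevin", "kewin"]
```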

>> Also, I'm drafting a definitions_list - a file containing location of the
>> definitions of a structure/function and what it does. It will not be proper
>> documentation though. I have attached a sample with this mail. It's really
>> helping me because after every 15 minutes I'll forget where a particular
>> structure/function was defined and then I'll start searching all the source
>> codes. If you'd like to have this list I'll finish it soon and submit a PR.

Cool. Thanks.

>> Thank you for your time,
>> Kevin Martin Jose
>> On Sat, Mar 8, 2014 at 4:57 PM, Kevin Martin <address@hidden> wrote:
>>> I have drafted a set of stemming rules. The file is attached with this
>>> post. Please go through it.
>>> You were right in that it is impossible to achieve 100% stemming. I took a
>>> malayalam paragraph and tried stemming the words. The main problem is that
>>> in malayalam many words are compounded together and are thus difficult to
>>> segregate. Also, the stemming rules I have provided do not specify any
>>> specific order. Those rules will have to be applied in a specific order to
>>> stem a given word. The English stemmer could do it without recursion, and I
>>> think the malayalam stemmer could too - with the right ordering.
>>> There's a number assigned to each rule - the line number. So rule 3 refers
>>> to the statement written in line 3. I have tried to provide examples where
>>> ever it seemed necessary.
>>> On Fri, Mar 7, 2014 at 10:48 PM, Kevin Martin Jose <
>>> address@hidden> wrote:
>>>>   Thanks a lot.
>>>>  ------------------------------
>>>> From: Navaneeth K N <address@hidden>
>>>> Sent: ?07-?03-?2014 09:56
>>>> To: address@hidden
>>>> Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
>> Hello Kevin,
>> On 3/5/14 12:12 AM, Kevin Martin wrote:
>>>>>> I went through the vst_tokenize() function. To my disappointment,
>>>>>> understanding it was not as easy as I thought. I wrestled for a few
>> hours
>>>>>> with code and decided that I need to assimilate a few key concepts
>> before I
>>>>>> can understand what vst_tokenize does.
>>>>>> 1. What is a vpool? Why is it needed? I read its definition but I do not
>>>>>> understand its purpose or how it is used. Is it a pool of free varrays?
>>>>>>    To be more specific, I would like to know the purpose of elements
>> like
>>>>>> v_->strings_pool. What does the function get_pooled_string()
>>>>>>    return?
>> It is object pooling, a technique to reuse already allocated objects
>> rather than keep on reallocating them. This improves performance over
>> time because the pool is destroyed only after the handle is destroyed.
>> Mostly you will have the handle available throughout the application.
>> get_pooled_string() returns a strbuf, a dynamically growing string type.
>>>>>> 2.What is a vcache_entry? What is the purpose of the strbuf 'cache_key'
>> in
>>>>>> vst_tokenize()? What are the contents of a vcache?
>> Vcache is a hashtable. This is another optimization technique to reuse
>> already tokenized words. For example: when "malayalam" is transliterated,
>> tokenization happens and cache gets filled with tokens. When it is
>> transliterated again, tokenization will just use the cache and won't
>> touch the disk. This improves performance dramatically.
>>>>>> 3. What is the purpose of int tokenize_using and int match_type, the
>>>>>> parameters of vst_tokenize()?
>> A tokenization can be of two types, pattern tokenization and value
>> tokenization. Pattern tokenization is about tokenizing words which you
>> will send for transliteration. Value tokenization is on the indic text.
>> tokenize ("malayalam") = pattern tokenization
>> tokenize ("മലയാളം") = value tokenization
>> To understand how tokenization works, you can use the `print-tokens`
>> tool available in the `tools` directory. It is not compiled by default. You
>> need to pass `-DBUILD_TOOLS=true` when doing `cmake .` to get it compiled.
>>>>>> 4. Assume that a malayalam stemmer ml_stemmer() has been implemented.
>> Will
>>>>>> it replace vst_tokenize() or will the line :
>>>>>>                           base=ml_stemmer(input)
>>>>>>     be inside the vst_tokenize() function? The answer to this question
>> must
>>>>>> be pretty straight forward but I cannot see it since I do not
>>>>>>     understand vst_tokenize() yet.
>> The stemmer won't have any connection to tokenization. It will be part of
>> the learning subsystem. So the `varnam_learn()` function will use it.
>> Also stemmer has to be configurable for each language. You need to add a
>> new function to the scheme file compiler so that you can do something
>> like the following in each scheme file.
>> stem ("????", "?")
>> This rule needs to be compiled into `vst` file and during learn it
>> should be utilized to do the stemming.
>> We may also need to fix how varnam combines two tokens. Currently, when
>> a consonant and a vowel comes together, varnam will render the
>> consonant-vowel form. But this is very basic and won't work for some
>> conditions where chill letters are involved. I will think about this and
>> draft the idea.
>>>>>> On Tue, Mar 4, 2014 at 9:56 PM, Kevin Martin <
>> address@hidden>wrote:
>>>>>>> Thank you. I have a much better idea now. Another clarification needed
>> :
>>>>>>> stem(????????????? )= ???????? or ????????
>>>>>>> Even though stemming it to ???????? makes more sense in malayalam, it
>>>>>>> would be clearer to stem 'thozhilalikalude' to 'thozhilal' (without the
>>>>>>> trailing 'i') in English. Hence IMO ??????? would be a better base word
>>>>>>> than ????????. But the examples you provided in the previous mail
>> [given
>>>>>>> below] would hold.
>>>>>>> [Examples from previous mail]
>>>>>>>> stem(??????) = ???
>>>>>>>> stem(??????????) = ??????
>>>>>>>> stem(??????????????) = ???????????
>>>>>>> On Tue, Mar 4, 2014 at 10:12 AM, Navaneeth K N <address@hidden> wrote:
>>>>>> Hello Kevin,
>>>>>> Good to see that you are making progress.
>>>>>> On 3/3/14 12:58 PM, Kevin Martin wrote:
>>>>>>>>>>> No it is prefixes. For example, when the word മലയാളം is learned,
>> varnam
>>>>>>>>>>> learns the prefixes, മല, മലയാ etc. So when it gets a pattern like
>>>>>>>>>>> "malayali", it can easily tokenize it rather than typing like
>>>>>> "malayaali".
>>>>>>>>>> 1.What do you mean by tokenization? A token is a pattern to symbol
>>>>>> mapping.
>>>>>>>>>> So tokenization means matching the entire word to its malayalam
>> symbol?
>>>>>> Tokenization is splitting the input into multiple tokens. For example:
>>>>>> input - malayalam
>>>>>> tokens - [[ma], [la], [ya], [lam]]
>>>>>> Each will be a `vtoken` instance with relevant attributes set. For the
>>>>>> token `ma`, it will be marked as a consonant.
>>>>>> Tokenization happens left to right. It is a greedy tokenizer which finds the
>>>>>> longest possible match. Look at `vst_tokenize` function to learn how it
>>>>>> works.
>>>>>>>>>> 2. The porter stemmer stems the given English word to a base word by
>>>>>>>>>> stripping it off all the suffixes. How can we stem a malayalam word?
>>>>>>>>>> Suppose that varnam is encountering the word ?????? for the first
>> time.
>>>>>> The
>>>>>>>>>> input was 'malayalam'. In this case, as of now, varnam learns to map
>>>>>> 'mala'
>>>>>>>>>> to ??, 'malaya' to ???? and so on? Hence learning a word makes
>> varnam
>>>>>> learn
>>>>>>>>>> the mappings for all its prefixes, right?
>>>>>> Something like the following:
>>>>>> stem(??????) = ???
>>>>>> stem(??????????) = ??????
>>>>>> stem(??????????????) = ???????????
>>>>>>>>>> 3. Let me propose a stemmer that rips off suffixes. Consider the
>> word
>>>>>>>>>> ?????? (malayalam) that was learned by varnam.
>>>>>>>>>> I think the goal of the stemmer should be to get the base word ?????
>>>>>>>>>> (malayal) rather than ????. To do this, I think we will need to
>> compare
>>>>>> the
>>>>>>>>>> obtained base word with the original word. Let us assume that the
>>>>>> stemming
>>>>>>>>>> algorithm got the base word 'malayal' from 'malayalam'. We can make
>> sure
>>>>>>>>>> that this is mapped to ????? rather than ???? by ripping off the
>>>>>> equivalent
>>>>>>>>>> suffix from the malayalam transliteration word. That is,
>>>>>>>>>> removing the suffix 'am' from 'malayalam' removes the ? from
>> '??????'.
>>>>>> For
>>>>>>>>>> this, 'am' needs should have been matched with ? in the scheme file.
>>>>>> Hence
>>>>>>>>>> we would get ????? for 'malayal' and this can be learned. This would
>>>>>> result
>>>>>>>>>> in the easier mapping of malayali to ?????? .
>>>>>>>>>> Another example :
>>>>>>>>>> thozhilalikalude is ?????????????
>>>>>>>>>> a).sending 'thozhilalikalude' to the stemmer, we obtain
>> 'thozhilalikal'
>>>>>> in
>>>>>>>>>> step 1. As a corresponding step  ? ?? is removed from ?????????????
>> and
>>>>>>>>>> results in ??????????. No learning occurs in this step because we
>> have
>>>>>> not
>>>>>>>>>> reached the base word yet.
>>>>>>>>>> b) 'thozhilalikal' is stemmed to 'thozhilali' - ?? is removed from
>>>>>>>>>> ??????????. Even though 'kal', the suffix that was removed, could be
>>>>>>>>>> matched to ??, we do not do that because the word before stemming
>> had
>>>>>>  ?.
>>>>>>>>>> Produces ???????? .
>>>>>>>>>> c) thozhilali is stemmed to thozhilal - Produces ??????? from
>> ????????.
>>>>>>>>>> This base word and the corresponding malayalam mapping is learned by
>>>>>> varnam.
>>>>>>>>>> I have not completed drafting the malayalam stemmer algorithm. It
>> seems
>>>>>> to
>>>>>>>>>> have many more condition checks than I had anticipated and could
>> end up
>>>>>>>>>> being larger and more complicated than the porter stemmer. But
>> before I
>>>>>>>>>> proceed, I need to know whether the logic I presented above is
>> correct.
>>>>>> You are headed in the right direction.
>>>>>> Stemming in Indian languages is really complex because of the way we
>>>>>> write words. So don't worry about getting 100% stemming. IMO, that is
>>>>>> impossible to achieve. So aim for stemming rules which will
>>>>>> probably give you a 60-70% success rate.
>>>>>> We should make these stemming rules configurable in the scheme file. So
>>>>>> in the malayalam scheme file, you define,
>>>>>>         stem(a) = b
>>>>>> this gets compiled into the `vst` file and during runtime, `libvarnam`
>>>>>> will read the stemming rule from the `vst` file and apply it to the
>>>>>> target word.
>>>>>> As part of this, we also need to implement a sort of conjunct rule in
>>>>>> `libvarnam` so that it knows how to combine a base form and a vowel.
>>>>>> Don't worry about this now. We will deal with it later.
>>>>>>>>>> regards,
>>>>>>>>>> Kevin Martin Jose
>>>>>>>>>> On Fri, Feb 28, 2014 at 7:50 PM, Navaneeth K N <address@hidden>
>> wrote:
>>>>>>>>>> Hello Kevin,
>>>>>>>>>> On 2/28/14 12:43 PM, Kevin Martin wrote:
>>>>>>>>>>>>> I'm seeking to improve varnam's learning capabilities as a GSoC
>>>>>> project.
>>>>>>>>>>>>> I've gone through the source code and I have doubts. I need to
>>>>>> clarify if
>>>>>>>>>>>>> my line of thinking is right. Please have a look :
>>>>>>>>>>>>> 1) Token : A token is an indivisible word. A token is the basic
>>>>>> building
>>>>>>>>>>>>> block. 'tokens' is an object (instance? I mean the non-OOP
>>>>>> equivalent of
>>>>>>>>>> an
>>>>>>>>>>>>> object) of the type varray. 'tokens' contain all the possible
>>>>>> patterns
>>>>>>>>>> of a
>>>>>>>>>>>>> token? For example, ?????? ????????????? ?????????? ????? would
>> all
>>>>>> go
>>>>>>>>>>>>> under the same varray instance 'tokens'?. And each word ( for eg
>>>>>> ?????? )
>>>>>>>>>>>>> would occupy a slot at tokens->memory I suppose. Am I right in
>> this
>>>>>>>>>> regard?
>>>>>>>>>> No.
>>>>>>>>>> In ??????, ? will be a token. `varray` is a generic datastructure
>> that
>>>>>>>>>> can keep any elements and grow the storage as required. So
>>>>>>>>>> `tokens->memory` will have the following tokens, ?, ?, ??, ??. Each
>>>>>>>>>> token knows about a pattern and a value.
>>>>>>>>>> Look at the scheme file in "schemes/" directory. A token is a
>>>>>>>>>> pattern-value mapping.
>>>>>>>>>>>>> 2) I see the data type 'v_' frequently used. However,I could not
>>>>>> find its
>>>>>>>>>>>>> definition! I missed it, of course. Running ctrl+f on a few
>> source
>>>>>> files
>>>>>>>>>>>>> did not turn up the definitions. So I thought I would simply ask
>>>>>> here! I
>>>>>>>>>>>>> would be really grateful if you can tell me where it is defined
>> and
>>>>>> why
>>>>>>>>>> it
>>>>>>>>>>>>> is defined (what it does)
>>>>>>>>>> That's a dirty hack. It's a define, done at[1]. It will get
>> replaced as
>>>>>>>>>> `handle->internal` by the compiler. It is just a shorthand for
>>>>>>>>>> `handle->internal`. Not elegant, but got used to it. We will clean
>> it up
>>>>>>>>>> one day. Sorry for the confusion.
>>>>>>>>>> [1]:
>>>>>>>>>>>>> 3) I read the porter stemmer algorithm. The ideas page say
>>>>>> *"something
>>>>>>>>>> like
>>>>>>>>>>>>> a porter stemmer implementation but integrated into the varnam
>>>>>> framework
>>>>>>>>>> so
>>>>>>>>>>>>> that new language support can be added easily"*. I really doubt
>> if
>>>>>>>>>>>>> implementing a porter stemmer would make adding new language
>> support
>>>>>> any
>>>>>>>>>>>>> easier. The English stemmer is an improvised version of the
>> original
>>>>>>>>>> porter
>>>>>>>>>>>>> stemmer. A stemming algorithm is specific to a particular
>> language
>>>>>> since
>>>>>>>>>> it
>>>>>>>>>>>>> deals with the suffixes that occur in that language. We need a
>>>>>> malayalam
>>>>>>>>>>>>> stemmer, and if we want to add support to say telugu one day, we
>>>>>> would
>>>>>>>>>> need
>>>>>>>>>>>>> a telugu stemmer. We can of course write one stemmer and add test
>>>>>> cases
>>>>>>>>>> and
>>>>>>>>>>>>> suffix condition checks in the new language so that tokenization
>> can
>>>>>> be
>>>>>>>>>>>>> done with the same function call.
>>>>>>>>>> When I said integrated into the framework, I mean make the stemmer
>>>>>>>>>> configurable at a scheme file level. Basically the scheme file will
>> have
>>>>>>>>>> a way to define the stemming. Now when a new language is added,
>> there
>>>>>>>>>> will be a new scheme file and the stemming rules for that language
>> goes
>>>>>>>>>> to the appropriate scheme file. All varnam needs to know is how to
>>>>>>>>>> properly evaluate those rules.
>>>>>>>>>> I am in the process of writing some documentation explaining the
>> scheme
>>>>>>>>>> file and vst files. I will send you once it is done. It will make
>> this
>>>>>>>>>> much easier to understand.
>>>>>>>>>>>>> 4) The ideas page say "Today, when a word is learned, varnam
>> takes
>>>>>> all
>>>>>>>>>> the
>>>>>>>>>>>>> possible prefixes into account". Prefixes? Shouldn't it be
>> suffixes?
>>>>>>>>>> No it is prefixes. For example, when the word മലയാളം is learned,
>> varnam
>>>>>>>>>> learns the prefixes, മല, മലയാ etc. So when it gets a pattern like
>>>>>>>>>> "malayali", it can easily tokenize it rather than typing like
>>>>>> "malayaali".
>>>>>>>>>> Suffixes won't help because tokenization is left to right. This is
>> where
>>>>>>>>>> another major improvement could be possible in varnam. If we can
>> come up
>>>>>>>>>> with a tokenization algorithm, which takes prefixes, suffixes and
>> partial
>>>>>>>>>> matches into account, then we literally can transliterate any word.
>> But
>>>>>>>>>> it's a hard problem which needs lots of research and effort. The
>> effort
>>>>>>>>>> will be doing it at a scale at which varnam is operating now. Today,
>>>>>>>>>> every key stroke that you make on the varnam editor, is searching
>> over 7
>>>>>>>>>> million patterns to predict the result. All this happens in less
>> than a
>>>>>>>>>> second. Improving tokenization and keeping the current performance
>> is a
>>>>>>>>>> *hard* problem.
>>>>>>>>>>>>> Let me try and coin a malayalam stemmer. I will post what I come
>> up
>>>>>> with
>>>>>>>>>>>>> here.
>>>>>>>>>> That's great. Feel free to ask any questions. You are already asking
>>>>>>>>>> pretty good questions. Good going.
>>>>>>>>>>>>> regards,
>>>>>>>>>>>>> Kevin Martin Jose
