Re: [Varnamproject-discuss] Improving Varnam Learning


From: Navaneeth K N
Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
Date: Mon, 10 Mar 2014 10:25:01 +0530
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:24.0) Gecko/20100101 Thunderbird/24.3.0

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hello Kevin,

Here are the answers to your other questions.

On 3/9/14 1:56 PM, Navaneeth K N wrote:
> Hello Kevin,
> 
> Thanks for the stemming rules. I didn't get time to review it
> completely. But looks good so far.
> 
> On 3/9/14 12:10 PM, Kevin Martin wrote:
>> A doubt regarding the vst file. It is an sqlite3 database file right? I
>> could not open it with 'sqlite3 ml.vst'. And the scheme file will be
>> compiled into a .vst file. 
> 
> Yes, it is a SQLite file. You should be able to open it with the sqlite
> utility. A scheme file can be compiled into a VST file using `varnamc`.
> 
>       varnamc --compile schemes/your_scheme_file
> 
> For the following points, I will send you a detailed email later today.
> 
>> I would like to try out my stemming rules (check
>> attachments). Here's how I assume I should proceed :
> 
>> 1. Write the stemming rules into the scheme file.
> 
>> 2. Compile the scheme file. For this, stem(pattern) should match a
>> corresponding function right? Where should I specify that function? Which
>> file specifies how the scheme file should be compiled?

You are right about the approach. This is how scheme file compilation
works.

* `varnamc` is the orchestrator of compilation. It is a Ruby script which
defines a set of functions, like `vowels`, `consonants` etc.

* `varnamc` includes the specified scheme file in the current script.
Because of this, scheme file can access all the functions that `varnamc`
exposes.

* Once the scheme file is completely included, the relevant functions in
`varnamc` will have been called, and it will have a clear picture of which
symbols to compile.

* `varnamc` calls the `varnam_create_token()` function to persist each
token to the VST file.

So in this case, `stem` will be a Ruby function that takes two
arguments. This can be defined in the `varnamc` file so that it is
available to the scheme file.

Define a new API function in `libvarnam` (api.h), something like:

        int varnam_add_stemming_rule (varnam *handle, const char *match,
                                      const char *replacement);

This will persist the rule into the VST file. You can create a table,
say `stemming_rules`, and add all the rules there.

Now add a function to the learning module which takes a word to stem,
reads the above table to get the stemming rules, applies them to the
word, and returns the stemmed word. `varnam_learn` has to be modified to
use the new stemming function and learn the base word.
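
To make this concrete, here is a minimal sketch in C of what such a
stemming function could look like. It assumes the rules have already been
read from the table into an array and that each rule is applied at most
once, in table order. The names `stem_rule`, `ends_with` and `stem_word`
are made up for illustration; none of them exist in libvarnam today.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory copy of one row of the stemming rules table. */
struct stem_rule {
    const char *match;       /* suffix to look for */
    const char *replacement; /* text that replaces the suffix */
};

/* Returns 1 if `word` ends with `suffix`, 0 otherwise. */
static int ends_with(const char *word, const char *suffix)
{
    size_t wl = strlen(word), sl = strlen(suffix);
    return sl <= wl && memcmp(word + wl - sl, suffix, sl) == 0;
}

/*
 * Applies each rule once, in table order, and writes the result to `out`.
 * Matching is bytewise, so it also works on UTF-8 encoded text.
 * Returns 0 on success, -1 if `out` is too small.
 */
static int stem_word(const struct stem_rule *rules, size_t nrules,
                     const char *word, char *out, size_t outsz)
{
    size_t i;

    if (strlen(word) >= outsz)
        return -1;
    strcpy(out, word);

    for (i = 0; i < nrules; i++) {
        size_t keep;

        if (rules[i].match[0] == '\0' || !ends_with(out, rules[i].match))
            continue; /* empty or non-matching rule: nothing to do */

        keep = strlen(out) - strlen(rules[i].match);
        if (keep + strlen(rules[i].replacement) >= outsz)
            return -1;
        strcpy(out + keep, rules[i].replacement);
    }
    return 0;
}
```

With two illustrative rules, ("ude", "") followed by ("kal", ""), the
romanized word "thozhilalikalude" from your earlier mail is reduced to
"thozhilali". The real rules will of course match Malayalam suffixes, and
their order will matter.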

Make a set of words in a text file, and apply stemming to each of them
to find out how your algorithm performs and how much accuracy it can
give you.
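
A rough sketch of how the accuracy could be measured, in C. It assumes
the test words and the base forms you expect are available as pairs (in
practice they would be read from your text file), and that the stemmer
has the signature shown in `stem_fn`. All names here are hypothetical,
and `strip_kal` is only a stand-in stemmer so that the sketch is complete.

```c
#include <stddef.h>
#include <string.h>

/* One test case: an inflected word and the base form a human expects. */
struct stem_case {
    const char *word;
    const char *expected;
};

/* Signature assumed for the stemmer under test (hypothetical). */
typedef int (*stem_fn)(const char *word, char *out, size_t outsz);

/* Returns the fraction of cases (0.0 to 1.0) that are stemmed exactly right. */
static double stem_accuracy(stem_fn stem, const struct stem_case *cases,
                            size_t n)
{
    char out[256];
    size_t i, correct = 0;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++) {
        if (stem(cases[i].word, out, sizeof out) == 0 &&
            strcmp(out, cases[i].expected) == 0)
            correct++;
    }
    return (double)correct / (double)n;
}

/* Stand-in stemmer for the demo: strips a trailing "kal" if present. */
static int strip_kal(const char *word, char *out, size_t outsz)
{
    size_t len = strlen(word);

    if (len >= outsz)
        return -1;
    strcpy(out, word);
    if (len >= 3 && strcmp(out + len - 3, "kal") == 0)
        out[len - 3] = '\0';
    return 0;
}
```

Calling `stem_accuracy(strip_kal, cases, n)` over your word list gives a
single number that you can track while you reorder or refine the rules.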

> 
>> 3. For testing, I'd like to input a word, have it transliterated (using
>> varnam_transliterate), and THEN stemmed. This stemmed word is displayed and
>> then passed to varnam_train so that that particular pattern is always matched
>> to that word. And what is the difference between varnam_learn and
>> varnam_train?

`varnam_learn` learns a word and figures out the different possible ways
of writing it.

For example:

        varnam_learn ("കെവിൻ") = ["kevin", "kewin"]

`varnam_train()` is used when varnam can't figure out a proper pattern
to type a word. Consider the word "ഇംഗ്ലീഷ്". The most common way to write
this is "english". Now if you ask varnam to learn "ഇംഗ്ലീഷ്":

        varnam_learn ("ഇംഗ്ലീഷ്") = ["imgleesh", "imglish"]

It hasn't learned the pattern "english". In this case, you train it by
calling `varnam_train ("english", "ഇംഗ്ലീഷ്")`.
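
A toy model of the difference, in C. This is not how libvarnam stores
patterns; it is only a small in-memory table with made-up names
(`pattern_store`, `store_add`, `store_lookup`). Learning is modelled as
adding the patterns varnam derives by itself, and training as adding one
explicit pattern on top of them.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PATTERNS 32

/* Toy pattern store: each entry maps a typed pattern to a word. */
struct pattern_store {
    const char *pattern[MAX_PATTERNS];
    const char *word[MAX_PATTERNS];
    size_t count;
};

/* Adds one pattern -> word mapping. Returns 0 on success, -1 if full. */
static int store_add(struct pattern_store *s, const char *pattern,
                     const char *word)
{
    if (s->count >= MAX_PATTERNS)
        return -1;
    s->pattern[s->count] = pattern;
    s->word[s->count] = word;
    s->count++;
    return 0;
}

/* Returns the word stored for `pattern`, or NULL if it is unknown. */
static const char *store_lookup(const struct pattern_store *s,
                                const char *pattern)
{
    size_t i;

    for (i = 0; i < s->count; i++)
        if (strcmp(s->pattern[i], pattern) == 0)
            return s->word[i];
    return NULL;
}
```

After "learning", the store holds "imgleesh" and "imglish", so a lookup
of "english" returns NULL. Only after the equivalent of
`varnam_train ("english", word)`, that is one more `store_add` call, does
"english" resolve to the word.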

> 
> 
>> Also, I'm drafting a definitions_list - a file containing the location of
>> the definition of each structure/function and what it does. It will not be
>> proper documentation though. I have attached a sample with this mail. It's
>> really helping me because after every 15 minutes I'll forget where a
>> particular structure/function was defined and then I'll start searching all
>> the source files. If you'd like to have this list I'll finish it soon and
>> submit a PR.

Cool. Thanks.

> 
>> Thank you for your time,
> 
>> Kevin Martin Jose
> 
> 
> 
>> On Sat, Mar 8, 2014 at 4:57 PM, Kevin Martin <address@hidden> wrote:
> 
>>> I have drafted a set of stemming rules. The file is attached with this
>>> post. Please go through it.
>>>
>>> You were right in that it is impossible to achieve 100% stemming. I took a
>>> malayalam paragraph and tried stemming the words. The main problem is that
>>> in malayalam many words are compounded together and are thus difficult to
>>> segregate. Also, the stemming rules I have provided do not mention any
>>> specific order. Those rules will have to be applied in a specific order to
>>> stem a given word. The English stemmer could do it without recursion, and I
>>> think the malayalam stemmer could too - with the right ordering.
>>>
>>> There's a number assigned to each rule - the line number. So rule 3 refers
>>> to the statement written in line 3. I have tried to provide examples
>>> wherever it seemed necessary.
>>>
>>>
>>> On Fri, Mar 7, 2014 at 10:48 PM, Kevin Martin Jose <address@hidden> wrote:
>>>
>>>>   Thanks a lot.
>>>>  ------------------------------
>>>> From: Navaneeth K N <address@hidden>
>>>> Sent: 07-03-2014 09:56
>>>> To: address@hidden
>>>> Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
>>>>
>> Hello Kevin,
> 
>> On 3/5/14 12:12 AM, Kevin Martin wrote:
>>>>>> I went through the vst_tokenize() function. To my disappointment,
>>>>>> understanding it was not as easy as I thought. I wrestled for a few
>>>>>> hours with code and decided that I need to assimilate a few key
>>>>>> concepts before I can understand what vst_tokenize does.
>>>>>>
>>>>>> 1. What is a vpool? Why is it needed? I read its definition but I do not
>>>>>>    understand its purpose or how it is used. Is it a pool of free varrays?
>>>>>>    To be more specific, I would like to know the purpose of elements like
>>>>>>    v_->strings_pool. What does the function get_pooled_string() return?
> 
>> It is object pooling, a technique to reuse already allocated objects
>> rather than reallocating them each time. This improves performance over
>> time because the pool is destroyed only when the handle is destroyed,
>> and you will mostly have the handle available throughout the application.
> 
>> get_pooled_string() returns a strbuf, a dynamically growing string type.
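
As a rough illustration of the pooling idea quoted above, here is a small
sketch in C. It is not the actual `vpool` code: `buf`, `buf_pool`,
`pool_get`, `pool_reset` and `pool_destroy` are names invented for this
example, and the fixed-size array is a simplification.

```c
#include <stdlib.h>

#define POOL_MAX 16

/* Stand-in for a growable string; only the fields the sketch needs. */
struct buf {
    char *data;
    size_t len;
};

struct buf_pool {
    struct buf *items[POOL_MAX]; /* every buffer allocated so far */
    size_t allocated;            /* number of filled slots */
    size_t next;                 /* next slot to hand out */
};

/* Hands out a buffer, allocating only when no pooled one is free. */
static struct buf *pool_get(struct buf_pool *p)
{
    struct buf *b;

    if (p->next < p->allocated) {
        b = p->items[p->next++];
        b->len = 0; /* reuse: clear contents, keep the allocation */
        return b;
    }
    if (p->allocated == POOL_MAX)
        return NULL;
    b = calloc(1, sizeof *b);
    if (b == NULL)
        return NULL;
    p->items[p->allocated++] = b;
    p->next = p->allocated;
    return b;
}

/* Makes every pooled buffer available again without freeing anything. */
static void pool_reset(struct buf_pool *p)
{
    p->next = 0;
}

/* Frees everything; meant to run once, when the handle goes away. */
static void pool_destroy(struct buf_pool *p)
{
    size_t i;

    for (i = 0; i < p->allocated; i++) {
        free(p->items[i]->data);
        free(p->items[i]);
    }
    p->allocated = p->next = 0;
}
```

After a `pool_reset`, the next `pool_get` returns the same object that was
handed out first, so repeated operations stop hitting the allocator.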
> 
>>>>>>
>>>>>> 2. What is a vcache_entry? What is the purpose of the strbuf 'cache_key'
>>>>>>    in vst_tokenize()? What are the contents of a vcache?
> 
>> Vcache is a hashtable. This is another optimization technique to reuse
>> already tokenized words. For example, when "malayalam" is transliterated,
>> tokenization happens and the cache gets filled with tokens. When it is
>> transliterated again, tokenization will just use the cache and won't
>> touch the disk. This improves performance dramatically.
> 
>>>>>>
>>>>>> 3. What is the purpose of int tokenize_using and int match_type, the
>>>>>> parameters of vst_tokenize()?
> 
>> Tokenization can be of two types, pattern tokenization and value
>> tokenization. Pattern tokenization is about tokenizing the words which you
>> send for transliteration. Value tokenization works on the Indic text.
> 
>> tokenize ("malayalam") = pattern tokenization
>> tokenize ("??????") = value tokenization
> 
>> To understand how tokenization works, you can use the `print-tokens`
>> tool available in the `tools` directory. It is not compiled usually. You
>> need to pass `-DBUILD_TOOLS=true` when doing `cmake .` to get it compiled.
> 
>>>>>>
>>>>>> 4. Assume that a malayalam stemmer ml_stemmer() has been implemented.
>>>>>>    Will it replace vst_tokenize() or will the line :
>>>>>>
>>>>>>                           base=ml_stemmer(input)
>>>>>>
>>>>>>    be inside the vst_tokenize() function? The answer to this question
>>>>>>    must be pretty straightforward but I cannot see it since I do not
>>>>>>    understand vst_tokenize() yet.
> 
>> The stemmer won't have any connection to tokenization. It will be part of
>> the learning subsystem, so the `varnam_learn()` function will use it.
> 
>> Also, the stemmer has to be configurable for each language. You need to add
>> a new function to the scheme file compiler so that you can do something
>> like the following in each scheme file.
> 
>> stem ("????", "?")
> 
>> This rule needs to be compiled into the `vst` file, and during learning it
>> should be used to do the stemming.
> 
>> We may also need to fix how varnam combines two tokens. Currently, when
>> a consonant and a vowel come together, varnam will render the
>> consonant-vowel form. But this is very basic and won't work for some
>> conditions where chill letters are involved. I will think about this and
>> draft the idea.
> 
> 
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 4, 2014 at 9:56 PM, Kevin Martin <address@hidden> wrote:
>>>>>>
>>>>>>> Thank you. I have a much better idea now. Another clarification needed:
>>>>>>>
>>>>>>> stem(????????????? )= ???????? or ????????
>>>>>>>
>>>>>>> Even though stemming it to ???????? makes more sense in malayalam, it
>>>>>>> would be clearer to stem 'thozhilalikalude' to 'thozhilal' (without the
>>>>>>> trailing 'i') in English. Hence IMO ??????? would be a better base word
>>>>>>> than ????????. But the examples you provided in the previous mail
>>>>>>> [given below] would hold.
>>>>>>>
>>>>>>> [Examples from previous mail]
>>>>>>>
>>>>>>>
>>>>>>>> stem(??????) = ???
>>>>>>>> stem(??????????) = ??????
>>>>>>>> stem(??????????????) = ???????????
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 4, 2014 at 10:12 AM, Navaneeth K N <address@hidden> wrote:
>>>>>>>
>>>>>> Hello Kevin,
>>>>>>
>>>>>> Good to see that you are making progress.
>>>>>>
>>>>>> On 3/3/14 12:58 PM, Kevin Martin wrote:
>>>>>>>>>>> No it is prefixes. For example, when the word ?????? is learned,
>>>>>>>>>>> varnam learns the prefixes, ??, ???? etc. So when it gets a pattern
>>>>>>>>>>> like "malayali", it can easily tokenize it rather than typing like
>>>>>>>>>>> "malayaali".
>>>>>>>>>>
>>>>>>>>>> 1. What do you mean by tokenization? A token is a pattern to symbol
>>>>>>>>>> mapping. So tokenization means matching the entire word to its
>>>>>>>>>> malayalam symbol?
>>>>>>
>>>>>> Tokenization is splitting the input into multiple tokens. For example:
>>>>>>
>>>>>> input - malayalam
>>>>>> tokens - [[ma], [la], [ya], [lam]]
>>>>>>
>>>>>> Each will be a `vtoken` instance with the relevant attributes set. The
>>>>>> token `ma`, for instance, will be marked as a consonant.
>>>>>>
>>>>>> Tokenization happens left to right. It is a greedy tokenizer which finds
>>>>>> the longest possible match. Look at the `vst_tokenize` function to learn
>>>>>> how it works.
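
The greedy, left-to-right, longest-match idea quoted above can be
sketched in C as follows. This is a simplified illustration and not
`vst_tokenize` itself: the pattern list is a hypothetical hard-coded
table (the real symbols live in the VST file), tokens are reported only
as lengths, and treating an unmatched character as a one-character token
is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol table; the real one is stored in the VST file. */
static const char *const PATTERNS[] = {
    "m", "ma", "l", "la", "lam", "y", "ya", "a"
};
#define NPATTERNS (sizeof PATTERNS / sizeof PATTERNS[0])

/* Length of the longest pattern that is a prefix of `input`, 0 if none. */
static size_t longest_match(const char *input)
{
    size_t i, best = 0;

    for (i = 0; i < NPATTERNS; i++) {
        size_t len = strlen(PATTERNS[i]);
        if (len > best && strncmp(input, PATTERNS[i], len) == 0)
            best = len;
    }
    return best;
}

/*
 * Splits `input` left to right, always taking the longest match.
 * Writes the token lengths to `lens` (at most `max` of them) and returns
 * the token count, or -1 if `lens` is too small.
 */
static int tokenize(const char *input, size_t *lens, size_t max)
{
    size_t n = 0;

    while (*input != '\0') {
        size_t len = longest_match(input);
        if (len == 0)
            len = 1; /* assumption: unknown character is its own token */
        if (n == max)
            return -1;
        lens[n++] = len;
        input += len;
    }
    return (int)n;
}
```

For "malayalam" this yields four tokens of lengths 2, 2, 2 and 3, that is
[ma], [la], [ya], [lam]. The last token shows the greedy choice: at "lam"
both `la` and `lam` match, and the longer one wins.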
>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2. The porter stemmer stems the given English word to a base word by
>>>>>>>>>> stripping off all its suffixes. How can we stem a malayalam word?
>>>>>>>>>> Suppose that varnam is encountering the word ?????? for the first
>>>>>>>>>> time. The input was 'malayalam'. In this case, as of now, varnam
>>>>>>>>>> learns to map 'mala' to ??, 'malaya' to ???? and so on? Hence
>>>>>>>>>> learning a word makes varnam learn the mappings for all its
>>>>>>>>>> prefixes, right?
>>>>>>
>>>>>> Something like the following:
>>>>>>
>>>>>> stem(??????) = ???
>>>>>> stem(??????????) = ??????
>>>>>> stem(??????????????) = ???????????
>>>>>>
>>>>>>
>>>>>>>>>>
>>>>>>>>>> 3. Let me propose a stemmer that rips off suffixes. Consider the
>>>>>>>>>> word ?????? (malayalam) that was learned by varnam.
>>>>>>>>>> I think the goal of the stemmer should be to get the base word ?????
>>>>>>>>>> (malayal) rather than ????. To do this, I think we will need to
>>>>>>>>>> compare the obtained base word with the original word. Let us assume
>>>>>>>>>> that the stemming algorithm got the base word 'malayal' from
>>>>>>>>>> 'malayalam'. We can make sure that this is mapped to ????? rather
>>>>>>>>>> than ???? by ripping off the equivalent suffix from the malayalam
>>>>>>>>>> transliteration word. That is,
>>>>>>>>>>
>>>>>>>>>> removing the suffix 'am' from 'malayalam' removes the ? from
>>>>>>>>>> '??????'. For this, 'am' should have been matched with ? in the
>>>>>>>>>> scheme file. Hence we would get ????? for 'malayal' and this can be
>>>>>>>>>> learned. This would result in the easier mapping of malayali to
>>>>>>>>>> ?????? .
>>>>>>>>>>
>>>>>>>>>> Another example :
>>>>>>>>>>
>>>>>>>>>> thozhilalikalude is ?????????????
>>>>>>>>>>
>>>>>>>>>> a) sending 'thozhilalikalude' to the stemmer, we obtain
>>>>>>>>>> 'thozhilalikal' in step 1. As a corresponding step  ? ?? is removed
>>>>>>>>>> from ????????????? and results in ??????????. No learning occurs in
>>>>>>>>>> this step because we have not reached the base word yet.
>>>>>>>>>> b) 'thozhilalikal' is stemmed to 'thozhilali' - ?? is removed from
>>>>>>>>>> ??????????. Even though 'kal', the suffix that was removed, could be
>>>>>>>>>> matched to ??, we do not do that because the word before stemming
>>>>>>>>>> had  ?. Produces ???????? .
>>>>>>>>>> c) thozhilali is stemmed to thozhilal - Produces ??????? from
>>>>>>>>>> ????????. This base word and the corresponding malayalam mapping is
>>>>>>>>>> learned by varnam.
>>>>>>>>>>
>>>>>>>>>> I have not completed drafting the malayalam stemmer algorithm. It
>>>>>>>>>> seems to have many more condition checks than I had anticipated and
>>>>>>>>>> could end up being larger and more complicated than the porter
>>>>>>>>>> stemmer. But before I proceed, I need to know whether the logic I
>>>>>>>>>> presented above is correct.
>>>>>>
>>>>>> You are heading in the right direction.
>>>>>>
>>>>>> Stemming in Indian languages is really complex because of the way we
>>>>>> write words. So don't worry about getting 100% stemming. IMO, that is
>>>>>> impossible to achieve. So aim for a set of stemming rules which will
>>>>>> probably give you a success rate of more than 60-70%.
>>>>>>
>>>>>> We should make these stemming rules configurable in the scheme file. So
>>>>>> in the malayalam scheme file, you define,
>>>>>>
>>>>>>         stem(a) = b
>>>>>>
>>>>>> This gets compiled into the `vst` file, and at runtime `libvarnam`
>>>>>> will read the stemming rule from the `vst` file and apply it to the
>>>>>> target word.
>>>>>>
>>>>>> As part of this, we also need to implement a sort of conjunct rule in
>>>>>> `libvarnam` so that it knows how to combine a base form and a vowel.
>>>>>> Don't worry about this now. We will deal with it later.
>>>>>>
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>>
>>>>>>>>>> Kevin Martin Jose
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 28, 2014 at 7:50 PM, Navaneeth K N <address@hidden> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Kevin,
>>>>>>>>>>
>>>>>>>>>> On 2/28/14 12:43 PM, Kevin Martin wrote:
>>>>>>>>>>>>> I'm seeking to improve varnam's learning capabilities as a GSoC
>>>>>>>>>>>>> project. I've gone through the source code and I have doubts. I
>>>>>>>>>>>>> need to clarify if my line of thinking is right. Please have a
>>>>>>>>>>>>> look :
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Token : A token is an indivisible word. A token is the basic
>>>>>>>>>>>>> building block. 'tokens' is an object (instance? I mean the
>>>>>>>>>>>>> non-OOP equivalent of an object) of the type varray. 'tokens'
>>>>>>>>>>>>> contain all the possible patterns of a token? For example,
>>>>>>>>>>>>> ?????? ????????????? ?????????? ????? would all go under the
>>>>>>>>>>>>> same varray instance 'tokens'?. And each word ( for eg ?????? )
>>>>>>>>>>>>> would occupy a slot at tokens->memory I suppose. Am I right in
>>>>>>>>>>>>> this regard?
>>>>>>>>>>
>>>>>>>>>> No.
>>>>>>>>>>
>>>>>>>>>> In ??????, ? will be a token. `varray` is a generic data structure
>>>>>>>>>> that can hold any elements and grows its storage as required. So
>>>>>>>>>> `tokens->memory` will have the following tokens, ?, ?, ??, ??. Each
>>>>>>>>>> token knows about a pattern and a value.
>>>>>>>>>>
>>>>>>>>>> Look at the scheme file in "schemes/" directory. A token is a
>>>>>>>>>> pattern-value mapping.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) I see the data type 'v_' frequently used. However, I could not
>>>>>>>>>>>>> find its definition! I missed it, of course. Running ctrl+f on a
>>>>>>>>>>>>> few source files did not turn up the definitions. So I thought I
>>>>>>>>>>>>> would simply ask here! I would be really grateful if you can tell
>>>>>>>>>>>>> me where it is defined and why it is defined (what it does)
>>>>>>>>>>
>>>>>>>>>> That's a dirty hack. It's a define, done at [1]. It will get
>>>>>>>>>> replaced with `handle->internal` by the preprocessor. It is just a
>>>>>>>>>> shorthand for `handle->internal`. Not elegant, but I got used to it.
>>>>>>>>>> We will clean it up one day. Sorry for the confusion.
>>>>>>>>>>
>>>>>>>>>> [1]: https://gitorious.org/varnamproject/libvarnam/source/68a17b6e2e5d114d6a606a9a47294917655a167f:util.h#L26
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3) I read the porter stemmer algorithm. The ideas page says
>>>>>>>>>>>>> *"something like a porter stemmer implementation but integrated
>>>>>>>>>>>>> into the varnam framework so that new language support can be
>>>>>>>>>>>>> added easily"*. I really doubt if implementing a porter stemmer
>>>>>>>>>>>>> would make adding new language support any easier. The English
>>>>>>>>>>>>> stemmer is an improvised version of the original porter stemmer.
>>>>>>>>>>>>> A stemming algorithm is specific to a particular language since
>>>>>>>>>>>>> it deals with the suffixes that occur in that language. We need
>>>>>>>>>>>>> a malayalam stemmer, and if we want to add support to say telugu
>>>>>>>>>>>>> one day, we would need a telugu stemmer. We can of course write
>>>>>>>>>>>>> one stemmer and add test cases and suffix condition checks in
>>>>>>>>>>>>> the new language so that tokenization can be done with the same
>>>>>>>>>>>>> function call.
>>>>>>>>>>
>>>>>>>>>> When I said integrated into the framework, I meant making the
>>>>>>>>>> stemmer configurable at the scheme file level. Basically the scheme
>>>>>>>>>> file will have a way to define the stemming. Now when a new language
>>>>>>>>>> is added, there will be a new scheme file and the stemming rules for
>>>>>>>>>> that language go into the appropriate scheme file. All varnam needs
>>>>>>>>>> to know is how to properly evaluate those rules.
>>>>>>>>>>
>>>>>>>>>> I am in the process of writing some documentation explaining the
>>>>>>>>>> scheme file and vst files. I will send it to you once it is done. It
>>>>>>>>>> will make this much easier to understand.
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4) The ideas page says "Today, when a word is learned, varnam
>>>>>>>>>>>>> takes all the possible prefixes into account". Prefixes?
>>>>>>>>>>>>> Shouldn't it be suffixes?
>>>>>>>>>>
>>>>>>>>>> No it is prefixes. For example, when the word ?????? is learned,
>>>>>>>>>> varnam learns the prefixes, ??, ???? etc. So when it gets a pattern
>>>>>>>>>> like "malayali", it can easily tokenize it rather than typing like
>>>>>>>>>> "malayaali".
>>>>>>>>>>
>>>>>>>>>> Suffixes won't help because tokenization is left to right. This is
>>>>>>>>>> where another major improvement could be possible in varnam. If we
>>>>>>>>>> can come up with a tokenization algorithm which takes prefixes,
>>>>>>>>>> suffixes and partial matches into account, then we can literally
>>>>>>>>>> transliterate any word. But it's a hard problem which needs lots of
>>>>>>>>>> research and effort. The effort lies in doing it at the scale at
>>>>>>>>>> which varnam is operating now. Today, every keystroke that you make
>>>>>>>>>> in the varnam editor searches over 7 million patterns to predict the
>>>>>>>>>> result. All this happens in less than a second. Improving
>>>>>>>>>> tokenization while keeping the current performance is a *hard*
>>>>>>>>>> problem.
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me try and coin a malayalam stemmer. I will post what I come
>>>>>>>>>>>>> up with here.
>>>>>>>>>>
>>>>>>>>>> That's great. Feel free to ask any questions. You are already asking
>>>>>>>>>> pretty good questions. Good going.
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kevin Martin Jose

- -- 
Cheers,
Navaneeth
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: GPGTools - https://gpgtools.org

iQEcBAEBCgAGBQJTHUWlAAoJEHFACYSL7h6klwkH/0Wcc3zwUbgnSBw/SEDg/wDh
rZNRr4JR0hHuMbfI+U+J+oulTzxX46HnxZyq7q6s7TeJt9J5CV5Uf+eUZxHiYv8h
Zbq6NwDqjkXCxI3tMQIEHr+Ixq+fUIUdRv1ZgEnjHZEqY5o6FDHUi2iEayht67EV
zEzk+sY/hBnS8JssvEFY9EI/MhpzPbgCoXKTtJn8p0RZc9gnehJ2qJyQDhlMDTwq
6dbaPSDVURMOEL8yYMBPy8eQ/f2Q5WteHgE6a5vx1ylcV8XLvdNEoN8J90Wqe6p5
82XxbVFj+k3N8gZlUoN2vTFGTI8ctWefRvrz855w+x5tOIdA+EMx4EyY7xFquA0=
=f5sD
-----END PGP SIGNATURE-----


