Re: [Varnamproject-discuss] Improving Varnam Learning


From: Kevin Martin
Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
Date: Sun, 9 Mar 2014 16:54:28 +0530

I went through the bug tracker. The issue with tokenization has been fixed, but the ticket has not been closed yet. Are there any simple bugs I can fix before the GSoC application window closes?


On Sun, Mar 9, 2014 at 1:56 PM, Navaneeth K N <address@hidden> wrote:

Hello Kevin,

Thanks for the stemming rules. I didn't get time to review them completely,
but they look good so far.

On 3/9/14 12:10 PM, Kevin Martin wrote:
> A doubt regarding the vst file: it is an SQLite3 database file, right? I
> could not open it with 'sqlite3 ml.vst'. And the scheme file will be
> compiled into a .vst file?

Yes, it is an SQLite file. You should be able to open it with the sqlite3
command-line utility. A scheme file can be compiled into a VST file using
`varnamc`:

        varnamc --compile schemes/your_scheme_file
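
By the way, if the sqlite3 shell still refuses to open the file, a quick
sanity check is to open it through the SQLite C API and list the tables in
it. This is only a sketch (the file name ml.vst is assumed, and the table
names you see will depend on the libvarnam version):

    /* vstcheck.c - confirm that a .vst file is an ordinary SQLite database.
       build (needs the sqlite3 development package):
           cc vstcheck.c -lsqlite3 -o vstcheck                              */
    #include <stdio.h>
    #include <sqlite3.h>

    static int print_name(void *unused, int argc, char **argv, char **cols)
    {
        (void) unused; (void) argc; (void) cols;
        printf("%s\n", argv[0] ? argv[0] : "NULL");
        return 0;
    }

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("ml.vst", &db) != SQLITE_OK) {
            fprintf(stderr, "could not open: %s\n", sqlite3_errmsg(db));
            return 1;
        }
        /* list the tables; the symbol data lives in ordinary SQL tables */
        sqlite3_exec(db, "SELECT name FROM sqlite_master WHERE type='table'",
                     print_name, NULL, &err);
        if (err) { fprintf(stderr, "%s\n", err); sqlite3_free(err); }
        sqlite3_close(db);
        return 0;
    }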

For the following points, I will send you a detailed email later today.

> I would like to try out my stemming rules (check attachments). Here's how I
> assume I should proceed:
>
> 1. Write the stemming rules into the scheme file.
>
> 2. Compile the scheme file. For this, stem(pattern) should match a
> corresponding function right? Where should I specify that function? Which
> file specifies how the scheme file should be compiled?
>
> 3. For testing, I'd like to input a word, have it transliterated (using
> varnam_transliterate), and THEN stemmed. This stemmed word is displayed and
> then passed to varnam_train so that this particular pattern is always matched
> to that word. And what is the difference between varnam_learn and
> varnam_train?
>
>
> Also, I'm drafting a definitions_list - a file containing the locations of
> the definitions of structures/functions and what they do. It will not be
> proper documentation though. I have attached a sample with this mail. It's
> really helping me, because every 15 minutes I forget where a particular
> structure/function was defined and then start searching through all the
> source files. If you'd like to have this list, I'll finish it soon and
> submit a PR.
>
> Thank you for your time,
>
> Kevin Martin Jose
>
>
>
> On Sat, Mar 8, 2014 at 4:57 PM, Kevin Martin <address@hidden> wrote:
>
>> I have drafted a set of stemming rules. The file is attached with this
>> post. Please go through it.
>>
>> You were right in that it is impossible to achieve 100% stemming. I took a
>> malayalam paragraph and tried stemming the words. The main problem is that
>> in malayalam many words are compounded together and are thus difficult to
>> segregate. Also, the stemming rules I have provided do not specify any
>> particular order; they will have to be applied in a specific order to stem
>> a given word. The English stemmer could do it without recursion, and I
>> think the malayalam stemmer could too - with the right ordering.
>>
>> There's a number assigned to each rule - the line number. So rule 3 refers
>> to the statement written on line 3. I have tried to provide examples
>> wherever it seemed necessary.
>>
>>
>> On Fri, Mar 7, 2014 at 10:48 PM, Kevin Martin Jose <address@hidden> wrote:
>>
>>>   Thanks a lot.
>>>  ------------------------------
>>> From: Navaneeth K N <address@hidden>
>>> Sent: 07-03-2014 09:56
>>> To: address@hidden
>>> Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
>>>
> Hello Kevin,
>
> On 3/5/14 12:12 AM, Kevin Martin wrote:
>>>>> I went through the vst_tokenize() function. To my disappointment,
>>>>> understanding it was not as easy as I thought. I wrestled with the code
>>>>> for a few hours and decided that I need to assimilate a few key concepts
>>>>> before I can understand what vst_tokenize does.
>>>>>
>>>>> 1. What is a vpool? Why is it needed? I read its definition but I do not
>>>>> understand its purpose or how it is used. Is it a pool of free varrays?
>>>>> To be more specific, I would like to know the purpose of elements like
>>>>> v_->strings_pool. What does the function get_pooled_string() return?
>
> It is object pooling, a technique to reuse already allocated objects rather
> than keep reallocating them. This improves performance over time, because
> the pool is destroyed only when the handle is destroyed, and you will
> usually have the handle around for the whole lifetime of the application.
>
> get_pooled_string() returns a strbuf, a dynamically growing string type.
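>
> To give a feel for the pooling idea, here is a tiny sketch (only an
> illustration of the technique with made-up names, not the actual pool code
> inside libvarnam):
>
>     #include <stdlib.h>
>
>     /* a small pool of growable string buffers, handed out round-robin;
>        the real pool tracks which entries are free and grows on demand */
>     typedef struct { char *buf; size_t len, cap; } sbuf;
>     typedef struct { sbuf items[16]; int next; } sbuf_pool;
>
>     static sbuf *pool_get_string(sbuf_pool *p)
>     {
>         sbuf *s = &p->items[p->next++ % 16];
>         s->len = 0;        /* reset for reuse; keep buf and cap as they are */
>         return s;          /* callers never free; the pool owns the memory  */
>     }
>
>     /* everything is released only when the handle itself goes away */
>     static void pool_destroy(sbuf_pool *p)
>     {
>         int i;
>         for (i = 0; i < 16; i++)
>             free(p->items[i].buf);
>     }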
>
>>>>>
>>>>> 2. What is a vcache_entry? What is the purpose of the strbuf 'cache_key'
>>>>> in vst_tokenize()? What are the contents of a vcache?
>
> Vcache is a hashtable. This is another optimization technique, used to reuse
> already tokenized words. For example, when "malayalam" is transliterated,
> tokenization happens and the cache gets filled with the tokens. When the
> same word is transliterated again, tokenization will just use the cache and
> won't touch the disk at all. This improves performance dramatically.
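>
> Conceptually the cache just sits in front of the expensive path, something
> like this sketch (a toy fixed-size table with made-up names; the real
> vcache is a proper hashtable kept inside the handle):
>
>     #include <stdio.h>
>     #include <string.h>
>
>     #define SLOTS 64
>     static char keys[SLOTS][64];
>     static char vals[SLOTS][256];
>     static int  used;
>
>     /* stands in for the real work: tokenizing by querying the vst on disk */
>     static const char *tokenize_slow(const char *word)
>     {
>         static char out[256];
>         snprintf(out, sizeof out, "[tokens of %s]", word);
>         return out;
>     }
>
>     const char *tokenize_cached(const char *word)
>     {
>         int i;
>         const char *t;
>         for (i = 0; i < used; i++)
>             if (strcmp(keys[i], word) == 0)
>                 return vals[i];            /* hit: no disk access at all */
>         t = tokenize_slow(word);           /* miss: pay the cost once    */
>         if (used < SLOTS) {
>             snprintf(keys[used], sizeof keys[used], "%s", word);
>             snprintf(vals[used], sizeof vals[used], "%s", t);
>             used++;
>         }
>         return t;
>     }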
>
>>>>>
>>>>> 3. What is the purpose of int tokenize_using and int match_type, the
>>>>> parameters of vst_tokenize()?
>
> Tokenization can be of two types: pattern tokenization and value
> tokenization. Pattern tokenization is for the words you send in for
> transliteration; value tokenization works on the Indic text.
>
> tokenize ("malayalam") = pattern tokenization
> tokenize ("മലയാളം") = value tokenization
>
> To understand how tokenization works, you can use the `print-tokens` tool
> available in the `tools` directory. It is not compiled by default; you need
> to pass `-DBUILD_TOOLS=true` when running `cmake .` to get it built.
>
>>>>>
>>>>> 4. Assume that a malayalam stemmer ml_stemmer() has been implemented.
>>>>> Will it replace vst_tokenize(), or will the line
>>>>>
>>>>>                           base = ml_stemmer(input)
>>>>>
>>>>> be inside the vst_tokenize() function? The answer to this question must
>>>>> be pretty straightforward, but I cannot see it since I do not understand
>>>>> vst_tokenize() yet.
>
> The stemmer won't have any connection to tokenization. It will be part of
> the learning subsystem, so the `varnam_learn()` function will use it.
>
> Also, the stemmer has to be configurable for each language. You need to add
> a new function to the scheme file compiler so that you can write something
> like the following in each scheme file:
>
> stem ("????", "?")
>
> This rule needs to be compiled into the `vst` file, and during learning it
> should be used to do the stemming.
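>
> In other words, learning would become a small loop, roughly like the sketch
> below (apply_stem_rules() here is a hypothetical stand-in for "apply one
> stem rule read from the vst"; varnam_learn() is the existing API, and I am
> assuming the usual varnam.h header):
>
>     #include <stdio.h>
>     #include <string.h>
>     #include <varnam.h>      /* assuming the usual libvarnam header */
>
>     /* hypothetical stand-in for applying one stem rule from the vst:
>        here it just strips a hard-coded 'am' suffix, for illustration */
>     static int apply_stem_rules(varnam *handle, char *word)
>     {
>         size_t len = strlen(word);
>         (void) handle;
>         if (len > 2 && strcmp(word + len - 2, "am") == 0) {
>             word[len - 2] = '\0';
>             return 1;
>         }
>         return 0;
>     }
>
>     int learn_with_stemming(varnam *handle, const char *word)
>     {
>         char base[256];
>         snprintf(base, sizeof base, "%s", word);
>
>         varnam_learn(handle, base);           /* the full surface form   */
>         while (apply_stem_rules(handle, base))
>             varnam_learn(handle, base);       /* every stemmed base form */
>         return 0;
>     }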
>
> We may also need to fix how varnam combines two tokens. Currently, when a
> consonant and a vowel come together, varnam will render the consonant-vowel
> form. But this is very basic and won't work in some cases where chillu
> letters are involved. I will think about this and draft the idea.
>
>
>>>>>
>>>>>
>>>>> On Tue, Mar 4, 2014 at 9:56 PM, Kevin Martin <address@hidden> wrote:
>>>>>
>>>>>> Thank you. I have a much better idea now. Another clarification needed:
>>>>>>
>>>>>> stem(?????????????) = ???????? or ????????
>>>>>>
>>>>>> Even though stemming it to ???????? makes more sense in malayalam, it
>>>>>> would be clearer to stem 'thozhilalikalude' to 'thozhilal' (without the
>>>>>> trailing 'i') in English. Hence, IMO, ??????? would be a better base word
>>>>>> than ????????. But the examples you provided in the previous mail [given
>>>>>> below] would still hold.
>>>>>>
>>>>>> [Examples from previous mail]
>>>>>>
>>>>>>
>>>>>>> stem(??????) = ???
>>>>>>> stem(??????????) = ??????
>>>>>>> stem(??????????????) = ???????????
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 4, 2014 at 10:12 AM, Navaneeth K N <address@hidden> wrote:
>>>>>>
>>>>> Hello Kevin,
>>>>>
>>>>> Good to see that you are making progress.
>>>>>
>>>>> On 3/3/14 12:58 PM, Kevin Martin wrote:
>>>>>>>>>> No, it is prefixes. For example, when the word ?????? is learned,
>>>>>>>>>> varnam learns the prefixes ??, ???? etc. So when it gets a pattern like
>>>>>>>>>> "malayali", it can easily tokenize it rather than requiring you to type
>>>>>>>>>> "malayaali".
>>>>>>>>>
>>>>>>>>> 1. What do you mean by tokenization? A token is a pattern-to-symbol
>>>>>>>>> mapping. So does tokenization mean matching the entire word to its
>>>>>>>>> malayalam symbol?
>>>>>
>>>>> Tokenization means splitting the input into multiple tokens. For example:
>>>>>
>>>>> input - malayalam
>>>>> tokens - [[ma], [la], [ya], [lam]]
>>>>>
>>>>> Each will be a `vtoken` instance with relevant attributes set. For the
>>>>> token `ma`, it will be marked as a consonant.
>>>>>
>>>>> Tokenization happens left to right. It is a greedy tokenizer which finds
>>>>> the longest possible match. Look at the `vst_tokenize` function to learn how it
>>>>> works.
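>>>>>
>>>>> The core of it is roughly the following (a toy sketch with a hard-coded
>>>>> ASCII symbol table; the real code reads the symbols from the vst and also
>>>>> handles match types and the cache):
>>>>>
>>>>>     #include <stdio.h>
>>>>>     #include <string.h>
>>>>>
>>>>>     /* toy symbol table; the real one comes from the vst */
>>>>>     static const char *symbols[] = { "ma", "la", "ya", "lam", "m", "a", "l" };
>>>>>     static const int nsymbols = 7;
>>>>>
>>>>>     /* greedy, left to right: at every position take the longest pattern
>>>>>        that matches, emit it as one token, then continue after it */
>>>>>     static void tokenize(const char *input)
>>>>>     {
>>>>>         size_t pos = 0, len = strlen(input);
>>>>>         while (pos < len) {
>>>>>             size_t best = 0;
>>>>>             int i;
>>>>>             for (i = 0; i < nsymbols; i++) {
>>>>>                 size_t l = strlen(symbols[i]);
>>>>>                 if (l > best && strncmp(input + pos, symbols[i], l) == 0)
>>>>>                     best = l;
>>>>>             }
>>>>>             if (best == 0) best = 1;   /* unknown byte: just skip it */
>>>>>             printf("[%.*s] ", (int) best, input + pos);
>>>>>             pos += best;
>>>>>         }
>>>>>         printf("\n");
>>>>>     }
>>>>>
>>>>>     int main(void)
>>>>>     {
>>>>>         tokenize("malayalam");   /* prints: [ma] [la] [ya] [lam] */
>>>>>         return 0;
>>>>>     }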
>>>>>
>>>>>>>>>
>>>>>>>>> 2. The porter stemmer stems a given English word to a base word by
>>>>>>>>> stripping off all the suffixes. How can we stem a malayalam word?
>>>>>>>>> Suppose that varnam is encountering the word ?????? for the first time.
>>>>>>>>> The input was 'malayalam'. In this case, as of now, varnam learns to map
>>>>>>>>> 'mala' to ??, 'malaya' to ???? and so on? Hence learning a word makes
>>>>>>>>> varnam learn the mappings for all its prefixes, right?
>>>>>
>>>>> Something like the following:
>>>>>
>>>>> stem(??????) = ???
>>>>> stem(??????????) = ??????
>>>>> stem(??????????????) = ???????????
>>>>>
>>>>>
>>>>>>>>>
>>>>>>>>> 3. Let me propose a stemmer that strips off suffixes. Consider the word
>>>>>>>>> ?????? (malayalam) that was learned by varnam.
>>>>>>>>> I think the goal of the stemmer should be to get the base word ?????
>>>>>>>>> (malayal) rather than ????. To do this, I think we will need to compare
>>>>>>>>> the obtained base word with the original word. Let us assume that the
>>>>>>>>> stemming algorithm got the base word 'malayal' from 'malayalam'. We can
>>>>>>>>> make sure that this is mapped to ????? rather than ???? by stripping the
>>>>>>>>> equivalent suffix from the malayalam transliteration of the word. That is,
>>>>>>>>>
>>>>>>>>> removing the suffix 'am' from 'malayalam' removes the ? from '??????'. For
>>>>>>>>> this, 'am' should have been matched with ? in the scheme file. Hence we
>>>>>>>>> would get ????? for 'malayal' and this can be learned. This would result
>>>>>>>>> in the easier mapping of 'malayali' to ??????.
>>>>>>>>>
>>>>>>>>> Another example:
>>>>>>>>>
>>>>>>>>> thozhilalikalude is ?????????????
>>>>>>>>>
>>>>>>>>> a) Sending 'thozhilalikalude' to the stemmer, we obtain 'thozhilalikal'
>>>>>>>>> in step 1. As a corresponding step, ? ?? is removed from ?????????????,
>>>>>>>>> which results in ??????????. No learning occurs in this step because we
>>>>>>>>> have not reached the base word yet.
>>>>>>>>> b) 'thozhilalikal' is stemmed to 'thozhilali' - ?? is removed from
>>>>>>>>> ??????????. Even though 'kal', the suffix that was removed, could be
>>>>>>>>> matched to ??, we do not do that because the word before stemming had ?.
>>>>>>>>> This produces ????????.
>>>>>>>>> c) 'thozhilali' is stemmed to 'thozhilal' - this produces ??????? from
>>>>>>>>> ????????. This base word and the corresponding malayalam mapping is
>>>>>>>>> learned by varnam.
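>>>>>>>>>
>>>>>>>>> In outline, what I have in mind is something like the sketch below (ASCII
>>>>>>>>> on the pattern side, a made-up suffix table, and special cases such as
>>>>>>>>> the chillu in step b left out; the value side would do the same removal
>>>>>>>>> on the malayalam string):
>>>>>>>>>
>>>>>>>>>     #include <stdio.h>
>>>>>>>>>     #include <string.h>
>>>>>>>>>
>>>>>>>>>     /* each rule strips a pattern suffix and its equivalent value suffix */
>>>>>>>>>     struct rule { const char *pat_suffix; const char *val_suffix; };
>>>>>>>>>     static const struct rule rules[] = {
>>>>>>>>>         { "ude", "<value suffix for ude>" },  /* value side illustrative */
>>>>>>>>>         { "kal", "<value suffix for kal>" },
>>>>>>>>>         { "i",   "<value suffix for i>"   },
>>>>>>>>>     };
>>>>>>>>>
>>>>>>>>>     static int strip(char *s, const char *suffix)
>>>>>>>>>     {
>>>>>>>>>         size_t sl = strlen(s), xl = strlen(suffix);
>>>>>>>>>         if (sl <= xl || strcmp(s + sl - xl, suffix) != 0)
>>>>>>>>>             return 0;
>>>>>>>>>         s[sl - xl] = '\0';
>>>>>>>>>         return 1;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     int main(void)
>>>>>>>>>     {
>>>>>>>>>         char pattern[64] = "thozhilalikalude";
>>>>>>>>>         char value[128]  = "";   /* the malayalam form would go here */
>>>>>>>>>         size_t i;
>>>>>>>>>         for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
>>>>>>>>>             if (strip(pattern, rules[i].pat_suffix)) {
>>>>>>>>>                 strip(value, rules[i].val_suffix);  /* keep both in sync */
>>>>>>>>>                 printf("step %d: %s\n", (int) i + 1, pattern);
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>         /* only the final base form would be handed to varnam to learn */
>>>>>>>>>         return 0;
>>>>>>>>>     }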
>>>>>>>>>
>>>>>>>>> I have not completed drafting the malayalam stemmer algorithm. It seems
>>>>>>>>> to have many more condition checks than I had anticipated and could end
>>>>>>>>> up being larger and more complicated than the porter stemmer. But before
>>>>>>>>> I proceed, I need to know whether the logic I presented above is correct.
>>>>>
>>>>> You are headed in the right direction.
>>>>>
>>>>> Stemming in Indian languages is really complex because of the way we
>>>>> write words. So don't worry about getting 100% stemming; IMO, that is
>>>>> impossible to achieve. Instead, aim for stemming rules which will probably
>>>>> give you a 60-70% success rate.
>>>>>
>>>>> We should make these stemming rules configurable in the scheme file. So
>>>>> in the malayalam scheme file, you define:
>>>>>
>>>>>         stem(a) = b
>>>>>
>>>>> This gets compiled into the `vst` file, and at runtime `libvarnam` will
>>>>> read the stemming rule from the `vst` file and apply it to the target
>>>>> word.
>>>>>
>>>>> As part of this, we also need to implement a sort of conjunct rule in
>>>>> `libvarnam` so that it knows how to combine a base form and a vowel. Don't
>>>>> worry about this now. We will deal with it later.
>>>>>
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>>
>>>>>>>>> Kevin Martin Jose
>>>>>>>>>
>>>>>>>>> On Fri, Feb 28, 2014 at 7:50 PM, Navaneeth K N <address@hidden> wrote:
>>>>>>>>>
>>>>>>>>> Hello Kevin,
>>>>>>>>>
>>>>>>>>> On 2/28/14 12:43 PM, Kevin Martin wrote:
>>>>>>>>>>>> I'm seeking to improve varnam's learning capabilities as a GSoC
>>>>>>>>>>>> project. I've gone through the source code and I have some doubts. I
>>>>>>>>>>>> need to clarify whether my line of thinking is right. Please have a look:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Token: A token is an indivisible word. A token is the basic building
>>>>>>>>>>>> block. 'tokens' is an object (instance? I mean the non-OOP equivalent of
>>>>>>>>>>>> an object) of the type varray. 'tokens' contains all the possible
>>>>>>>>>>>> patterns of a token? For example, ??????, ?????????????, ??????????,
>>>>>>>>>>>> ????? would all go under the same varray instance 'tokens'? And each
>>>>>>>>>>>> word (for eg ??????) would occupy a slot at tokens->memory, I suppose.
>>>>>>>>>>>> Am I right in this regard?
>>>>>>>>>
>>>>>>>>> No.
>>>>>>>>>
>>>>>>>>> In ??????, ? will be a token. `varray` is a generic data structure
>>>>>>>>> that can hold any elements and grow its storage as required. So
>>>>>>>>> `tokens->memory` will have the following tokens: ?, ?, ??, ??. Each
>>>>>>>>> token knows about a pattern and a value.
>>>>>>>>>
>>>>>>>>> Look at the scheme file in "schemes/" directory. A token is a
>>>>>>>>> pattern-value mapping.
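>>>>>>>>>
>>>>>>>>> A very simplified picture of the two types (field names trimmed down; the
>>>>>>>>> real structs in libvarnam carry more bookkeeping than this):
>>>>>>>>>
>>>>>>>>>     #include <stddef.h>
>>>>>>>>>
>>>>>>>>>     /* one token: a single pattern -> value mapping from the scheme file */
>>>>>>>>>     typedef struct {
>>>>>>>>>         int  type;           /* vowel, consonant, ...       */
>>>>>>>>>         char pattern[16];    /* e.g. "ma"                   */
>>>>>>>>>         char value[16];      /* the rendered malayalam form */
>>>>>>>>>     } token_sketch;
>>>>>>>>>
>>>>>>>>>     /* varray: a growable array of pointers to any kind of element */
>>>>>>>>>     typedef struct {
>>>>>>>>>         void   **memory;     /* the slots; tokens end up here */
>>>>>>>>>         size_t   used;
>>>>>>>>>         size_t   capacity;
>>>>>>>>>     } varray_sketch;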
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2) I see the data type 'v_' frequently used. However, I could not
>>>>>>>>>>>> find its definition! I missed it, of course. Running ctrl+f on a few
>>>>>>>>>>>> source files did not turn up the definition. So I thought I would
>>>>>>>>>>>> simply ask here! I would be really grateful if you could tell me where
>>>>>>>>>>>> it is defined and why it is defined (what it does).
>>>>>>>>>
>>>>>>>>> That's a dirty hack. It's a #define, done at [1]. It will get replaced
>>>>>>>>> with `handle->internal` by the compiler. It is just a shorthand for
>>>>>>>>> `handle->internal`. Not elegant, but we got used to it. We will clean it
>>>>>>>>> up one day. Sorry for the confusion.
>>>>>>>>>
>>>>>>>>> [1]:
>>>>>>>>> https://gitorious.org/varnamproject/libvarnam/source/68a17b6e2e5d114d6a606a9a47294917655a167f:util.h#L26
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 3) I read the porter stemmer algorithm. The ideas page says *"something
>>>>>>>>>>>> like a porter stemmer implementation but integrated into the varnam
>>>>>>>>>>>> framework so that new language support can be added easily"*. I really
>>>>>>>>>>>> doubt whether implementing a porter stemmer would make adding new
>>>>>>>>>>>> language support any easier. The English stemmer is an improved version
>>>>>>>>>>>> of the original porter stemmer. A stemming algorithm is specific to a
>>>>>>>>>>>> particular language, since it deals with the suffixes that occur in that
>>>>>>>>>>>> language. We need a malayalam stemmer, and if we want to add support
>>>>>>>>>>>> for, say, telugu one day, we would need a telugu stemmer. We can of
>>>>>>>>>>>> course write one stemmer and add test cases and suffix condition checks
>>>>>>>>>>>> in the new language so that tokenization can be done with the same
>>>>>>>>>>>> function call.
>>>>>>>>>
>>>>>>>>> When I said integrated into the framework, I meant making the stemmer
>>>>>>>>> configurable at the scheme file level. Basically, the scheme file will
>>>>>>>>> have a way to define the stemming. Now when a new language is added,
>>>>>>>>> there will be a new scheme file, and the stemming rules for that language
>>>>>>>>> go into the appropriate scheme file. All varnam needs to know is how to
>>>>>>>>> properly evaluate those rules.
>>>>>>>>>
>>>>>>>>> I am in the process of writing some documentation explaining the scheme
>>>>>>>>> file and vst files. I will send it to you once it is done. It will make
>>>>>>>>> this much easier to understand.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 4) The ideas page says "Today, when a word is learned, varnam takes
>>>>>>>>>>>> all the possible prefixes into account". Prefixes? Shouldn't it be
>>>>>>>>>>>> suffixes?
>>>>>>>>>
>>>>>>>>> No, it is prefixes. For example, when the word ?????? is learned, varnam
>>>>>>>>> learns the prefixes ??, ???? etc. So when it gets a pattern like
>>>>>>>>> "malayali", it can easily tokenize it rather than requiring you to type
>>>>>>>>> "malayaali".
>>>>>>>>>
>>>>>>>>> Suffixes won't help because tokenization is left to right. This is where
>>>>>>>>> another major improvement could be possible in varnam. If we can come up
>>>>>>>>> with a tokenization algorithm which takes prefixes, suffixes and partial
>>>>>>>>> matches into account, then we can literally transliterate any word. But
>>>>>>>>> it's a hard problem which needs lots of research and effort. The effort
>>>>>>>>> lies in doing it at the scale at which varnam is operating now. Today,
>>>>>>>>> every keystroke you make in the varnam editor searches over 7 million
>>>>>>>>> patterns to predict the result. All of this happens in less than a
>>>>>>>>> second. Improving tokenization while keeping the current performance is a
>>>>>>>>> *hard* problem.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Let me try and draft a malayalam stemmer. I will post what I come up
>>>>>>>>>>>> with here.
>>>>>>>>>
>>>>>>>>> That's great. Feel free to ask any questions. You are already asking
>>>>>>>>> pretty good questions. Good going.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Kevin Martin Jose

--
Cheers,
Navaneeth


