varnamproject-discuss

Re: [Varnamproject-discuss] Improving Varnam Learning


From: Kevin Martin
Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
Date: Sat, 8 Mar 2014 16:57:29 +0530

I have drafted a set of stemming rules. The file is attached to this post. Please go through it.

You were right that it is impossible to achieve 100% stemming. I took a Malayalam paragraph and tried stemming the words. The main problem is that in Malayalam many words are compounded together and are therefore difficult to separate. Also, the stemming rules I have provided do not specify any order, but they will have to be applied in a specific order to stem a given word. The English stemmer could do it without recursion, and I think the Malayalam stemmer could too, with the right ordering.

Each rule has a number assigned to it: its line number. So rule 3 refers to the statement written on line 3. I have tried to provide examples wherever it seemed necessary.


On Fri, Mar 7, 2014 at 10:48 PM, Kevin Martin Jose <address@hidden> wrote:
Thanks a lot.

From: Navaneeth K N
Sent: 07-03-2014 09:56
To: address@hidden
Subject: Re: [Varnamproject-discuss] Improving Varnam Learning

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hello Kevin,

On 3/5/14 12:12 AM, Kevin Martin wrote:
> I went through the vst_tokenize() function. To my disappointment,
> understanding it was not as easy as I thought. I wrestled for a few hours
> with code and decided that I need to assimilate a few key concepts before I
> can understand what vst_tokenize does.
>
> 1. What is a vpool? Why is it needed? I read its definition but I do not
> understand its purpose or how it is used. Is it a pool of free varrays?
>    To be more specific, I would like to know the purpose of elements like
> v_->strings_pool. What does the function get_pooled_string()
>    return?

It is object pooling, a technique to reuse already allocated objects
rather than allocating them again and again. This improves performance
over time because the pool is destroyed only when the handle is
destroyed, and the handle is usually available for the whole lifetime
of the application.

get_pooled_string() returns a strbuf, a dynamically growing string type.
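
To make the pooling idea concrete, here is a minimal sketch in C. It is
not libvarnam's actual code: `sketch_buf`, `sketch_pool`, `pool_get`,
`buf_append`, `pool_reset` and `pool_destroy` are hypothetical names made
up for illustration, and the real vpool and strbuf differ in detail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string buffer (not libvarnam's real strbuf). */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} sketch_buf;

/* Hypothetical pool: hands out buffers and recycles them after a reset. */
typedef struct {
    sketch_buf **items;
    size_t count;   /* buffers allocated so far */
    size_t next;    /* index of the next free buffer */
} sketch_pool;

/* Return a cleared buffer, allocating a new one only when none is free. */
sketch_buf *pool_get(sketch_pool *p)
{
    assert(p != NULL);
    if (p->next == p->count) {
        sketch_buf **grown = realloc(p->items, (p->count + 1) * sizeof *grown);
        if (grown == NULL) return NULL;
        p->items = grown;
        sketch_buf *fresh = calloc(1, sizeof *fresh);
        if (fresh == NULL) return NULL;
        p->items[p->count++] = fresh;
    }
    sketch_buf *b = p->items[p->next++];
    b->len = 0;                       /* keep the storage, drop the contents */
    if (b->data != NULL) b->data[0] = '\0';
    return b;
}

/* Append s, growing the buffer when needed. Returns 0 on success. */
int buf_append(sketch_buf *b, const char *s)
{
    size_t n = strlen(s);
    if (b->len + n + 1 > b->cap) {
        size_t cap = (b->len + n + 1) * 2;
        char *d = realloc(b->data, cap);
        if (d == NULL) return -1;
        b->data = d;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n + 1);
    b->len += n;
    return 0;
}

/* Mark every buffer as free again without releasing any memory. */
void pool_reset(sketch_pool *p) { p->next = 0; }

/* Release everything; this corresponds to destroying the handle. */
void pool_destroy(sketch_pool *p)
{
    for (size_t i = 0; i < p->count; i++) {
        free(p->items[i]->data);
        free(p->items[i]);
    }
    free(p->items);
    p->items = NULL;
    p->count = p->next = 0;
}
```

In this sketch a caller would start from a zero-initialized `sketch_pool`,
call `pool_get()` for each temporary string, and call `pool_reset()`
between operations, so later calls get the same buffers back instead of
fresh allocations.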

>
> 2.What is a vcache_entry? What is the purpose of the strbuf 'cache_key' in
> vst_tokenize()? What are the contents of a vcache?

Vcache is a hashtable. This is another optimization technique, used to
reuse already tokenized words. For example, when "malayalam" is
transliterated, tokenization happens and the cache gets filled with
tokens. When it is transliterated again, tokenization just uses the
cache and does not touch the disk. This improves performance
dramatically.
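
The caching idea can be sketched as a small string-keyed hashtable with
a "look up, else compute and store" helper. This is only an illustration
under assumptions: `sketch_cache`, `cache_entry`, `cache_find`,
`cache_get_or_compute`, `cache_free` and `fake_tokenize` are hypothetical
names, and the cached value here is a plain string standing in for the
token list that the real vcache holds.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 64

/* Hypothetical cache entry: key is the input word, value stands in for
 * the token list that real tokenization would produce. */
typedef struct cache_entry {
    char *key;
    char *value;
    struct cache_entry *next;   /* chain for keys that share a slot */
} cache_entry;

typedef struct {
    cache_entry *slots[CACHE_SLOTS];
    int misses;                 /* how often the slow path had to run */
} sketch_cache;

/* Stand-in for the expensive tokenization; it only echoes its input. */
const char *fake_tokenize(const char *word) { return word; }

static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *d = malloc(n);
    if (d != NULL) memcpy(d, s, n);
    return d;
}

/* djb2 string hash, reduced to a slot index. */
static size_t slot_for(const char *key)
{
    unsigned long h = 5381;
    while (*key) h = h * 33 + (unsigned char)*key++;
    return (size_t)(h % CACHE_SLOTS);
}

const char *cache_find(const sketch_cache *c, const char *key)
{
    for (const cache_entry *e = c->slots[slot_for(key)]; e; e = e->next)
        if (strcmp(e->key, key) == 0) return e->value;
    return NULL;
}

/* Return the cached value, computing and storing it on the first request. */
const char *cache_get_or_compute(sketch_cache *c, const char *key,
                                 const char *(*compute)(const char *))
{
    assert(c != NULL && key != NULL && compute != NULL);
    const char *hit = cache_find(c, key);
    if (hit != NULL) return hit;

    c->misses++;
    const char *fresh = compute(key);
    cache_entry *e = malloc(sizeof *e);
    if (fresh == NULL || e == NULL) { free(e); return NULL; }
    e->key = copy_string(key);
    e->value = copy_string(fresh);
    if (e->key == NULL || e->value == NULL) {
        free(e->key); free(e->value); free(e);
        return NULL;
    }
    size_t slot = slot_for(key);
    e->next = c->slots[slot];
    c->slots[slot] = e;
    return e->value;
}

void cache_free(sketch_cache *c)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        cache_entry *e = c->slots[i];
        while (e != NULL) {
            cache_entry *next = e->next;
            free(e->key); free(e->value); free(e);
            e = next;
        }
        c->slots[i] = NULL;
    }
}
```

Asking for "malayalam" twice runs the slow path only once in this
sketch; the second request is answered from the table.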

>
> 3. What is the purpose of int tokenize_using and int match_type, the
> parameters of vst_tokenize()?

Tokenization can be of two types: pattern tokenization and value
tokenization. Pattern tokenization tokenizes the words that you send
for transliteration. Value tokenization works on the Indic text.

tokenize ("malayalam") = pattern tokenization
tokenize ("മലയാളം") = value tokenization

To understand how tokenization works, you can use the `print-tokens`
tool available in the `tools` directory. It is not compiled by default.
You need to pass `-DBUILD_TOOLS=true` when running `cmake .` to get it
compiled.
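
As a rough illustration of pattern tokenization, the sketch below splits
an input greedily from left to right, always taking the longest pattern
that matches, which is how tokenization is described elsewhere in this
thread. The pattern table and the names `PATTERNS`, `longest_match` and
`tokenize_greedy` are assumptions made for illustration only; the real
`vst_tokenize()` works against the compiled scheme data and produces
`vtoken` instances rather than a '|'-separated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pattern table; a real scheme file defines many more. */
static const char *const PATTERNS[] = {
    "m", "l", "y", "a", "ma", "la", "ya", "lam"
};
#define PATTERN_COUNT (sizeof PATTERNS / sizeof PATTERNS[0])

/* Length of the longest pattern that is a prefix of s, or 0 if none. */
size_t longest_match(const char *s)
{
    size_t best = 0;
    for (size_t i = 0; i < PATTERN_COUNT; i++) {
        size_t n = strlen(PATTERNS[i]);
        if (n > best && strncmp(s, PATTERNS[i], n) == 0) best = n;
    }
    return best;
}

/* Split input left to right and write the tokens, separated by '|',
 * into out. Bytes that match no pattern become one-byte tokens.
 * Returns the number of tokens, or -1 if out is too small. */
int tokenize_greedy(const char *input, char *out, size_t out_size)
{
    size_t used = 0;
    int count = 0;

    assert(input != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    while (*input != '\0') {
        size_t n = longest_match(input);
        if (n == 0) n = 1;                    /* no pattern: pass one byte */
        size_t need = n + (count > 0 ? 1 : 0);
        if (used + need + 1 > out_size) return -1;
        if (count > 0) out[used++] = '|';
        memcpy(out + used, input, n);
        used += n;
        out[used] = '\0';
        input += n;
        count++;
    }
    return count;
}
```

With this table, "malayalam" comes out as `ma|la|ya|lam` (4 tokens).
The last step shows the greedy choice: at "lam" both "l" and "la" match,
but the longer "lam" wins.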

>
> 4. Assume that a malayalam stemmer ml_stemmer() has been implemented. Will
> it replace vst_tokenize() or will the line :
>
>                           base=ml_stemmer(input)
>
>     be inside the vst_tokenize() function? The answer to this question must
> be pretty straight forward but I cannot see it since I do not
>     understand vst_tokenize() yet.

The stemmer won't have any connection to tokenization. It will be part
of the learning subsystem, so the `varnam_learn()` function will use it.

Also, the stemmer has to be configurable for each language. You need to
add a new function to the scheme file compiler so that you can do
something like the following in each scheme file.

stem ("ളുടെ", "ൾ")

This rule needs to be compiled into the `vst` file, and it should be
used during learning to do the stemming.
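
A minimal sketch of how such suffix rules could be applied at learn time
is given below. It assumes each rule is a (suffix, replacement) pair and
that rules are tried in table order with the first match winning; the
names `stem_rule` and `apply_stem_rules` are hypothetical and nothing
like this exists in libvarnam yet. Matching is done on UTF-8 bytes, which
is safe for whole-character suffixes because a UTF-8 lead byte never
matches a continuation byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One stemming rule: replace a trailing `suffix` with `replacement`. */
typedef struct {
    const char *suffix;
    const char *replacement;
} stem_rule;

/* Apply the first rule, in table order, whose suffix ends `word`.
 * The result is written to out. Returns 1 if a rule fired, 0 if the
 * word was copied unchanged, and -1 if out is too small. */
int apply_stem_rules(const stem_rule *rules, size_t rule_count,
                     const char *word, char *out, size_t out_size)
{
    assert(word != NULL && out != NULL);
    assert(rules != NULL || rule_count == 0);

    size_t wlen = strlen(word);
    for (size_t i = 0; i < rule_count; i++) {
        size_t slen = strlen(rules[i].suffix);
        if (slen == 0 || slen > wlen) continue;
        if (memcmp(word + wlen - slen, rules[i].suffix, slen) != 0) continue;

        size_t rlen = strlen(rules[i].replacement);
        size_t stem_len = wlen - slen;
        if (stem_len + rlen + 1 > out_size) return -1;
        memcpy(out, word, stem_len);                          /* keep the stem */
        memcpy(out + stem_len, rules[i].replacement, rlen + 1); /* add new ending */
        return 1;
    }

    /* No rule matched: hand the word back unchanged. */
    if (wlen + 1 > out_size) return -1;
    memcpy(out, word, wlen + 1);
    return 0;
}
```

With a one-entry table `{ "ളുടെ", "ൾ" }`, calling `apply_stem_rules()` on
അവളുടെ writes അവൾ to `out` and returns 1. Because the first matching rule
wins, the order of the table matters, so longer or more specific
suffixes would have to come before shorter ones.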

We may also need to fix how varnam combines two tokens. Currently, when
a consonant and a vowel come together, varnam renders the
consonant-vowel form. But this is very basic and won't work for some
cases where chillu letters are involved. I will think about this and
draft the idea.


>
>
> On Tue, Mar 4, 2014 at 9:56 PM, Kevin Martin <address@hidden> wrote:
>
>> Thank you. I have a much better idea now. Another clarification needed :
>>
>> stem(തൊഴിലാളികളുടെ )= തൊഴിലാളി or തൊഴിലാള?
>>
>> Even though stemming it to തൊഴിലാളി makes more sense in malayalam, it
>> would be clearer to stem 'thozhilalikalude' to 'thozhilal' (without the
>> trailing 'i') in English. Hence IMO തൊഴിലാള would be a better base word
>> than തൊഴിലാളി. But the examples you provided in the previous mail [given
>> below] would hold.
>>
>> [Examples from previous mail]
>>
>>
>>> stem(അവളുടെ) = അവൾ
>>> stem(കാറ്റിന്റെ) = കാറ്റ്
>>> stem(ഡോ‍ക്ക്റ്ററുടെ) = ഡോ‍ക്ക്റ്റർ
>>
>>
>>
>>
>> On Tue, Mar 4, 2014 at 10:12 AM, Navaneeth K N <address@hidden> wrote:
>>
> Hello Kevin,
>
> Good to see that you are making progress.
>
> On 3/3/14 12:58 PM, Kevin Martin wrote:
>>>>>> No it is prefixes. For example, when the word മലയാളം is learned, varnam
>>>>>> learns the prefixes, മല, മലയാ etc. So when it gets a pattern like
>>>>>> "malayali", it can easily tokenize it rather than typing like
>>>>>> "malayaali".
>>>>>
>>>>> 1. What do you mean by tokenization? A token is a pattern to symbol
>>>>> mapping. So tokenization means matching the entire word to its
>>>>> malayalam symbol?
>
> Tokenization is splitting the input into multiple tokens. For example:
>
> input - malayalam
> tokens - [[ma], [la], [ya], [lam]]
>
> Each will be a `vtoken` instance with relevant attributes set. For the
> token `ma`, it will be marked as a consonant.
>
> Tokenization happens left to right. It is a greedy tokenizer that finds
> the longest possible match. Look at the `vst_tokenize` function to learn
> how it works.
>
>>>>>
>>>>> 2. The porter stemmer stems the given English word to a base word by
>>>>> stripping it of all its suffixes. How can we stem a malayalam word?
>>>>> Suppose that varnam is encountering the word മലയാളം for the first time.
>>>>> The input was 'malayalam'. In this case, as of now, varnam learns to map
>>>>> 'mala' to മല, 'malaya' to മലയാ and so on? Hence learning a word makes
>>>>> varnam learn the mappings for all its prefixes, right?
>
> Something like the following:
>
> stem(അവളുടെ) = അവൾ
> stem(കാറ്റിന്റെ) = കാറ്റ്
> stem(ഡോ‍ക്ക്റ്ററുടെ) = ഡോ‍ക്ക്റ്റർ
>
>
>>>>>
>>>>> 3. Let me propose a stemmer that rips off suffixes. Consider the word
>>>>> മലയാളം (malayalam) that was learned by varnam.
>>>>> I think the goal of the stemmer should be to get the base word മലയാള
>>>>> (malayal) rather than മലയൽ. To do this, I think we will need to compare
>>>>> the obtained base word with the original word. Let us assume that the
>>>>> stemming algorithm got the base word 'malayal' from 'malayalam'. We can
>>>>> make sure that this is mapped to മലയാള rather than മലയൽ by ripping off
>>>>> the equivalent suffix from the malayalam transliteration of the word.
>>>>> That is,
>>>>>
>>>>> removing the suffix 'am' from 'malayalam' removes the ം from 'മലയാളം'.
>>>>> For this, 'am' needs to have been matched with ം in the scheme file.
>>>>> Hence we would get മലയാള for 'malayal' and this can be learned. This
>>>>> would result in the easier mapping of malayali to മലയാളി.
>>>>>
>>>>> Another example :
>>>>>
>>>>> thozhilalikalude is തൊഴിലാളികളുടെ
>>>>>
>>>>> a) Sending 'thozhilalikalude' to the stemmer, we obtain 'thozhilalikal'
>>>>> in step 1. As a corresponding step, ുടെ is removed from തൊഴിലാളികളുടെ,
>>>>> which results in തൊഴിലാളികള. No learning occurs in this step because we
>>>>> have not reached the base word yet.
>>>>> b) 'thozhilalikal' is stemmed to 'thozhilali': കള is removed from
>>>>> തൊഴിലാളികള. Even though 'kal', the suffix that was removed, could be
>>>>> matched to കൽ, we do not do that because the word before stemming had
>>>>> ള. This produces തൊഴിലാളി.
>>>>> c) 'thozhilali' is stemmed to 'thozhilal', which produces തൊഴിലാള from
>>>>> തൊഴിലാളി. This base word and the corresponding malayalam mapping are
>>>>> learned by varnam.
>>>>>
>>>>> I have not completed drafting the malayalam stemmer algorithm. It seems
>>>>> to have many more condition checks than I had anticipated and could end
>>>>> up being larger and more complicated than the porter stemmer. But before
>>>>> I proceed, I need to know whether the logic I presented above is correct.
>
> You are heading in the right direction.
>
> Stemming in Indian languages is really complex because of the way we
> write words. So don't worry about getting 100% stemming. IMO, that is
> impossible to achieve. So aim for a set of stemming rules that will
> probably give you a success rate of more than 60-70%.
>
> We should make these stemming rules configurable in the scheme file. So
> in the malayalam scheme file, you define,
>
>         stem(a) = b
>
> this gets compiled into the `vst` file and during runtime, `libvarnam`
> will read the stemming rule from the `vst` file and apply it to the
> target word.
>
> As part of this, we also need to implement a sort of conjunct rule in
> `libvarnam` so that it knows how to combine a base form and a vowel.
> Don't worry about this now. We will deal with it later.
>
>>>>>
>>>>> regards,
>>>>>
>>>>> Kevin Martin Jose
>>>>>
>>>>> On Fri, Feb 28, 2014 at 7:50 PM, Navaneeth K N <address@hidden> wrote:
>>>>>
>>>>> Hello Kevin,
>>>>>
>>>>> On 2/28/14 12:43 PM, Kevin Martin wrote:
>>>>>>>> I'm seeking to improve varnam's learning capabilities as a GSoC
>>>>>>>> project. I've gone through the source code and I have doubts. I need
>>>>>>>> to clarify if my line of thinking is right. Please have a look:
>>>>>>>>
>>>>>>>> 1) Token: A token is an indivisible word. A token is the basic
>>>>>>>> building block. 'tokens' is an object (instance? I mean the non-OOP
>>>>>>>> equivalent of an object) of the type varray. 'tokens' contains all
>>>>>>>> the possible patterns of a token? For example, മലയാളം മലയാളത്തിന്റെ
>>>>>>>> മലയാളത്തിൽ മലയാള would all go under the same varray instance
>>>>>>>> 'tokens'? And each word (for example മലയാളം) would occupy a slot at
>>>>>>>> tokens->memory I suppose. Am I right in this regard?
>>>>>
>>>>> No.
>>>>>
>>>>> In മലയാളം, മ will be a token. `varray` is a generic data structure that
>>>>> can hold any elements and grows its storage as required. So
>>>>> `tokens->memory` will have the following tokens: മ, ല, യാ, ളം. Each
>>>>> token knows about a pattern and a value.
>>>>>
>>>>> Look at the scheme file in "schemes/" directory. A token is a
>>>>> pattern-value mapping.
>>>>>
>>>>>
>>>>>>>>
>>>>>>>> 2) I see the data type 'v_' frequently used. However, I could not
>>>>>>>> find its definition! I missed it, of course. Running ctrl+f on a few
>>>>>>>> source files did not turn up the definitions. So I thought I would
>>>>>>>> simply ask here! I would be really grateful if you can tell me where
>>>>>>>> it is defined and why it is defined (what it does)
>>>>>
>>>>> That's a dirty hack. It's a define, done at [1]. It gets replaced with
>>>>> `handle->internal` by the preprocessor. It is just a shorthand for
>>>>> `handle->internal`. Not elegant, but I got used to it. We will clean it
>>>>> up one day. Sorry for the confusion.
>>>>>
>>>>> [1]:
>>>>> https://gitorious.org/varnamproject/libvarnam/source/68a17b6e2e5d114d6a606a9a47294917655a167f:util.h#L26
>>>>>
>>>>>>>>
>>>>>>>> 3) I read the porter stemmer algorithm. The ideas page says
>>>>>>>> *"something like a porter stemmer implementation but integrated into
>>>>>>>> the varnam framework so that new language support can be added
>>>>>>>> easily"*. I really doubt if implementing a porter stemmer would make
>>>>>>>> adding new language support any easier. The English stemmer is an
>>>>>>>> improvised version of the original porter stemmer. A stemming
>>>>>>>> algorithm is specific to a particular language since it deals with
>>>>>>>> the suffixes that occur in that language. We need a malayalam
>>>>>>>> stemmer, and if we want to add support for, say, telugu one day, we
>>>>>>>> would need a telugu stemmer. We can of course write one stemmer and
>>>>>>>> add test cases and suffix condition checks in the new language so
>>>>>>>> that tokenization can be done with the same function call.
>>>>>
>>>>> When I said integrated into the framework, I meant making the stemmer
>>>>> configurable at the scheme file level. Basically the scheme file will
>>>>> have a way to define the stemming. Now when a new language is added,
>>>>> there will be a new scheme file and the stemming rules for that
>>>>> language go into the appropriate scheme file. All varnam needs to know
>>>>> is how to properly evaluate those rules.
>>>>>
>>>>> I am in the process of writing some documentation explaining the scheme
>>>>> file and vst files. I will send it to you once it is done. It will make
>>>>> this much easier to understand.
>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 4) The ideas page says "Today, when a word is learned, varnam takes
>>>>>>>> all the possible prefixes into account". Prefixes? Shouldn't it be
>>>>>>>> suffixes?
>>>>>
>>>>> No, it is prefixes. For example, when the word മലയാളം is learned, varnam
>>>>> learns the prefixes, മല, മലയാ etc. So when it gets a pattern like
>>>>> "malayali", it can easily tokenize it rather than requiring you to type
>>>>> something like "malayaali".
>>>>>
>>>>> Suffixes won't help because tokenization is left to right. This is where
>>>>> another major improvement could be possible in varnam. If we can come up
>>>>> with a tokenization algorithm which takes prefixes, suffixes and partial
>>>>> matches into account, then we can transliterate literally any word. But
>>>>> it's a hard problem which needs lots of research and effort. The effort
>>>>> lies in doing it at the scale at which varnam operates now. Today,
>>>>> every keystroke that you make in the varnam editor searches over 7
>>>>> million patterns to predict the result. All this happens in less than a
>>>>> second. Improving tokenization while keeping the current performance is
>>>>> a *hard* problem.
>>>>>
>>>>>>>>
>>>>>>>> Let me try and coin a malayalam stemmer. I will post what I come up
>>>>>>>> with here.
>>>>>
>>>>> That's great. Feel free to ask any questions. You are already asking
>>>>> pretty good questions. Good going.
>>>>>
>>>>>>>>
>>>>>>>> regards,
>>>>>>>>
>>>>>>>> Kevin Martin Jose
>>>>>>>>
>>>>>

- --
Cheers,
Navaneeth
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: GPGTools - https://gpgtools.org

iQEcBAEBCgAGBQJTGUp4AAoJEHFACYSL7h6k5QwIAKtEmuDtdSa1HdoOnnnR9OY7
RyZPdxxFQgA025L9KwTrQpf+M4IgGO6p68m/OG2EUceDwY+mkBo0fylp/AiUq9hy
zDrY/c/bHTDyvZmBPvss/bet0f3NgFm+HMQtJOCUViSwNE2q1bjyMcQrhUJBuAix
nYKh4ox0WdRq8g5bZGpnVQ9OOPoPIMpJkMBJsD+NNsVGtlr/WJD898KucfUTcG20
+sX4SrGYiCqukb6SNrGAQcBJ/auQ3Un3ny80FVPJHiHwmBauqpR71is0S/aWRlag
puEhoNcACdRqhytBwzW5LbYmnbWj+5nwFt6kFUObghP2X6sWUs5yCCqM/q9IaFw=
=iHAc
-----END PGP SIGNATURE-----


Attachment: stemmer_rules
Description: Binary data

