varnamproject-discuss

Re: [Varnamproject-discuss] Improving Varnam Learning


From: Navaneeth K N
Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
Date: Fri, 28 Feb 2014 19:50:05 +0530
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:24.0) Gecko/20100101 Thunderbird/24.3.0

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hello Kevin,

On 2/28/14 12:43 PM, Kevin Martin wrote:
> I'm seeking to improve varnam's learning capabilities as a GSoC project.
> I've gone through the source code and I have doubts. I need to clarify if
> my line of thinking is right. Please have a look :
> 
> 1) Token : A token is an indivisible word. A token is the basic building
> block. 'tokens' is an object (instance? I mean the non-OOP equivalent of an
> object) of the type varray. 'tokens' contain all the possible patterns of a
> token? For example, മലയാളം മലയാളത്തിന്റെ മലയാളത്തിൽ മലയാള would all go
> under the same varray instance 'tokens'?. And each word ( for eg മലയാളം )
> would occupy a slot at tokens->memory I suppose. Am I right in this regard?

No.

In മലയാളം, മ will be a token. `varray` is a generic data structure that
can hold any kind of element and grows its storage as required. So
`tokens->memory` will hold the following tokens: മ, ല, യാ, ളം. Each
token knows its pattern and its value.

Look at the scheme files in the "schemes/" directory. A token is a
pattern-value mapping.
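
To make the idea concrete, here is a minimal sketch in C of a growable
array of pattern-value tokens. All names (`struct token`,
`struct token_array`, `token_array_push`) are hypothetical and this is
not libvarnam's actual `varray` code, which is generic rather than typed
to tokens. The pattern-value pairs mentioned in the usage note are
illustrative too.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical token: one pattern-value pair, e.g. "ma" -> "മ". */
struct token {
    const char *pattern;  /* what the user types */
    const char *value;    /* what it transliterates to */
};

/* Hypothetical growable array, similar in spirit to varray. */
struct token_array {
    struct token *memory; /* contiguous storage for the elements */
    size_t count;         /* slots in use */
    size_t capacity;      /* slots allocated */
};

/* Append a token, doubling the storage when it is full.
   Returns 0 on success, -1 if allocation fails. */
int token_array_push(struct token_array *arr, struct token t)
{
    assert(arr != NULL);
    if (arr->count == arr->capacity) {
        size_t new_cap = arr->capacity ? arr->capacity * 2 : 4;
        struct token *p = realloc(arr->memory, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        arr->memory = p;
        arr->capacity = new_cap;
    }
    arr->memory[arr->count++] = t;
    return 0;
}
```

Starting from a zero-initialized `struct token_array` and pushing the
four tokens of മലയാളം (for example "ma" -> "മ", "la" -> "ല") would leave
four entries in `memory`, one per token rather than one per word.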


> 
> 2) I see the data type 'v_' frequently used. However, I could not find its
> definition! I missed it, of course. Running ctrl+f on a few source files
> did not turn up the definitions. So I thought I would simply ask here! I
> would be really grateful if you can tell me where it is defined and why it
> is defined (what it does)

That's a dirty hack. It's a #define, done at [1]. The preprocessor
replaces it with `handle->internal`, so it is just a shorthand for
`handle->internal`. Not elegant, but I got used to it. We will clean it
up one day. Sorry for the confusion.

[1]:
https://gitorious.org/varnamproject/libvarnam/source/68a17b6e2e5d114d6a606a9a47294917655a167f:util.h#L26
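
As an illustration of how such a shorthand behaves, here is a small
self-contained sketch. The struct and field names (`struct handle_t`,
`struct internal_state`, `last_error`) are made up for the example, and
the parentheses in the macro are my addition here; see [1] for the
actual definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real structures; names are illustrative. */
struct internal_state {
    int last_error;
};

struct handle_t {
    struct internal_state *internal;
};

/* Shorthand: every use of v_ expands to (handle->internal), so it only
   works where a variable named `handle` is in scope. */
#define v_ (handle->internal)

int get_last_error(struct handle_t *handle)
{
    assert(handle != NULL);
    return v_->last_error;  /* expands to (handle->internal)->last_error */
}
```

This is also why searching for a type named `v_` finds nothing: it is
text substitution, not a data type.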

> 
> 3) I read the porter stemmer algorithm. The ideas page says *"something like
> a porter stemmer implementation but integrated into the varnam framework so
> that new language support can be added easily"*. I really doubt if
> implementing a porter stemmer would make adding new language support any
> easier. The English stemmer is an improvised version of the original porter
> stemmer. A stemming algorithm is specific to a particular language since it
> deals with the suffixes that occur in that language. We need a malayalam
> stemmer, and if we want to add support to say telugu one day, we would need
> a telugu stemmer. We can of course write one stemmer and add test cases and
> suffix condition checks in the new language so that tokenization can be
> done with the same function call.

When I said integrated into the framework, I meant making the stemmer
configurable at the scheme file level. Basically, the scheme file will
have a way to define the stemming rules. When a new language is added,
there will be a new scheme file, and the stemming rules for that
language go into that scheme file. All varnam needs to know is how to
evaluate those rules properly.
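
A rough sketch of what "rules as data" could look like in C. This is
only an assumption about the shape of the design, not existing varnam
code: `struct stem_rule`, `example_rules` and `stem` are hypothetical
names, the two Malayalam rules are illustrative, and a real stemmer
would need conditions beyond plain suffix replacement.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical rule: if a word ends with `suffix`, replace it with
   `replacement`. */
struct stem_rule {
    const char *suffix;
    const char *replacement;
};

/* Illustrative rules only; in the proposed design they would be loaded
   from the language's scheme file instead of being hardcoded. */
const struct stem_rule example_rules[] = {
    { "ത്തിന്റെ", "ം" },  /* മലയാളത്തിന്റെ -> മലയാളം */
    { "ത്തിൽ", "ം" },     /* മലയാളത്തിൽ -> മലയാളം */
};

/* Apply the first matching rule to `word` and write the result to `out`.
   Returns 1 if a rule applied, 0 if the word was copied unchanged,
   -1 if `out` is too small. Matching is byte-wise, which works for UTF-8. */
int stem(const struct stem_rule *rules, size_t rule_count,
         const char *word, char *out, size_t out_size)
{
    size_t word_len = strlen(word);
    assert(rules != NULL || rule_count == 0);
    for (size_t i = 0; i < rule_count; i++) {
        size_t suffix_len = strlen(rules[i].suffix);
        if (suffix_len > word_len)
            continue;
        size_t stem_len = word_len - suffix_len;
        if (memcmp(word + stem_len, rules[i].suffix, suffix_len) != 0)
            continue;
        int n = snprintf(out, out_size, "%.*s%s",
                         (int)stem_len, word, rules[i].replacement);
        return (n >= 0 && (size_t)n < out_size) ? 1 : -1;
    }
    int n = snprintf(out, out_size, "%s", word);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

The evaluator (`stem`) knows nothing about any particular language.
Supporting a new language would then mean supplying a different rule
table, not writing a new function.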

I am in the process of writing some documentation explaining the scheme
files and vst files. I will send it to you once it is done. It will make
this much easier to understand.

> 
> 
> 4) The ideas page says "Today, when a word is learned, varnam takes all the
> possible prefixes into account". Prefixes? Shouldn't it be suffixes?

No, it is prefixes. For example, when the word മലയാളം is learned, varnam
also learns its prefixes: മല, മലയാ, etc. So when it gets a pattern like
"malayali", it can tokenize it easily, and you don't have to type
something like "malayaali".

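Prefix generation can be sketched like this. It is a simplified
illustration under my own assumptions, not the actual learning code:
`build_prefix` is a hypothetical helper that joins the first `k` tokens
of an already tokenized word.

```c
#include <assert.h>
#include <string.h>

/* Join the first `k` tokens into `out`. Returns 0 on success,
   -1 if `k` is out of range or `out` is too small. */
int build_prefix(const char *const *tokens, size_t token_count,
                 size_t k, char *out, size_t out_size)
{
    assert(tokens != NULL && out != NULL);
    if (k == 0 || k > token_count || out_size == 0)
        return -1;
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < k; i++) {
        size_t len = strlen(tokens[i]);
        if (used + len + 1 > out_size)
            return -1;
        memcpy(out + used, tokens[i], len + 1);  /* copies the terminator too */
        used += len;
    }
    return 0;
}
```

With the tokens മ, ല, യാ, ളം, calling it with `k = 2` and `k = 3` builds
the prefixes മല and മലയാ from the example above.
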
Suffixes won't help because tokenization is left to right. This is where
another major improvement could be possible in varnam. If we can come up
with a tokenization algorithm that takes prefixes, suffixes and partial
matches into account, then we can transliterate literally any word. But
it's a hard problem which needs lots of research and effort. The real
effort will be in doing it at the scale at which varnam operates now.
Today, every keystroke that you make in the varnam editor searches over
7 million patterns to predict the result. All this happens in less than
a second. Improving tokenization while keeping the current performance
is a *hard* problem.
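
For illustration, a left-to-right tokenizer can be sketched as a greedy
longest match over a pattern table. Greedy longest match is my
simplification for the example and not necessarily what varnam does
internally. The table entries and the names `struct pattern`,
`example_table` and `tokenize` are hypothetical.

```c
#include <assert.h>
#include <string.h>

struct pattern {
    const char *pattern;  /* romanized input */
    const char *value;    /* output text */
};

/* Illustrative table; the real mappings live in the scheme files. */
const struct pattern example_table[] = {
    { "ma", "മ" }, { "la", "ല" }, { "ya", "യ" },
    { "yaa", "യാ" }, { "lam", "ളം" },
};
const size_t example_table_size =
    sizeof example_table / sizeof example_table[0];

/* Scan `input` left to right, taking the longest matching pattern at
   each position and appending its value to `out`. Returns the number of
   tokens matched, or -1 if some position matches nothing or `out` is
   too small. */
int tokenize(const struct pattern *table, size_t table_size,
             const char *input, char *out, size_t out_size)
{
    assert(table != NULL && input != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    size_t used = 0;
    int count = 0;
    out[0] = '\0';
    while (*input != '\0') {
        const struct pattern *best = NULL;
        size_t best_len = 0;
        for (size_t i = 0; i < table_size; i++) {
            size_t len = strlen(table[i].pattern);
            if (len > best_len &&
                strncmp(input, table[i].pattern, len) == 0) {
                best = &table[i];
                best_len = len;
            }
        }
        if (best == NULL)
            return -1;  /* nothing matches at this position */
        size_t value_len = strlen(best->value);
        if (used + value_len + 1 > out_size)
            return -1;
        memcpy(out + used, best->value, value_len + 1);
        used += value_len;
        input += best_len;
        count++;
    }
    return count;
}
```

With this small table, "malayaalam" splits into four tokens, while
"malayali" fails at "li" because no pattern matches there. The linear
scan over the table is only for readability; it would not scale to
millions of patterns, which is exactly the performance concern above.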

> 
> Let me try and coin a malayalam stemmer. I will post what I come up with
> here.

That's great. Feel free to ask any questions. You are already asking
pretty good questions. Good going.

> 
> regards,
> 
> Kevin Martin Jose
> 

- -- 
Cheers,
Navaneeth
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: GPGTools - https://gpgtools.org

iQEcBAEBCgAGBQJTEJsVAAoJEHFACYSL7h6kpQsH/3NjXOPY1vGZD868bz4Zoudm
3gzlbtSqiGU6szTBwfp91vN1W3WMkuXLeJR3rhvKErqfmA/WlBCzNG9dE4Pt2ATh
R/G+aOINt8zWEeZQcETwuWpsWolZ5xnWTnTvw8hEiVm0RjI1+havVgTm038hQKMT
PfbXFwyARln1x7Z/YoPMO7dK+E0aN4ASxQxpM5iIVtVFcoyT9slweh4gWPiJ7svG
tcikPJR7aRSu1urwUN6keO2ytVVEG0dIUDcQ0+gFQzzh+N9n6BHQhcZb30uHPfBU
+/SVfErELVDpePe2oMczPd3IKf/vz6Izi6ZsMeuhOKqE/V8nlXrKtZ/cynkqnGQ=
=38qz
-----END PGP SIGNATURE-----