varnamproject-discuss

[Varnamproject-discuss] Fwd: added datuk word corpus and ml_IN.dict to v


From: kiran ps
Subject: [Varnamproject-discuss] Fwd: added datuk word corpus and ml_IN.dict to varnam
Date: Thu, 13 Mar 2014 21:55:45 +0530



---------- Forwarded message ----------
From: kiran ps <address@hidden>
Date: 13 March 2014 20:57
Subject: added datuk word corpus and ml_IN.dict to varnam
To: address@hidden


Currently there are 317169 words in the varnam word corpus. I have managed to extract 78855 words from the Datuk word corpus published by Olam. The ml_IN.dict dictionary used by the spellchecker has around 142591 words. Together, 116461 of these words were new to varnam. Around 856 of the words have brackets () or a colon : in them. I think they belong to Sanskrit, so I added them to a separate file.
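For anyone who wants to repeat the bracket/colon separation, here is a minimal sketch in Python. It is not the script that was actually used; the function name `split_words` and the assumption that the input is one word per line are hypothetical.

```python
import re

# Characters that mark a word for separate review: brackets and colon.
SPECIAL = re.compile(r"[():]")


def split_words(words):
    """Split words into (clean, flagged) lists.

    A word is flagged when it contains '(', ')' or ':'.
    Surrounding whitespace is stripped and empty entries are skipped.
    """
    clean, flagged = [], []
    for word in words:
        word = word.strip()
        if not word:
            continue
        if SPECIAL.search(word):
            flagged.append(word)
        else:
            clean.append(word)
    return clean, flagged
```

The flagged list can then be written to its own file and reviewed by hand before anything is added to the corpus.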

Number of words in varnam = 317169
Number of words in olam (Datuk) = 78855
Number of words in ml_IN.dict = 142591
olam ∩ ml_dict = 8583
New words to varnam = 116461
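
The figures above can be cross-checked with plain set arithmetic. The sketch below assumes that "new words" means every word found in olam or ml_IN.dict that is not already in varnam; that reading is my assumption, and `new_words` is a hypothetical helper, not part of varnam.

```python
def new_words(varnam, *sources):
    """Return the words present in any source but absent from varnam."""
    candidates = set().union(*sources)
    return candidates - set(varnam)


# Counts reported in this mail.
olam_count = 78855
dict_count = 142591
overlap = 8583          # olam ∩ ml_dict
new_to_varnam = 116461

# Inclusion-exclusion: size of olam ∪ ml_dict.
union_size = olam_count + dict_count - overlap

# Under the assumption above, the remainder was already in varnam.
already_in_varnam = union_size - new_to_varnam
```

Under that assumption the two sources together hold 212863 distinct words, of which 96402 were already in varnam.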

The varnam corpus is based mainly on material collected from pages on the World Wide Web. By using the synchronization tool we can upload words from offline IMEs to the online repository more easily. I think the data we have collected needs to be reviewed; by doing so we can create a better corpus. The corpus will help us track and record the very latest developments in the language. By analyzing the corpus with special software, we can see words in context and find out how new words and senses are emerging, as well as spot other trends in usage, spelling and so on. The corpus will also help to create a better dictionary: the spellchecker that we are currently using has only about 150000 words, while varnam has 400000+. The corpus will be helpful to almost every project that we have.

Attachments

newtovarnam - words new to varnam
varnam - varnam word corpus
datukextracted - words extracted from datuk
datuk brackets - words having brackets
datuk colon - words having colon
datuk - datuk corpus
ml_IN.dict - Malayalam dictionary used in spellcheckers




