bug-gnu-emacs

bug#34862: 27.0.50; Trying to update pinyin.map


From: Eli Zaretskii
Subject: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Wed, 20 Mar 2019 11:45:30 +0200

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Fri, 15 Mar 2019 11:31:40 -0700
> 
> > That file is imported from an external source, isn't it?  Are you
> > saying we should stop synchronizing it with that source, and instead
> > fork it, maintain our own separate copy, and never resync with that
> > source again?  If so, then I see no reason not to recode it in UTF-8.
> 
> Near as I can tell that file was imported into Emacs in 2001 and not
> touched since (apart from copyright and encoding stuff). The Debian
> package from which it comes seems to have been orphaned in 2003[1]. So
> there's not much to either synchronize or fork!

OK, sounds reasonable.

> > Btw, I understand that the Google pinyin method is Apache licensed,
> > but does this mean we can freely use its data for updating pinyin.map?
> > IANAL.  Could you perhaps describe how you intend to extract the data
> > from the Google input method for the purpose of updating our file?  I
> > think someone will have to audit that process for being legal and
> > compatible with both the Apache license and the GPL.
> 
> This[2] is the source file I used. I chopped off all the
> multiple-character dictionary entries, and munged the remaining data
> into the format we need. I.e., lines like this:
> 
> 八 6677.54934466 0 ba
> 把 165484.231697 0 ba
> 吧 385205.434615 0 ba
> 
> Became this:
> 
> ba 吧把八
> 
> A straight rearrangement, with frequency of use translated into simple
> ordering of the characters. While this is obviously pretty manual, and a
> bit of work, a file like this really only needs to be updated every five
> years or so -- if that. Whenever someone thinks of it.

I think this should be done with a script, and that script should be
in our repository.  The easiest kind of a script is a Lisp program, of
course, but we can also use other kinds, such as Awk scripts.
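
To make the idea concrete, here is a minimal sketch of such a
conversion, written in Python purely for illustration (a Lisp or Awk
version would follow the same steps).  It assumes each input line has
exactly four whitespace-separated fields (character, frequency, an
unused third column, pinyin syllable), as in the sample lines quoted
above; the function name build_pinyin_map is hypothetical, and any
header lines that pinyin.map needs are not generated here.

```python
from collections import defaultdict


def build_pinyin_map(lines):
    """Group single characters by pinyin syllable, most frequent first.

    Assumed input format (taken from the sample lines in this thread):
        <character> <frequency> <unused column> <pinyin>
    Lines that do not have exactly four fields, and entries longer than
    one character, are skipped.
    """
    by_syllable = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # blank lines and multi-syllable dictionary entries
        word, freq, _unused, pinyin = fields
        if len(word) != 1:
            continue  # keep single-character entries only
        by_syllable[pinyin].append((float(freq), word))

    result = []
    for pinyin in sorted(by_syllable):
        # Higher frequency sorts earlier within a syllable.
        entries = sorted(by_syllable[pinyin], key=lambda e: -e[0])
        result.append(pinyin + " " + "".join(ch for _, ch in entries))
    return result


if __name__ == "__main__":
    sample = [
        "八 6677.54934466 0 ba",
        "把 165484.231697 0 ba",
        "吧 385205.434615 0 ba",
    ]
    print("\n".join(build_pinyin_map(sample)))  # prints "ba 吧把八"
```

Reading the real source file would additionally need the right
encoding passed to open(); the file name suggests UTF-16, but that is
an assumption to verify against the actual file.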

> Regarding the license, I'm even less of a lawyer than you, but these[3]
> are the terms that cover this data.

Richard, could you please look at that license and tell if we can use
this data file?

> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
> > characters you want to add.  Or did you not try using it?)
> 
> I did not try using it! Mostly because the error message suggested
> gb18030 first. gbk also works. I don't have any opinion about encoding,
> apart from assuming utf8 unless there's a good reason not to.

I see no good reason to use anything other than UTF-8.

> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=189523;msg=18
> 
> [2] https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/jni/data/rawdict_utf16_65105_freq.txt
> 
> [3] https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/NOTICE

Thanks.




