bug#34862: 27.0.50; Trying to update pinyin.map
From: Eli Zaretskii
Subject: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Wed, 20 Mar 2019 11:45:30 +0200
> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Fri, 15 Mar 2019 11:31:40 -0700
>
> > That file is imported from an external source, isn't it? Are you
> > saying we should stop synchronizing it with that source, and instead
> > fork it, maintain our own separate copy, and never resync with that
> > source again? If so, then I see no reason not to recode it in UTF-8.
>
> Near as I can tell that file was imported into Emacs in 2001 and not
> touched since (apart from copyright and encoding stuff). The Debian
> package from which it comes seems to have been orphaned in 2003[1]. So
> there's not much to either synchronize or fork!
OK, sounds reasonable.
> > Btw, I understand that the Google pinyin method is Apache licensed,
> > but does this mean we can freely use its data for updating pinyin.map?
> > IANAL. Could you perhaps describe how you intend to extract the data
> > from the Google input method for the purpose of updating our file? I
> > think someone will have to audit that process for being legal and
> > compatible with both the Apache license and the GPL.
>
> This[2] is the source file I used. I chopped off all the
> multiple-character dictionary entries, and munged the remaining data
> into the format we need. I.e., lines like this:
>
> 八 6677.54934466 0 ba
> 把 165484.231697 0 ba
> 吧 385205.434615 0 ba
>
> Became this:
>
> ba 吧把八
>
> A straight rearrangement, with frequency of use translated into simple
> ordering of the characters. While this is obviously pretty manual, and a
> bit of work, a file like this really only needs to be updated every five
> years or so -- if that. Whenever someone thinks of it.
I think this should be done with a script, and that script should be
in our repository. The easiest kind of script is a Lisp program, of
course, but we can also use other kinds, such as Awk scripts.
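
For illustration only, here is a minimal sketch of what such a conversion
script might look like. It is written in Python rather than Lisp or Awk,
and it is not a script from the Emacs repository. It assumes every
single-character entry has exactly four whitespace-separated fields
(character, frequency, a flag, toneless pinyin), as in the sample quoted
above, and that multi-character dictionary entries can be recognized and
skipped because they do not fit that shape. The function name
build_pinyin_map is hypothetical.

```python
from collections import defaultdict


def build_pinyin_map(lines):
    """Group single characters by pinyin, most frequent first.

    Assumed input format (from the sample in this thread):
        <character> <frequency> <flag> <pinyin>
    """
    by_pinyin = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Assumption: multi-character entries carry one pinyin syllable
        # per character, so they have more than four fields.
        if len(fields) != 4:
            continue
        char, freq, _flag, pinyin = fields
        if len(char) != 1:
            continue
        by_pinyin[pinyin].append((float(freq), char))

    result = []
    for pinyin in sorted(by_pinyin):
        # Higher frequency sorts first; the frequency itself is dropped.
        entries = sorted(by_pinyin[pinyin], key=lambda e: -e[0])
        result.append(pinyin + " " + "".join(c for _, c in entries))
    return result


sample = [
    "八 6677.54934466 0 ba",
    "把 165484.231697 0 ba",
    "吧 385205.434615 0 ba",
]
print("\n".join(build_pinyin_map(sample)))
```

Run on the three sample lines, this groups the characters under "ba" in
descending order of frequency, which is the rearrangement described in
the quoted message.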
> Regarding the license, I'm even less of a lawyer than you, but these[3]
> are the terms that cover this data.
Richard, could you please look at that license and tell if we can use
this data file?
> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
> > characters you want to add. Or did you not try using it?)
>
> I did not try using it! Mostly because the error message suggested
> gb18030 first. gbk also works. I don't have any opinion about encoding,
> apart from assuming utf8 unless there's a good reason not to.
I see no good reason to use anything other than UTF-8.
> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=189523;msg=18
>
> [2]
> https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/jni/data/rawdict_utf16_65105_freq.txt
>
> [3]
> https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/NOTICE
Thanks.