bug#34862: 27.0.50; Trying to update pinyin.map
From: Eric Abrahamsen
Subject: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Wed, 20 Mar 2019 12:30:22 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

On 03/20/19 11:45 AM, Eli Zaretskii wrote:
[...]
>> > Btw, I understand that the Google pinyin method is Apache licensed,
>> > but does this mean we can freely use its data for updating pinyin.map?
>> > IANAL. Could you perhaps describe how you intend to extract the data
>> > from the Google input method for the purpose of updating our file? I
>> > think someone will have to audit that process for being legal and
>> > compatible with both the Apache license and the GPL.
>>
>> This[2] is the source file I used. I chopped off all the
>> multiple-character dictionary entries, and munged the remaining data
>> into the format we need. I.e., lines like this:
>>
>> 八 6677.54934466 0 ba
>> 把 165484.231697 0 ba
>> 吧 385205.434615 0 ba
>>
>> Became this:
>>
>> ba 吧把八
>>
>> A straight rearrangement, with frequency of use translated into simple
>> ordering of the characters. While this is obviously pretty manual, and a
>> bit of work, a file like this really only needs to be updated every five
>> years or so -- if that. Whenever someone thinks of it.
>
> I think this should be done with a script, and that script should be
> in our repository. The easiest kind of a script is a Lisp program, of
> course, but we can also use other kinds, such as Awk scripts.

Awk seems just right for the problem, but I haven't written much in it;
I did the original munging in elisp. Would this be a script written for
use with -batch and a custom make target? Or something to be loaded into
a running Emacs and called interactively? In either case, should it also
be responsible for downloading a recent copy of the source file, or
should that be done first, and the function pointed at the file?

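For concreteness, here is a rough sketch of the rearrangement described in the quoted text above: drop multi-character entries, group by pinyin, and order each group by descending frequency. It is in Python purely for illustration (the thread only discusses Lisp and Awk), and it is not the elisp actually used. The field layout (character, frequency, a flag, then the pinyin) is an assumption read off the three sample lines, and the name build_pinyin_map is hypothetical.

```python
from collections import defaultdict


def build_pinyin_map(lines):
    """Turn 'char freq flag pinyin' lines into 'pinyin chars' lines.

    Assumed input layout (inferred from the sample lines only):
    field 0 = character(s), field 1 = frequency, field 2 = a flag,
    remaining fields = pinyin syllable(s).
    """
    by_syllable = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        text, freq, syllables = fields[0], fields[1], fields[3:]
        # Keep only single-character entries with a single syllable.
        if len(text) != 1 or len(syllables) != 1:
            continue
        by_syllable[syllables[0]].append((float(freq), text))

    result = []
    for syllable in sorted(by_syllable):
        # Most frequent character first; the sort is stable for ties.
        ordered = sorted(by_syllable[syllable], key=lambda e: -e[0])
        result.append(syllable + " " + "".join(ch for _, ch in ordered))
    return result


sample = [
    "八 6677.54934466 0 ba",
    "把 165484.231697 0 ba",
    "吧 385205.434615 0 ba",
]
print("\n".join(build_pinyin_map(sample)))
```

This version takes lines from a file that has already been downloaded and decoded as UTF-8, so it leaves the question of who fetches the source file open.
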
>> Regarding the license, I'm even less of a lawyer than you, but these[3]
>> are the terms that cover this data.
>
> Richard, could you please look at that license and tell if we can use
> this data file?
>
>> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
>> > characters you want to add. Or did you not try using it?)
>>
>> I did not try using it! Mostly because the error message suggested
>> gb18030 first. gbk also works. I don't have any opinion about encoding,
>> apart from assuming utf8 unless there's a good reason not to.
>
> I see no good reason to use anything other than UTF-8.

Excellent. I will think about the script, and look forward to word from
Richard.

Eric