bug-gnu-emacs

bug#34862: 27.0.50; Trying to update pinyin.map


From: Eric Abrahamsen
Subject: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Wed, 20 Mar 2019 12:30:22 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

On 03/20/19 11:45 AM, Eli Zaretskii wrote:

[...]

>> > Btw, I understand that the Google pinyin method is Apache licensed,
>> > but does this mean we can freely use its data for updating pinyin.map?
>> > IANAL. Could you perhaps describe how you intend to extract the data
>> > from the Google input method for the purpose of updating our file? I
>> > think someone will have to audit that process for being legal and
>> > compatible with both the Apache license and the GPL.
>> 
>> This[2] is the source file I used. I chopped off all the
>> multiple-character dictionary entries, and munged the remaining data
>> into the format we need. I.e., lines like this:
>> 
>> 八 6677.54934466 0 ba
>> 把 165484.231697 0 ba
>> 吧 385205.434615 0 ba
>> 
>> Became this:
>> 
>> ba 吧把八
>> 
>> A straight rearrangement, with frequency of use translated into simple
>> ordering of the characters. While this is obviously pretty manual, and a
>> bit of work, a file like this really only needs to be updated every five
>> years or so -- if that. Whenever someone thinks of it.
>
> I think this should be done with a script, and that script should be
> in our repository.  The easiest kind of a script is a Lisp program, of
> course, but we can also use other kinds, such as Awk scripts.

Awk seems just right for the problem, but I haven't written much in it;
I did the original munging in elisp. Would this be a script written for
use with -batch and a custom make target? Or something to be loaded into
a running Emacs and called interactively? In either case, should it also
be responsible for downloading a recent copy of the source file, or
should that be done first, and the function pointed at the file?
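
Whatever language it ends up in, the transformation itself is small.
Below is a rough sketch in Python, purely as an illustration and not
the script proposed for the repository (that would more likely be Awk
or Lisp, as discussed above). It assumes each source line has the
layout shown in the sample, i.e. word, frequency, a flag field, then
one pinyin syllable per character; the function name build_pinyin_map
is made up for this example.

```python
from collections import defaultdict


def build_pinyin_map(lines):
    """Group single characters by pinyin syllable, most frequent first.

    Assumed input layout per line (whitespace separated):
        word  frequency  flag  pinyin [pinyin ...]
    """
    by_pinyin = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        word, pinyin = fields[0], fields[3:]
        # Drop the multiple-character dictionary entries.
        if len(word) != 1 or len(pinyin) != 1:
            continue
        try:
            freq = float(fields[1])
        except ValueError:
            continue  # frequency field is not a number
        by_pinyin[pinyin[0]].append((freq, word))

    result = []
    for syllable in sorted(by_pinyin):
        # Frequency of use becomes simple ordering of the characters.
        entries = sorted(by_pinyin[syllable], key=lambda e: -e[0])
        result.append(syllable + " " + "".join(ch for _, ch in entries))
    return result


sample = [
    "八 6677.54934466 0 ba",
    "把 165484.231697 0 ba",
    "吧 385205.434615 0 ba",
]
print("\n".join(build_pinyin_map(sample)))  # prints: ba 吧把八
```

This sketch leaves the download question open: it takes an iterable
of lines, so the caller decides where the file comes from.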

>> Regarding the license, I'm even less of a lawyer than you, but these[3]
>> are the terms that cover this data.
>
> Richard, could you please look at that license and tell if we can use
> this data file?
>
>> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
>> > characters you want to add.  Or did you not try using it?)
>> 
>> I did not try using it! Mostly because the error message suggested
>> gb18030 first. gbk also works. I don't have any opinion about encoding,
>> apart from assuming utf8 unless there's a good reason not to.
>
> I see no good reason to use anything other than UTF-8.

Excellent. I will think about the script, and look forward to word from
Richard.

Eric




