
bug#50247: 27.2; wrong `word-wrap' for Chinese characters


From: Eli Zaretskii
Subject: bug#50247: 27.2; wrong `word-wrap' for Chinese characters
Date: Sun, 29 Aug 2021 10:26:56 +0300

> Date: Sun, 29 Aug 2021 11:14:40 +0800
> From:  ClaudeMonet via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
> 
> When `toggle-word-wrap' is enabled, lines that end with Chinese
> characters and Chinese punctuation won't be separated in the right
> way.  Normally, all the Chinese words in a sentence are crowded
> together and recognized by Emacs as one single WORD.
> 
> E.g., "世界" is a word in Chinese, and "世界人民大团结万岁。" is a
> full sentence ending with a full-width period, but Emacs recognizes
> the whole sentence as one word and thus wraps lines in the wrong
> way.

Emacs 28 introduces the variable word-wrap-by-category; if you set
it non-nil, the above should work as you expect, assuming the
Kinsoku rules are good enough for that.  (Since you didn't describe
in detail what your expectations of the "right way" were in this
case, I couldn't actually verify that the results are as you
expect.)
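
Something like the following in your init file should be enough to
try it (a rough, untested sketch; word-wrap-by-category exists only
in Emacs 28 and later):

  ;; Allow wrapping after CJK characters, honoring the Kinsoku rules
  ;; (available starting with Emacs 28).
  (setq-default word-wrap-by-category t)
  ;; Wrapping at word boundaries also requires word-wrap to be
  ;; non-nil, e.g. via visual-line-mode.
  (global-visual-line-mode 1)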

> By the way, I think this has long been a problem for Chinese users,
> since we use a full-width punctuation system, whereas in English the
> half-width one is more generally adopted.

Please elaborate in what way this presents a problem in Emacs,
preferably with examples.

> Another thing is, in Emacs when you use the `forward-word' key
> binding, I know English words are all separated either by
> punctuation or by blank characters (<space>, <tab>, etc.), but in
> Chinese, words in a single sentence are usually separated by
> nothing.  I don't know what the normal practice for "word
> recognizing" tasks is on modern OSes like Mac and Windows; I guess
> there is a dictionary mechanism.

Emacs has find-word-boundary-function-table, which can be used to
define such rules.  In general, we try to follow Unicode, but AFAIU
Unicode TR29 doesn't specify any word-breaking rules for Chinese
characters.
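
For illustration, here is a rough sketch of that mechanism (untested;
the function name and the character range are made up for the
example, and the boundary logic is just a syntax-table placeholder
where a real dictionary-based segmenter would go).  subword-mode in
subword.el uses the same table:

  ;; Each entry in find-word-boundary-function-table is a function
  ;; called with POS and LIMIT; it should return the position of the
  ;; word boundary between them.
  (defun my-find-word-boundary (pos limit)  ; hypothetical name
    "Placeholder boundary finder; a real one would consult a dictionary."
    (save-excursion
      (goto-char pos)
      (if (<= pos limit)
          (skip-syntax-forward "w" limit)
        (skip-syntax-backward "w" limit))
      (point)))

  ;; Install it, buffer-locally, for the CJK Unified Ideographs block.
  (let ((table (make-char-table nil)))
    (set-char-table-range table '(#x4E00 . #x9FFF)
                          #'my-find-word-boundary)
    (setq-local find-word-boundary-function-table table))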

> A footnote here: for tokenizing Chinese words, there is a Python
> tokenizer called "jieba" in the NLP field; it would be a great
> reference if you guys are going to address this issue.  The GitHub
> link of "jieba" is:
> 
>       https://github.com/fxsjy/jieba

Patches are welcome to add Chinese text segmentation capabilities to
Emacs.




