emacs-orgmode
Re: [O] [BUG] Mark-up handling chokes on Unicode white-space


From: Tobias Getzner
Subject: Re: [O] [BUG] Mark-up handling chokes on Unicode white-space
Date: Wed, 24 Sep 2014 09:34:25 +0200

Hi Aaron,

On Di, 2014-09-23 at 14:15 -0400, Aaron Ecay wrote:
> org-emphasis-regexp-components is known to be a wart.  You can search
> for posts on the mailing list.  Some people are trying to figure out how
> to get rid of it.  (You can search in particular for Nicolas Goaziou’s
> posts...)  Here’s one thread where you can see the lay of the land:
> <http://mid.gmane.org/address@hidden>.

Thank you for the background info!

> All that to say, the longer-term solution is to figure out some radically
> different approach.  In the meantime though, if you can provide a list of
> characters (by unicode name and/or code point) that you think should be
> added to that variable, someone might be able to add them. 

I guess the straightforward way of defining white-space would be to
use the set of characters with the Unicode property WSpace=Y, and
this is what «[:space:]», «\s», etc., should be expected to match
in Unicode-based locales. I’m supplying a list of code-points below,
for convenience.

I agree though that defining what counts as «white space» within the
confines of org-mode is putting the cart before the horse. I’ll try to
ascertain whether the Emacs implementation of «[:space:]» really only
does 8-bit spaces, and if so I’ll see whether I can poke someone on the
Emacs bug tracker about this.

Best regards,
T.


──────────────────────────────────────────────────────────────────────
List of Unicode white-space

Below is the list of characters with the property White_Space set,
taken from the Unicode 7.0.0 character database. This includes
line-breaking white-space such as «line feed». If these are not
relevant, one can use the subset of space separators (Zs; these do not
include control characters such as Tab) and control chars (Cc).

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE
──────────────────────────────────────────────────────────────────────
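
As an illustration of how the list above could be turned into a
matcher, here is a minimal sketch in Python (not Emacs Lisp, and not
what org-mode or Emacs actually does). It builds a regular-expression
character class from the code-point ranges listed above. The names
WHITE_SPACE_RANGES, char_class, and is_white_space are made up for
this example.

```python
import re

# Code-point ranges copied from the White_Space list above (Unicode 7.0.0).
# Single code points are written as ranges with equal start and end.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D),  # Cc: <control-0009>..<control-000D>
    (0x0020, 0x0020),  # Zs: SPACE
    (0x0085, 0x0085),  # Cc: <control-0085>
    (0x00A0, 0x00A0),  # Zs: NO-BREAK SPACE
    (0x1680, 0x1680),  # Zs: OGHAM SPACE MARK
    (0x2000, 0x200A),  # Zs: EN QUAD..HAIR SPACE
    (0x2028, 0x2028),  # Zl: LINE SEPARATOR
    (0x2029, 0x2029),  # Zp: PARAGRAPH SEPARATOR
    (0x202F, 0x202F),  # Zs: NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # Zs: MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # Zs: IDEOGRAPHIC SPACE
]


def char_class(ranges):
    """Return a regex character class such as [\\u0009-\\u000D\\u0020...]."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append("\\u%04X" % lo)
        else:
            parts.append("\\u%04X-\\u%04X" % (lo, hi))
    return "[" + "".join(parts) + "]"


# Compiled once; matches exactly one character from the list.
WS_RE = re.compile(char_class(WHITE_SPACE_RANGES))


def is_white_space(ch):
    """True if the single character ch is in the White_Space list above."""
    return WS_RE.fullmatch(ch) is not None


# Total number of code points covered by the ranges.
WHITE_SPACE_COUNT = sum(hi - lo + 1 for lo, hi in WHITE_SPACE_RANGES)
```

For example, is_white_space("\u00a0") and is_white_space("\u202f") are
true, whereas is_white_space("\u200b") (ZERO WIDTH SPACE, which is not
in the list) is false. To drop the line-breaking characters, one would
remove the corresponding entries from WHITE_SPACE_RANGES.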




