Re: regexp filter to match non-english characters
From: Robert D. Crawford
Subject: Re: regexp filter to match non-english characters
Date: Thu, 06 Nov 2008 14:41:30 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)
Hello Ted. Thanks for all the work. See below for comments.
Ted Zlatanov <tzz@lifelogs.com> writes:
> On Thu, 06 Nov 2008 10:43:38 -0600 "Robert D. Crawford" <rdc1x@comcast.net>
> wrote:
>
> RDC> Score files are great. Truth be told, I'm just looking for what works.
> RDC> I like your solution but it will exclude posts with unicode characters,
> RDC> which is something I would like to avoid if possible.
>
> OK, so the question now is "how to tell if a character is in the Asian
> Unicode character ranges." Unfortunately I recall Emacs' own character
> database will misrepresent some Latin characters, so I wouldn't depend
> on character properties.
>
> I looked at ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt and picked
> the blocks that looked useful.
[snip long code]
> Evaluating this (you have to load the 'cl library too) gives
>
> "[^\\u0D00-\\u0D7F\\u0D80-\\u0DFF\\u0E00-\\u0E7F\\u0E80-\\u0EFF\\u0F00-\\u0FFF\\u1000-\\u109F\\u1780-\\u17FF\\u1800-\\u18AF\\u1900-\\u194F\\u1950-\\u197F\\u1980-\\u19DF\\u19E0-\\u19FF\\u1A00-\\u1A1F\\u1B00-\\u1B7F\\u2E80-\\u2EFF\\u2F00-\\u2FDF\\u2FF0-\\u2FFF\\u3000-\\u303F\\u3040-\\u309F\\u30A0-\\u30FF\\u3100-\\u312F\\u3130-\\u318F\\u3190-\\u319F\\u31A0-\\u31BF\\u31C0-\\u31EF\\u31F0-\\u31FF\\u3200-\\u32FF\\u3300-\\u33FF\\u3400-\\u4DBF\\u4DC0-\\u4DFF\\u4E00-\\u9FFF\\uA000-\\uA48F\\uA490-\\uA4CF\\uAC00-\\uD7AF\\uF900-\\uFAFF]"
>
> I don't know if this is good enough for you, but the ranges are correct
> at least and you see how you can add more. I tested with a few
> characters like this:
>
> (string-match (zme) "helloഀ")
>
> and it seems to work OK. In a score file you'll have only one backslash
> but otherwise it should work.
I tested with this:
(string-match
"[^\\u0D00-\\u0D7F\\u0D80-\\u0DFF\\u0E00-\\u0E7F\\u0E80-\\u0EFF\\u0F00-\\u0FFF\\u1000-\\u109F\\u1780-\\u17FF\\u1800-\\u18AF\\u1900-\\u194F\\u1950-\\u197F\\u1980-\\u19DF\\u19E0-\\u19FF\\u1A00-\\u1A1F\\u1B00-\\u1B7F\\u2E80-\\u2EFF\\u2F00-\\u2FDF\\u2FF0-\\u2FFF\\u3000-\\u303F\\u3040-\\u309F\\u30A0-\\u30FF\\u3100-\\u312F\\u3130-\\u318F\\u3190-\\u319F\\u31A0-\\u31BF\\u31C0-\\u31EF\\u31F0-\\u31FF\\u3200-\\u32FF\\u3300-\\u33FF\\u3400-\\u4DBF\\u4DC0-\\u4DFF\\u4E00-\\u9FFF\\uA000-\\uA48F\\uA490-\\uA4CF\\uAC00-\\uD7AF\\uF900-\\uFAFF]"
">>")
and it returns nil. Great!
Testing with the Unicode character » (C-q 273 RET) returns 0. Curses.
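A possible explanation for both results, offered as a sketch rather than a confirmed diagnosis: inside an Emacs Lisp string, "\\u0D00" is a literal backslash followed by "u0D00", not the character U+0D00, so the bracket expression ends up holding ASCII characters, including the range from "0" to "\", which covers ">". The class is also negated with "^", so it matches characters outside the listed set. The version below builds a non-negated class from real characters instead. The names my-asian-blocks and my-asian-regexp are hypothetical, only a handful of the blocks are listed, and it assumes the goal is to detect characters inside those blocks.

```elisp
;; Sketch only: `my-asian-blocks' and `my-asian-regexp' are hypothetical
;; names, and only a few of the Unicode blocks from the long list are shown.
(defvar my-asian-blocks
  '((#x0D00 . #x0D7F)   ; Malayalam
    (#x3040 . #x309F)   ; Hiragana
    (#x30A0 . #x30FF)   ; Katakana
    (#x4E00 . #x9FFF)   ; CJK Unified Ideographs
    (#xAC00 . #xD7AF))  ; Hangul Syllables
  "Inclusive (START . END) code point ranges to look for.")

(defun my-asian-regexp ()
  "Return a bracket expression matching any character in `my-asian-blocks'."
  (concat "["
          (mapconcat (lambda (range)
                       ;; `string' builds "X-Y" from the actual characters,
                       ;; so no backslash escapes reach the regexp.
                       (string (car range) ?- (cdr range)))
                     my-asian-blocks "")
          "]"))

(string-match (my-asian-regexp) ">>")           ; => nil
(string-match (my-asian-regexp) "»")            ; => nil, U+00BB is in no block
(string-match (my-asian-regexp) "hello\u3042")  ; => 5, U+3042 is Hiragana

;; For comparison, a shortened form of the string tested above.  The class
;; holds "\", "u", "0", "D", "7", "F" and the ASCII range "0-\".
(string-match "[^\\u0D00-\\u0D7F]" ">>")        ; => nil, ">" is inside "0-\"
(string-match "[^\\u0D00-\\u0D7F]" "»")         ; => 0, "»" is outside the set
```

Because the ranges are built from code points, adding another block is a one-line change to my-asian-blocks. How the resulting pattern should be written in a score file is a separate question that this sketch does not address.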
Thank you for all your help. This has been way more difficult than I
thought it would be. Considering that, if you don't feel the need to
continue, I agree. Some would take this as a personal challenge to make
it work. I however have other things to do than to track down why this
regexp doesn't work. I can't spend time on it and I surely don't expect
you to do so either.
Thanks again, but I think I'll be hitting C-k and saving my time for
other things.
rdc
--
Robert D. Crawford rdc1x@comcast.net
QOTD:
"I only touch base with reality on an as-needed basis!"