From: Michal Nazarewicz
Subject: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Date: Thu, 15 Sep 2016 16:23:54 +0200
User-agent: Notmuch/0.19+53~g2e63a09 (http://notmuchmail.org) Emacs/25.1.50.106 (x86_64-unknown-linux-gnu)
On Tue, Sep 13 2016, Eli Zaretskii wrote:
> Currently, case changes in unibyte characters and strings are only
> well defined for pure ASCII text; if the input or the result is not
> pure ASCII, we produce "undefined behavior".
Would the following (untested) change make sense, then?
diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..4dc2357 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -89,23 +89,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
for (i = 0; i < size; i++)
{
c = SREF (obj, i);
- MAKE_CHAR_MULTIBYTE (c);
c1 = c;
- if (inword && flag != CASE_CAPITALIZE_UP)
- c = downcase (c);
- else if (!uppercasep (c)
- && (!inword || flag != CASE_CAPITALIZE_UP))
- c = upcase1 (c1);
- if ((int) flag >= (int) CASE_CAPITALIZE)
- inword = (SYNTAX (c) == Sword);
- if (c != c1)
+ if (ASCII_CHAR_P (c))
{
- MAKE_CHAR_UNIBYTE (c);
- /* If the char can't be converted to a valid byte, just don't
- change it. */
- if (c >= 0 && c < 256)
- SSET (obj, i, c);
+ if (inword && flag != CASE_CAPITALIZE_UP)
+ c = downcase (c);
+ else if (!uppercasep (c)
+ && (!inword || flag != CASE_CAPITALIZE_UP))
+ c = upcase1 (c1);
}
+ if ((int) flag >= (int) CASE_CAPITALIZE)
+ inword = (SYNTAX (c) == Sword);
+ if (c != c1 && ASCII_CHAR_P (c))
+ SSET (obj, i, c);
}
return obj;
}
@@ -230,8 +226,9 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
else
{
c = FETCH_BYTE (start_byte);
- MAKE_CHAR_MULTIBYTE (c);
len = 1;
+ if (!ASCII_CHAR_P (c))
+ goto done;
}
c2 = c;
if (inword && flag != CASE_CAPITALIZE_UP)
@@ -239,9 +236,6 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
else if (!uppercasep (c)
&& (!inword || flag != CASE_CAPITALIZE_UP))
c = upcase1 (c);
- if ((int) flag >= (int) CASE_CAPITALIZE)
- inword = ((SYNTAX (c) == Sword)
- && (inword || !syntax_prefix_flag_p (c)));
if (c != c2)
{
last = start;
@@ -250,8 +244,8 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
if (! multibyte)
{
- MAKE_CHAR_UNIBYTE (c);
- FETCH_BYTE (start_byte) = c;
+ if (ASCII_CHAR_P (c))
+ FETCH_BYTE (start_byte) = c;
}
else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
FETCH_BYTE (start_byte) = c;
@@ -280,6 +274,10 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
}
}
}
+ done:
+ if ((int) flag >= (int) CASE_CAPITALIZE)
+ inword = ((SYNTAX (c) == Sword)
+ && (inword || !syntax_prefix_flag_p (c)));
start++;
start_byte += len;
}
If case conversion of non-ASCII characters in unibyte text isn’t supported,
we might as well skip all the logic that handles non-ASCII unibyte characters.
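To illustrate the idea outside of Emacs, here is a minimal standalone
sketch of ASCII-only upcasing over a byte buffer. It is not the code from
casefiddle.c: ascii_char_p, ascii_upcase and upcase_unibyte are
hypothetical helpers written for this example, and the real code also has
to track word boundaries and go through the case tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Emacs's ASCII_CHAR_P macro.  */
static int
ascii_char_p (int c)
{
  return c >= 0 && c < 0x80;
}

/* Hypothetical ASCII-only upcase; Emacs consults case tables instead.  */
static int
ascii_upcase (int c)
{
  return (c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c;
}

/* Upcase the ASCII letters of BUF in place and leave every byte
   >= 0x80 untouched.  Return the number of bytes changed.  */
static size_t
upcase_unibyte (unsigned char *buf, size_t size)
{
  size_t changed = 0;
  assert (buf != NULL || size == 0);
  for (size_t i = 0; i < size; i++)
    {
      int c = buf[i];
      if (!ascii_char_p (c))
        continue;               /* Non-ASCII byte: skip the casing logic.  */
      int c1 = ascii_upcase (c);
      if (c1 != c)
        {
          buf[i] = (unsigned char) c1;
          changed++;
        }
    }
  return changed;
}
```

With this sketch, a buffer holding the bytes of "istanbul" followed by a
single 0xff byte gets its eight ASCII letters upcased while the trailing
byte is left exactly as it was.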
> Properly means that upcasing "istanbul" in the above example will
> produce "İSTANBUL", not "iSTANBUL", and downcasing "IRMA" will produce
> "ırma".
I thought about that, but then another corner case is "istanbul\xff",
which is a unibyte string containing an 8-bit byte.
I have no strong feelings either way, so I’m also happy to leave it as
is.
--
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»