bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strin

bug-gnu-emacs

From:	Michal Nazarewicz
Subject:	bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Date:	Thu, 15 Sep 2016 16:23:54 +0200
User-agent:	Notmuch/0.19+53~g2e63a09 (http://notmuchmail.org) Emacs/25.1.50.106 (x86_64-unknown-linux-gnu)

On Tue, Sep 13 2016, Eli Zaretskii wrote:
> Currently, case changes in unibyte characters and strings are only
> well defined for pure ASCII text; if the input or the result is not
> pure ASCII, we produce "undefined behavior".

Would the following (untested) patch make sense then?

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..4dc2357 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -89,23 +89,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
       for (i = 0; i < size; i++)
        {
          c = SREF (obj, i);
-         MAKE_CHAR_MULTIBYTE (c);
          c1 = c;
-         if (inword && flag != CASE_CAPITALIZE_UP)
-           c = downcase (c);
-         else if (!uppercasep (c)
-                  && (!inword || flag != CASE_CAPITALIZE_UP))
-           c = upcase1 (c1);
-         if ((int) flag >= (int) CASE_CAPITALIZE)
-           inword = (SYNTAX (c) == Sword);
-         if (c != c1)
+         if (ASCII_CHAR_P (c))
            {
-             MAKE_CHAR_UNIBYTE (c);
-             /* If the char can't be converted to a valid byte, just don't
-                change it.  */
-             if (c >= 0 && c < 256)
-               SSET (obj, i, c);
+             if (inword && flag != CASE_CAPITALIZE_UP)
+               c = downcase (c);
+             else if (!uppercasep (c)
+                      && (!inword || flag != CASE_CAPITALIZE_UP))
+               c = upcase1 (c1);
            }
+         if ((int) flag >= (int) CASE_CAPITALIZE)
+           inword = (SYNTAX (c) == Sword);
+         if (c != c1 && ASCII_CHAR_P (c))
+           SSET (obj, i, c);
        }
       return obj;
     }
@@ -230,8 +226,9 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
       else
        {
          c = FETCH_BYTE (start_byte);
-         MAKE_CHAR_MULTIBYTE (c);
          len = 1;
+         if (!ASCII_CHAR_P (c))
+           goto done;
        }
       c2 = c;
       if (inword && flag != CASE_CAPITALIZE_UP)
@@ -239,9 +236,6 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
       else if (!uppercasep (c)
               && (!inword || flag != CASE_CAPITALIZE_UP))
        c = upcase1 (c);
-      if ((int) flag >= (int) CASE_CAPITALIZE)
-       inword = ((SYNTAX (c) == Sword)
-                 && (inword || !syntax_prefix_flag_p (c)));
       if (c != c2)
        {
          last = start;
@@ -250,8 +244,8 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
          if (! multibyte)
            {
-             MAKE_CHAR_UNIBYTE (c);
-             FETCH_BYTE (start_byte) = c;
+             if (ASCII_CHAR_P (c))
+               FETCH_BYTE (start_byte) = c;
            }
          else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
            FETCH_BYTE (start_byte) = c;
@@ -280,6 +274,10 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
                }
            }
        }
+    done:
+      if ((int) flag >= (int) CASE_CAPITALIZE)
+       inword = ((SYNTAX (c) == Sword)
+                 && (inword || !syntax_prefix_flag_p (c)));
       start++;
       start_byte += len;
     }

If case changes on non-ASCII characters aren’t supported in unibyte
text, we might just as well skip all the logic that handles non-ASCII
unibyte characters.
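
To illustrate the idea outside of Emacs, here is a minimal sketch in
plain C of an ASCII-only upcase over a raw byte buffer.  The names
ascii_upcase_byte and ascii_upcase_bytes are hypothetical and are not
taken from casefiddle.c; the sketch ignores case tables, syntax tables
and the inword tracking that the real code does.

```c
#include <stddef.h>

/* Hypothetical helper, not Emacs code: upcase one ASCII letter,
   return every other value unchanged.  */
static unsigned char
ascii_upcase_byte (unsigned char c)
{
  return (c >= 'a' && c <= 'z') ? (unsigned char) (c - 'a' + 'A') : c;
}

/* Hypothetical helper, not Emacs code: upcase a unibyte buffer in
   place.  Bytes >= 0x80 are skipped entirely, so they are never
   reinterpreted as characters and never written back.  */
void
ascii_upcase_bytes (unsigned char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (buf[i] < 0x80)
      buf[i] = ascii_upcase_byte (buf[i]);
}
```

Under these assumptions the result always has the same length as the
input and every non-ASCII byte comes out exactly as it went in.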

> Properly means that upcasing "istanbul" in the above example will
> produce "İSTANBUL", not "iSTANBUL", and downcasing "IRMA" will produce
> "ırma".

I thought about that, but then another corner case is "istanbul\xff",
which is a unibyte string containing a raw 8-bit byte.
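
To make that corner case concrete, here is a small sketch in plain C
that tells a pure-ASCII byte string from one carrying bytes >= 0x80.
The name has_raw_8bit is hypothetical and not an Emacs function; it
only shows the kind of check a caller would need before deciding
whether a unibyte string can safely be treated as text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not Emacs code: return true if any byte in
   BUF[0..LEN-1] has its high bit set, i.e. is not ASCII.  */
bool
has_raw_8bit (const unsigned char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (buf[i] >= 0x80)
      return true;
  return false;
}
```

With this check "istanbul" counts as pure ASCII, while "istanbul\xff"
does not, so the latter has no obvious character interpretation to
apply language-aware casing to.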

I have no strong feelings either way, so I’m also happy to just leave
it as is.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»
