bug#24603: [RFC 08/18] Support casing characters which map into multiple

bug-gnu-emacs
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#24603: [RFC 08/18] Support casing characters which map into multiple

From:	Michal Nazarewicz
Subject:	bug#24603: [RFC 08/18] Support casing characters which map into multiple code points
Date:	Tue, 4 Oct 2016 03:10:31 +0200
Implement uncoditional special casing rules defined in Unicode standard.

Among other thigs, they deal with cases when a single code point is
replaced by multiple ones becasue simple character does not exist (e.g.
ﬁ ligature turning into FL) or is not commonly used (e.g. ß turning into
SS).

* admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
standard distribution.
* admin/unidata/README: Mention SpecialCasing.txt.

* src/make-special-casing.py: New script to generate special-casing.h
file from the SpecialCasing.txt data file.

* src/casefiddle.c: Include special-casing.h so special casing rules are
available and can be used in the translation unit.

(struct casing_str_buf): New structure for representing short strings.
It’s used to compactly encode special casing rules.

(case_character_imlp): New function which can handle one-to-many
character mappings.
(case_character, case_single_character): Wrappers for the above
functions.  The former may map one character to multiple code points
while the latter does what the former used to do (i.e. handles
one-to-one mappings only).

(do_casify_integer, do_casify_unibyte_string,
do_casify_unibyte_region): Use case_single_character.
(do_casify_multibyte_string, do_casify_multibyte_region): Support new
features of case_character.
* (do_casify_region): Updated after do_casify_multibyte_string changes.

(upcase, capitalize, upcase-initials): Update documentation to mention
limitations when working on characters.

* test/src/casefiddle-tests.el (casefiddle-tests-casing): Update test
cases which are now passing.

* test/lisp/char-fold-tests.el (char-fold--ascii-upcase,
char-fold--ascii-downcase): New functions which behave like old ‘upcase’
and ‘downcase’.
(char-fold--test-match-exactly): Use the new functions.  This is needed
because otherwise ﬁ and similar characters are turned into their multi-
-character representation.
---
 .gitignore                      |   1 +
 admin/unidata/README            |   4 +
 admin/unidata/SpecialCasing.txt | 281 ++++++++++++++++++++++++++++++++++++++++
 etc/NEWS                        |  16 ++-
 src/Makefile.in                 |   3 +
 src/casefiddle.c                | 218 ++++++++++++++++++++++---------
 src/deps.mk                     |   2 +-
 src/make-special-casing.py      | 189 +++++++++++++++++++++++++++
 test/lisp/char-fold-tests.el    |  12 +-
 test/src/casefiddle-tests.el    |   9 +-
 10 files changed, 658 insertions(+), 77 deletions(-)
 create mode 100644 admin/unidata/SpecialCasing.txt
 create mode 100644 src/make-special-casing.py

diff --git a/.gitignore b/.gitignore
index 15f9c56..a07f972 100644
--- a/.gitignore
+++ b/.gitignore
@@ -79,6 +79,7 @@ lib/warn-on-use.h
 src/buildobj.h
 src/globals.h
 src/lisp.mk
+src/special-casing.h
 
 # Lisp-level sources built by 'make'.
 *cus-load.el
diff --git a/admin/unidata/README b/admin/unidata/README
index 534670c..06a6666 100644
--- a/admin/unidata/README
+++ b/admin/unidata/README
@@ -24,3 +24,7 @@ http://www.unicode.org/Public/8.0.0/ucd/Blocks.txt
 NormalizationTest.txt
 http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
 2016-07-16
+
+SpecialCasing.txt
+http://unicode.org/Public/UNIDATA/SpecialCasing.txt
+2016-03-03
diff --git a/admin/unidata/SpecialCasing.txt b/admin/unidata/SpecialCasing.txt
new file mode 100644
index 0000000..b23fa7f
--- /dev/null
+++ b/admin/unidata/SpecialCasing.txt
@@ -0,0 +1,281 @@
+# SpecialCasing-9.0.0.txt
+# Date: 2016-03-02, 18:55:13 GMT
+# © 2016 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in 
the U.S. and other countries.
+# For terms of use, see http://www.unicode.org/terms_of_use.html
+#
+# Unicode Character Database
+#   For documentation, see http://www.unicode.org/reports/tr44/
+#
+# Special Casing
+#
+# This file is a supplement to the UnicodeData.txt file. It does not define any
+# properties, but rather provides additional information about the casing of
+# Unicode characters, for situations when casing incurs a change in string 
length
+# or is dependent on context or locale. For compatibility, the UnicodeData.txt
+# file only contains simple case mappings for characters where they are 
one-to-one
+# and independent of context and language. The data in this file, combined with
+# the simple case mappings in UnicodeData.txt, defines the full case mappings
+# Lowercase_Mapping (lc), Titlecase_Mapping (tc), and Uppercase_Mapping (uc).
+#
+# Note that the preferred mechanism for defining tailored casing operations is
+# the Unicode Common Locale Data Repository (CLDR). For more information, see 
the
+# discussion of case mappings and case algorithms in the Unicode Standard.
+#
+# All code points not listed in this file that do not have a simple case 
mappings
+# in UnicodeData.txt map to themselves.
+# 
================================================================================
+# Format
+# 
================================================================================
+# The entries in this file are in the following machine-readable format:
+#
+# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
+#
+# <code>, <lower>, <title>, and <upper> provide the respective full case 
mappings
+# of <code>, expressed as character values in hex. If there is more than one 
character,
+# they are separated by spaces. Other than as used to separate elements, 
spaces are
+# to be ignored.
+#
+# The <condition_list> is optional. Where present, it consists of one or more 
language IDs
+# or casing contexts, separated by spaces. In these conditions:
+# - A condition list overrides the normal behavior if all of the listed 
conditions are true.
+# - The casing context is always the context of the characters in the original 
string,
+#   NOT in the resulting string.
+# - Case distinctions in the condition list are not significant.
+# - Conditions preceded by "Not_" represent the negation of the condition.
+# The condition list is not represented in the UCD as a formal property.
+#
+# A language ID is defined by BCP 47, with '-' and '_' treated equivalently.
+#
+# A casing context for a character is defined by Section 3.13 Default Case 
Algorithms
+# of The Unicode Standard.
+#
+# Parsers of this file must be prepared to deal with future additions to this 
format:
+#  * Additional contexts
+#  * Additional fields
+# 
================================================================================
+
+# 
================================================================================
+# Unconditional mappings
+# 
================================================================================
+
+# The German es-zed is special--the normal mapping is to SS.
+# Note: the titlecase should never occur in practice. It is equal to 
titlecase(uppercase(<es-zed>))
+
+00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
+
+# Preserve canonical equivalence for I with dot. Turkic is handled below.
+
+0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
+
+# Ligatures
+
+FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
+FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
+FB02; FB02; 0046 006C; 0046 004C; # LATIN SMALL LIGATURE FL
+FB03; FB03; 0046 0066 0069; 0046 0046 0049; # LATIN SMALL LIGATURE FFI
+FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL
+FB05; FB05; 0053 0074; 0053 0054; # LATIN SMALL LIGATURE LONG S T
+FB06; FB06; 0053 0074; 0053 0054; # LATIN SMALL LIGATURE ST
+
+0587; 0587; 0535 0582; 0535 0552; # ARMENIAN SMALL LIGATURE ECH YIWN
+FB13; FB13; 0544 0576; 0544 0546; # ARMENIAN SMALL LIGATURE MEN NOW
+FB14; FB14; 0544 0565; 0544 0535; # ARMENIAN SMALL LIGATURE MEN ECH
+FB15; FB15; 0544 056B; 0544 053B; # ARMENIAN SMALL LIGATURE MEN INI
+FB16; FB16; 054E 0576; 054E 0546; # ARMENIAN SMALL LIGATURE VEW NOW
+FB17; FB17; 0544 056D; 0544 053D; # ARMENIAN SMALL LIGATURE MEN XEH
+
+# No corresponding uppercase precomposed character
+
+0149; 0149; 02BC 004E; 02BC 004E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
+0390; 0390; 0399 0308 0301; 0399 0308 0301; # GREEK SMALL LETTER IOTA WITH 
DIALYTIKA AND TONOS
+03B0; 03B0; 03A5 0308 0301; 03A5 0308 0301; # GREEK SMALL LETTER UPSILON WITH 
DIALYTIKA AND TONOS
+01F0; 01F0; 004A 030C; 004A 030C; # LATIN SMALL LETTER J WITH CARON
+1E96; 1E96; 0048 0331; 0048 0331; # LATIN SMALL LETTER H WITH LINE BELOW
+1E97; 1E97; 0054 0308; 0054 0308; # LATIN SMALL LETTER T WITH DIAERESIS
+1E98; 1E98; 0057 030A; 0057 030A; # LATIN SMALL LETTER W WITH RING ABOVE
+1E99; 1E99; 0059 030A; 0059 030A; # LATIN SMALL LETTER Y WITH RING ABOVE
+1E9A; 1E9A; 0041 02BE; 0041 02BE; # LATIN SMALL LETTER A WITH RIGHT HALF RING
+1F50; 1F50; 03A5 0313; 03A5 0313; # GREEK SMALL LETTER UPSILON WITH PSILI
+1F52; 1F52; 03A5 0313 0300; 03A5 0313 0300; # GREEK SMALL LETTER UPSILON WITH 
PSILI AND VARIA
+1F54; 1F54; 03A5 0313 0301; 03A5 0313 0301; # GREEK SMALL LETTER UPSILON WITH 
PSILI AND OXIA
+1F56; 1F56; 03A5 0313 0342; 03A5 0313 0342; # GREEK SMALL LETTER UPSILON WITH 
PSILI AND PERISPOMENI
+1FB6; 1FB6; 0391 0342; 0391 0342; # GREEK SMALL LETTER ALPHA WITH PERISPOMENI
+1FC6; 1FC6; 0397 0342; 0397 0342; # GREEK SMALL LETTER ETA WITH PERISPOMENI
+1FD2; 1FD2; 0399 0308 0300; 0399 0308 0300; # GREEK SMALL LETTER IOTA WITH 
DIALYTIKA AND VARIA
+1FD3; 1FD3; 0399 0308 0301; 0399 0308 0301; # GREEK SMALL LETTER IOTA WITH 
DIALYTIKA AND OXIA
+1FD6; 1FD6; 0399 0342; 0399 0342; # GREEK SMALL LETTER IOTA WITH PERISPOMENI
+1FD7; 1FD7; 0399 0308 0342; 0399 0308 0342; # GREEK SMALL LETTER IOTA WITH 
DIALYTIKA AND PERISPOMENI
+1FE2; 1FE2; 03A5 0308 0300; 03A5 0308 0300; # GREEK SMALL LETTER UPSILON WITH 
DIALYTIKA AND VARIA
+1FE3; 1FE3; 03A5 0308 0301; 03A5 0308 0301; # GREEK SMALL LETTER UPSILON WITH 
DIALYTIKA AND OXIA
+1FE4; 1FE4; 03A1 0313; 03A1 0313; # GREEK SMALL LETTER RHO WITH PSILI
+1FE6; 1FE6; 03A5 0342; 03A5 0342; # GREEK SMALL LETTER UPSILON WITH PERISPOMENI
+1FE7; 1FE7; 03A5 0308 0342; 03A5 0308 0342; # GREEK SMALL LETTER UPSILON WITH 
DIALYTIKA AND PERISPOMENI
+1FF6; 1FF6; 03A9 0342; 03A9 0342; # GREEK SMALL LETTER OMEGA WITH PERISPOMENI
+
+# IMPORTANT-when iota-subscript (0345) is uppercased or titlecased,
+#  the result will be incorrect unless the iota-subscript is moved to the end
+#  of any sequence of combining marks. Otherwise, the accents will go on the 
capital iota.
+#  This process can be achieved by first transforming the text to NFC before 
casing.
+#  E.g. <alpha><iota_subscript><acute> is uppercased to <ALPHA><acute><IOTA>
+
+# The following cases are already in the UnicodeData.txt file, so are only 
commented here.
+
+# 0345; 0345; 0345; 0399; # COMBINING GREEK YPOGEGRAMMENI
+
+# All letters with YPOGEGRAMMENI (iota-subscript) or PROSGEGRAMMENI (iota 
adscript)
+# have special uppercases.
+# Note: characters with PROSGEGRAMMENI are actually titlecase, not uppercase!
+
+1F80; 1F80; 1F88; 1F08 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND 
YPOGEGRAMMENI
+1F81; 1F81; 1F89; 1F09 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND 
YPOGEGRAMMENI
+1F82; 1F82; 1F8A; 1F0A 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA 
AND YPOGEGRAMMENI
+1F83; 1F83; 1F8B; 1F0B 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA 
AND YPOGEGRAMMENI
+1F84; 1F84; 1F8C; 1F0C 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA 
AND YPOGEGRAMMENI
+1F85; 1F85; 1F8D; 1F0D 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA 
AND YPOGEGRAMMENI
+1F86; 1F86; 1F8E; 1F0E 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND 
PERISPOMENI AND YPOGEGRAMMENI
+1F87; 1F87; 1F8F; 1F0F 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND 
PERISPOMENI AND YPOGEGRAMMENI
+1F88; 1F80; 1F88; 1F08 0399; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND 
PROSGEGRAMMENI
+1F89; 1F81; 1F89; 1F09 0399; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND 
PROSGEGRAMMENI
+1F8A; 1F82; 1F8A; 1F0A 0399; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA 
AND PROSGEGRAMMENI
+1F8B; 1F83; 1F8B; 1F0B 0399; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA 
AND PROSGEGRAMMENI
+1F8C; 1F84; 1F8C; 1F0C 0399; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA 
AND PROSGEGRAMMENI
+1F8D; 1F85; 1F8D; 1F0D 0399; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA 
AND PROSGEGRAMMENI
+1F8E; 1F86; 1F8E; 1F0E 0399; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND 
PERISPOMENI AND PROSGEGRAMMENI
+1F8F; 1F87; 1F8F; 1F0F 0399; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND 
PERISPOMENI AND PROSGEGRAMMENI
+1F90; 1F90; 1F98; 1F28 0399; # GREEK SMALL LETTER ETA WITH PSILI AND 
YPOGEGRAMMENI
+1F91; 1F91; 1F99; 1F29 0399; # GREEK SMALL LETTER ETA WITH DASIA AND 
YPOGEGRAMMENI
+1F92; 1F92; 1F9A; 1F2A 0399; # GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND 
YPOGEGRAMMENI
+1F93; 1F93; 1F9B; 1F2B 0399; # GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND 
YPOGEGRAMMENI
+1F94; 1F94; 1F9C; 1F2C 0399; # GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND 
YPOGEGRAMMENI
+1F95; 1F95; 1F9D; 1F2D 0399; # GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND 
YPOGEGRAMMENI
+1F96; 1F96; 1F9E; 1F2E 0399; # GREEK SMALL LETTER ETA WITH PSILI AND 
PERISPOMENI AND YPOGEGRAMMENI
+1F97; 1F97; 1F9F; 1F2F 0399; # GREEK SMALL LETTER ETA WITH DASIA AND 
PERISPOMENI AND YPOGEGRAMMENI
+1F98; 1F90; 1F98; 1F28 0399; # GREEK CAPITAL LETTER ETA WITH PSILI AND 
PROSGEGRAMMENI
+1F99; 1F91; 1F99; 1F29 0399; # GREEK CAPITAL LETTER ETA WITH DASIA AND 
PROSGEGRAMMENI
+1F9A; 1F92; 1F9A; 1F2A 0399; # GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA 
AND PROSGEGRAMMENI
+1F9B; 1F93; 1F9B; 1F2B 0399; # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA 
AND PROSGEGRAMMENI
+1F9C; 1F94; 1F9C; 1F2C 0399; # GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA 
AND PROSGEGRAMMENI
+1F9D; 1F95; 1F9D; 1F2D 0399; # GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA 
AND PROSGEGRAMMENI
+1F9E; 1F96; 1F9E; 1F2E 0399; # GREEK CAPITAL LETTER ETA WITH PSILI AND 
PERISPOMENI AND PROSGEGRAMMENI
+1F9F; 1F97; 1F9F; 1F2F 0399; # GREEK CAPITAL LETTER ETA WITH DASIA AND 
PERISPOMENI AND PROSGEGRAMMENI
+1FA0; 1FA0; 1FA8; 1F68 0399; # GREEK SMALL LETTER OMEGA WITH PSILI AND 
YPOGEGRAMMENI
+1FA1; 1FA1; 1FA9; 1F69 0399; # GREEK SMALL LETTER OMEGA WITH DASIA AND 
YPOGEGRAMMENI
+1FA2; 1FA2; 1FAA; 1F6A 0399; # GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA 
AND YPOGEGRAMMENI
+1FA3; 1FA3; 1FAB; 1F6B 0399; # GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA 
AND YPOGEGRAMMENI
+1FA4; 1FA4; 1FAC; 1F6C 0399; # GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA 
AND YPOGEGRAMMENI
+1FA5; 1FA5; 1FAD; 1F6D 0399; # GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA 
AND YPOGEGRAMMENI
+1FA6; 1FA6; 1FAE; 1F6E 0399; # GREEK SMALL LETTER OMEGA WITH PSILI AND 
PERISPOMENI AND YPOGEGRAMMENI
+1FA7; 1FA7; 1FAF; 1F6F 0399; # GREEK SMALL LETTER OMEGA WITH DASIA AND 
PERISPOMENI AND YPOGEGRAMMENI
+1FA8; 1FA0; 1FA8; 1F68 0399; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND 
PROSGEGRAMMENI
+1FA9; 1FA1; 1FA9; 1F69 0399; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND 
PROSGEGRAMMENI
+1FAA; 1FA2; 1FAA; 1F6A 0399; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA 
AND PROSGEGRAMMENI
+1FAB; 1FA3; 1FAB; 1F6B 0399; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA 
AND PROSGEGRAMMENI
+1FAC; 1FA4; 1FAC; 1F6C 0399; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA 
AND PROSGEGRAMMENI
+1FAD; 1FA5; 1FAD; 1F6D 0399; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA 
AND PROSGEGRAMMENI
+1FAE; 1FA6; 1FAE; 1F6E 0399; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND 
PERISPOMENI AND PROSGEGRAMMENI
+1FAF; 1FA7; 1FAF; 1F6F 0399; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND 
PERISPOMENI AND PROSGEGRAMMENI
+1FB3; 1FB3; 1FBC; 0391 0399; # GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
+1FBC; 1FB3; 1FBC; 0391 0399; # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
+1FC3; 1FC3; 1FCC; 0397 0399; # GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
+1FCC; 1FC3; 1FCC; 0397 0399; # GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
+1FF3; 1FF3; 1FFC; 03A9 0399; # GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
+1FFC; 1FF3; 1FFC; 03A9 0399; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
+
+# Some characters with YPOGEGRAMMENI also have no corresponding titlecases
+
+1FB2; 1FB2; 1FBA 0345; 1FBA 0399; # GREEK SMALL LETTER ALPHA WITH VARIA AND 
YPOGEGRAMMENI
+1FB4; 1FB4; 0386 0345; 0386 0399; # GREEK SMALL LETTER ALPHA WITH OXIA AND 
YPOGEGRAMMENI
+1FC2; 1FC2; 1FCA 0345; 1FCA 0399; # GREEK SMALL LETTER ETA WITH VARIA AND 
YPOGEGRAMMENI
+1FC4; 1FC4; 0389 0345; 0389 0399; # GREEK SMALL LETTER ETA WITH OXIA AND 
YPOGEGRAMMENI
+1FF2; 1FF2; 1FFA 0345; 1FFA 0399; # GREEK SMALL LETTER OMEGA WITH VARIA AND 
YPOGEGRAMMENI
+1FF4; 1FF4; 038F 0345; 038F 0399; # GREEK SMALL LETTER OMEGA WITH OXIA AND 
YPOGEGRAMMENI
+
+1FB7; 1FB7; 0391 0342 0345; 0391 0342 0399; # GREEK SMALL LETTER ALPHA WITH 
PERISPOMENI AND YPOGEGRAMMENI
+1FC7; 1FC7; 0397 0342 0345; 0397 0342 0399; # GREEK SMALL LETTER ETA WITH 
PERISPOMENI AND YPOGEGRAMMENI
+1FF7; 1FF7; 03A9 0342 0345; 03A9 0342 0399; # GREEK SMALL LETTER OMEGA WITH 
PERISPOMENI AND YPOGEGRAMMENI
+
+# 
================================================================================
+# Conditional Mappings
+# The remainder of this file provides conditional casing data used to produce 
+# full case mappings.
+# 
================================================================================
+# Language-Insensitive Mappings
+# These are characters whose full case mappings do not depend on language, but 
do
+# depend on context (which characters come before or after). For more 
information
+# see the header of this file and the Unicode Standard.
+# 
================================================================================
+
+# Special case for final form of sigma
+
+03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
+
+# Note: the following cases for non-final are already in the UnicodeData.txt 
file.
+
+# 03A3; 03C3; 03A3; 03A3; # GREEK CAPITAL LETTER SIGMA
+# 03C3; 03C3; 03A3; 03A3; # GREEK SMALL LETTER SIGMA
+# 03C2; 03C2; 03A3; 03A3; # GREEK SMALL LETTER FINAL SIGMA
+
+# Note: the following cases are not included, since they would case-fold in 
lowercasing
+
+# 03C3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK SMALL LETTER SIGMA
+# 03C2; 03C3; 03A3; 03A3; Not_Final_Sigma; # GREEK SMALL LETTER FINAL SIGMA
+
+# 
================================================================================
+# Language-Sensitive Mappings
+# These are characters whose full case mappings depend on language and perhaps 
also
+# context (which characters come before or after). For more information
+# see the header of this file and the Unicode Standard.
+# 
================================================================================
+
+# Lithuanian
+
+# Lithuanian retains the dot in a lowercase i when followed by accents.
+
+# Remove DOT ABOVE after "i" with upper or titlecase
+
+0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE
+
+# Introduce an explicit dot above when lowercasing capital I's and J's
+# whenever there are more accents above.
+# (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)
+
+0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
+004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
+012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH 
OGONEK
+00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
+00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
+0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE
+
+# 
================================================================================
+
+# Turkish and Azeri
+
+# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
+# The following rules handle those cases.
+
+0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
+0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
+
+# When lowercasing, remove dot_above in the sequence I + dot_above, which will 
turn into i.
+# This matches the behavior of the canonically equivalent I-dot_above
+
+0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
+0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE
+
+# When lowercasing, unless an I is before a dot_above, it turns into a dotless 
i.
+
+0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
+0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
+
+# When uppercasing, i turns into a dotted capital I
+
+0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
+0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
+
+# Note: the following case is already in the UnicodeData.txt file.
+
+# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
+
+# EOF
+
diff --git a/etc/NEWS b/etc/NEWS
index f2bbead..3396f9f 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -235,13 +235,17 @@ same as in modes where the character is not whitespace.
 Instead of only checking the modification time, Emacs now also checks
 the file's actual content before prompting the user.
 
-** Title case characters are properly cased (from and into).
-'upcase', 'upcase-region' et al. convert title case characters (such
-as ǲ) into their upper case form (such as Ǳ).
+** Various casing improvements.
 
-Similarly, 'capitalize', 'upcase-initials' et al. make use of
-title-case forms of initial characters (correctly producing for example
-ǅungla instead of incorrect Ǆungla).
+*** 'upcase', 'upcase-region' et al. convert title case characters
+(such as ǲ) into their upper case form (such as Ǳ).
+
+*** 'capitalize', 'upcase-initials' et al. make use of title-case forms
+of initial characters (correctly producing for example ǅungla instead
+of incorrect Ǆungla).
+
+*** Characters which turn into multiple ones when cased are correctly handled.
+For example, ﬁ ligature is converted to FI when upper cased.
 
 
 * Changes in Specialized Modes and Packages in Emacs 26.1
diff --git a/src/Makefile.in b/src/Makefile.in
index 89f7a92..98a6181 100644
--- a/src/Makefile.in
+++ b/src/Makefile.in
@@ -711,6 +711,9 @@ $(lwlibdir)/TAGS: FORCE
 tags: TAGS ../lisp/TAGS $(lwlibdir)/TAGS
 .PHONY: tags
 
+special-casing.h: make-special-casing.py ../admin/unidata/SpecialCasing.txt
+       $(AM_V_GEN)
+       python $^ $@
 
 ### Bootstrapping.
 
diff --git a/src/casefiddle.c b/src/casefiddle.c
index a016871..35ff674 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -29,8 +29,16 @@ along with GNU Emacs.  If not, see 
<http://www.gnu.org/licenses/>.  */
 #include "composite.h"
 #include "keymap.h"
 
+struct casing_str_buf {
+  unsigned char data[MAX_MULTIBYTE_LENGTH > 6 ? MAX_MULTIBYTE_LENGTH : 6];
+  unsigned char len_chars;
+  unsigned char len_bytes;
+};
+
 enum case_action {CASE_UP, CASE_DOWN, CASE_CAPITALIZE, CASE_CAPITALIZE_UP};
 
+#include "special-casing.h"
+
 /* State for casing individual characters.  */
 struct casing_context {
   /* A char-table with title-case character mappings or nil.  It being non-nil
@@ -70,25 +78,90 @@ prepare_casing_context (struct casing_context *ctx,
     SETUP_BUFFER_SYNTAX_TABLE ();      /* For syntax_prefix_flag_p.  */
 }
 
-/* Based on CTX, case character CH accordingly.  Update CTX as necessary.
-   Return cased character. */
+/* Based on CTX, case character CH.  If BUF is NULL, return cased character.
+   Otherwise, if BUF is non-NULL, save result in it and return whether the
+   character has been changed.
+
+   Since meaning of return value depends on arguments, it’s more convenient to
+   use case_single_character or case_character instead. */
 static int
-case_character (struct casing_context *ctx, int ch)
+case_character_impl (struct casing_str_buf *buf,
+                    struct casing_context *ctx, int ch)
 {
+  enum case_action flag;
   Lisp_Object prop;
+  bool was_inword;
+  int cased;
 
-  if (ctx->inword)
-    ch = ctx->flag == CASE_CAPITALIZE_UP ? ch : downcase (ch);
+  /* Update inword state */
+  was_inword = ctx->inword;
+  if ((int) ctx->flag >= (int) CASE_CAPITALIZE)
+    ctx->inword = SYNTAX (ch) == Sword &&
+      (!ctx->inbuffer || was_inword || !syntax_prefix_flag_p (ch));
+
+  /* Normalise flag so its one of CASE_UP, CASE_DOWN or CASE_CAPITALIZE. */
+  if (!was_inword)
+    flag = ctx->flag == CASE_UP ? CASE_UP : CASE_CAPITALIZE;
+  else if (ctx->flag != CASE_CAPITALIZE_UP)
+    flag = CASE_DOWN;
+  else
+    {
+      cased = ch;
+      goto done;
+    }
+
+  /* Look through the special casing entries. */
+  if (buf)
+    {
+      const special_casing_char_t *it;
+      for (it = special_casing_code_points; *it && *it <= ch; ++it)
+       if (*it == ch)
+         {
+           const struct casing_str_buf *entry = special_casing_entries +
+             ((it - special_casing_code_points) * 3 + (int)flag);
+           memcpy (buf, entry, sizeof *buf);
+           buf->len_chars &= ~SPECIAL_CASING_NO_CHANGE_BIT;
+           return !(entry->len_chars & SPECIAL_CASING_NO_CHANGE_BIT);
+         }
+    }
+
+  /* Handle simple, one-to-one case. */
+  if (flag == CASE_DOWN)
+    cased = downcase (ch);
   else if (!NILP (ctx->titlecase_char_table) &&
           CHARACTERP (prop = CHAR_TABLE_REF (ctx->titlecase_char_table, ch)))
-    ch = XFASTINT (prop);
+    cased = XFASTINT (prop);
   else
-    ch = upcase(ch);
+    cased = upcase(ch);
+
+  /* And we’re done. */
+ done:
+  if (!buf)
+    return cased;
+  buf->len_chars = 1;
+  buf->len_bytes = CHAR_STRING (cased, buf->data);
+  return cased != ch;
+}
 
-  if ((int) ctx->flag >= (int) CASE_CAPITALIZE)
-    ctx->inword = SYNTAX (ch) == Sword &&
-      (!ctx->inbuffer || ctx->inword || !syntax_prefix_flag_p (ch));
-  return ch;
+/* Based on CTX, case character CH accordingly.  Update CTX as necessary.
+   Return cased character.
+
+   Special casing rules (such as upcase(ﬁ) = FI) are not handled.  For
+   characters whose casing results in multiple code points, the character is
+   returned unchanged. */
+static inline int
+case_single_character (struct casing_context *ctx, int ch)
+{
+  return case_character_impl (NULL, ctx, ch);
+}
+
+/* Save in BUF result of casing character CH.  Return whether casing changed 
the
+   character.  This is like case_single_character but also handles one-to-many
+   casing rules. */
+static inline bool
+case_character (struct casing_str_buf *buf, struct casing_context *ctx, int ch)
+{
+  return case_character_impl (buf, ctx, ch);
 }
 
 static Lisp_Object
@@ -115,7 +188,7 @@ do_casify_integer (struct casing_context *ctx, Lisp_Object 
obj)
               !NILP (BVAR (current_buffer, enable_multibyte_characters)));
   if (! multibyte)
     MAKE_CHAR_MULTIBYTE (ch);
-  cased = case_character (ctx, ch);
+  cased = case_single_character (ctx, ch);
   if (cased == ch)
     return obj;
 
@@ -128,25 +201,34 @@ do_casify_integer (struct casing_context *ctx, 
Lisp_Object obj)
 static Lisp_Object
 do_casify_multibyte_string (struct casing_context *ctx, Lisp_Object obj)
 {
-  ptrdiff_t i, i_byte, size = SCHARS (obj);
-  int len, ch, cased;
+  /* We assume data is the first member of casing_str_buf structure so that if
+     we cast a (char *) into (struct casing_str_buf *) the representation of 
the
+     character is at the beginning of the buffer.  This is why we don’t need
+     separate struct casing_str_buf object but rather write directly to o. */
+  typedef char static_assertion[offsetof(struct casing_str_buf, data) ? -1 : 
1];
+
+  ptrdiff_t size = SCHARS (obj), n;
+  int ch;
   USE_SAFE_ALLOCA;
-  ptrdiff_t o_size;
-  if (INT_MULTIPLY_WRAPV (size, MAX_MULTIBYTE_LENGTH, &o_size))
-    o_size = PTRDIFF_MAX;
-  unsigned char *dst = SAFE_ALLOCA (o_size);
+  if (INT_MULTIPLY_WRAPV (size, MAX_MULTIBYTE_LENGTH, &n) ||
+      INT_ADD_WRAPV (n, sizeof(struct casing_str_buf), &n))
+    n = PTRDIFF_MAX;
+  unsigned char *const dst = SAFE_ALLOCA (n), *const dst_end = dst + n;
   unsigned char *o = dst;
 
-  for (i = i_byte = 0; i < size; i++, i_byte += len)
+  const unsigned char *src = SDATA (obj);
+
+  for (n = 0; size; --size)
     {
-      if (o_size - MAX_MULTIBYTE_LENGTH < o - dst)
+      if (dst_end - o < sizeof(struct casing_str_buf))
        string_overflow ();
-      ch = STRING_CHAR_AND_LENGTH (SDATA (obj) + i_byte, len);
-      cased = case_character (ctx, ch);
-      o += CHAR_STRING (cased, o);
+      ch = STRING_CHAR_ADVANCE (src);
+      case_character ((void *)o, ctx, ch);
+      n += ((struct casing_str_buf *)o)->len_chars;
+      o += ((struct casing_str_buf *)o)->len_bytes;
     }
-  eassert (o - dst <= o_size);
-  obj = make_multibyte_string ((char *) dst, size, o - dst);
+  eassert (o <= dst_end);
+  obj = make_multibyte_string ((char *) dst, n, o - dst);
   SAFE_FREE ();
   return obj;
 }
@@ -162,7 +244,7 @@ do_casify_unibyte_string (struct casing_context *ctx, 
Lisp_Object obj)
     {
       ch = SREF (obj, i);
       MAKE_CHAR_MULTIBYTE (ch);
-      cased = case_character (ctx, ch);
+      cased = case_single_character (ctx, ch);
       if (ch == cased)
        continue;
       MAKE_CHAR_UNIBYTE (cased);
@@ -194,7 +276,9 @@ casify_object (enum case_action flag, Lisp_Object obj)
 DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
        doc: /* Convert argument to upper case and return that.
 The argument may be a character or string.  The result has the same type.
-The argument object is not altered--the value is a copy.
+The argument object is not altered--the value is a copy.  If argument
+is a character, characters which map to multiple code points when
+cased, e.g. ﬁ, are returned unchanged.
 See also `capitalize', `downcase' and `upcase-initials'.  */)
   (Lisp_Object obj)
 {
@@ -215,7 +299,9 @@ DEFUN ("capitalize", Fcapitalize, Scapitalize, 1, 1, 0,
 This means that each word's first character is upper case (more
 precisely, if available, title case) and the rest is lower case.
 The argument may be a character or string.  The result has the same type.
-The argument object is not altered--the value is a copy.  */)
+The argument object is not altered--the value is a copy.  If argument
+is a character, characters which map to multiple code points when
+cased, e.g. ﬁ, are returned unchanged.  */)
   (Lisp_Object obj)
 {
   return casify_object (CASE_CAPITALIZE, obj);
@@ -228,7 +314,9 @@ DEFUN ("upcase-initials", Fupcase_initials, 
Supcase_initials, 1, 1, 0,
 (More precisely, if available, initial of each word is converted to
 title-case).  Do not change the other letters of each word.
 The argument may be a character or string.  The result has the same type.
-The argument object is not altered--the value is a copy.  */)
+The argument object is not altered--the value is a copy.  If argument
+is a character, characters which map to multiple code points when
+cased, e.g. ﬁ, are returned unchanged.  */)
   (Lisp_Object obj)
 {
   return casify_object (CASE_CAPITALIZE_UP, obj);
@@ -250,7 +338,7 @@ do_casify_unibyte_region (struct casing_context *ctx,
       ch = FETCH_BYTE (pos);
       MAKE_CHAR_MULTIBYTE (ch);
 
-      cased = case_character (ctx, ch);
+      cased = case_single_character (ctx, ch);
       if (cased == ch)
        continue;
 
@@ -271,48 +359,54 @@ do_casify_unibyte_region (struct casing_context *ctx,
    characters were changed, return -1 and *ENDP is unspecified. */
 static ptrdiff_t
 do_casify_multibyte_region (struct casing_context *ctx,
-                           ptrdiff_t pos, ptrdiff_t *endp)
+                           ptrdiff_t *startp, ptrdiff_t *endp)
 {
   ptrdiff_t first = -1, last = -1;  /* Position of first and last changes. */
-  ptrdiff_t pos_byte = CHAR_TO_BYTE (pos), end = *endp;
-  ptrdiff_t opoint = PT;
+  ptrdiff_t pos = *startp, pos_byte = CHAR_TO_BYTE (pos), size = *endp - pos;
+  ptrdiff_t opoint = PT, added;
+  struct casing_str_buf buf;
   int ch, cased, len;
 
-  while (pos < end)
+  for (; size; --size)
     {
       ch = STRING_CHAR_AND_LENGTH (BYTE_POS_ADDR (pos_byte), len);
-      cased = case_character (ctx, ch);
-      if (cased != ch)
+
+      if (!case_character (&buf, ctx, ch))
        {
-         last = pos;
-         if (first < 0)
-           first = pos;
-
-         if (ASCII_CHAR_P (cased) && ASCII_CHAR_P (ch))
-           FETCH_BYTE (pos_byte) = cased;
-         else
-           {
-             unsigned char str[MAX_MULTIBYTE_LENGTH];
-             int totlen = CHAR_STRING (cased, str);
-             if (len == totlen)
-               memcpy (BYTE_POS_ADDR (pos_byte), str, len);
-             else
-               /* Replace one character with the other(s), keeping text
-                  properties the same.  */
-               replace_range_2 (pos, pos_byte, pos + 1, pos_byte + len,
-                                (char *) str, 9, totlen, 0);
-             len = totlen;
-           }
+         pos_byte += len;
+         ++pos;
+         continue;
        }
-      pos++;
-      pos_byte += len;
+
+      last = pos + buf.len_chars;
+      if (first < 0)
+       first = pos;
+
+      if (buf.len_chars == 1 && buf.len_bytes == len)
+       memcpy (BYTE_POS_ADDR (pos_byte), buf.data, len);
+      else
+       {
+         /* Replace one character with the other(s), keeping text
+            properties the same.  */
+         replace_range_2 (pos, pos_byte, pos + 1, pos_byte + len,
+                          (const char *) buf.data, buf.len_chars,
+                          buf.len_bytes,
+                          0);
+         added += buf.len_chars - 1;
+         if (opoint > pos)
+           opoint += buf.len_chars - 1;
+       }
+
+      pos_byte += buf.len_bytes;
+      pos += buf.len_chars;
     }
 
   if (PT != opoint)
     TEMP_SET_PT_BOTH (opoint, CHAR_TO_BYTE (opoint));
 
+  *startp = first;
   *endp = last;
-  return first;
+  return added;
 }
 
 /* flag is CASE_UP, CASE_DOWN or CASE_CAPITALIZE or CASE_CAPITALIZE_UP.
@@ -320,8 +414,8 @@ do_casify_multibyte_region (struct casing_context *ctx,
 static void
 casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 {
+  ptrdiff_t start, end, added;
   struct casing_context ctx;
-  ptrdiff_t start, end;
 
   if (EQ (b, e))
     /* Not modifying because nothing marked */
@@ -337,12 +431,12 @@ casify_region (enum case_action flag, Lisp_Object b, 
Lisp_Object e)
   if (NILP (BVAR (current_buffer, enable_multibyte_characters)))
     start = do_casify_unibyte_region (&ctx, start, &end);
   else
-    start = do_casify_multibyte_region (&ctx, start, &end);
+    added = do_casify_multibyte_region (&ctx, &start, &end);
 
   if (start >= 0)
     {
-      signal_after_change (start, end + 1 - start, end + 1 - start);
-      update_compositions (start, end + 1, CHECK_ALL);
+      signal_after_change (start, end - start - added, end - start);
+      update_compositions (start, end, CHECK_ALL);
     }
 }
 
diff --git a/src/deps.mk b/src/deps.mk
index 72f68ca..1c24414 100644
--- a/src/deps.mk
+++ b/src/deps.mk
@@ -49,7 +49,7 @@ callproc.o: callproc.c epaths.h buffer.h commands.h lisp.h 
$(config_h) \
    composite.h w32.h blockinput.h atimer.h systime.h frame.h termhooks.h \
    buffer.h gnutls.h dispextern.h ../lib/unistd.h globals.h
 casefiddle.o: casefiddle.c syntax.h commands.h buffer.h character.h \
-   composite.h keymap.h lisp.h globals.h $(config_h)
+   composite.h keymap.h special-casing.h lisp.h globals.h $(config_h)
 casetab.o: casetab.c buffer.h character.h lisp.h globals.h $(config_h)
 category.o: category.c category.h buffer.h charset.h keymap.h  \
    character.h lisp.h globals.h $(config_h)
diff --git a/src/make-special-casing.py b/src/make-special-casing.py
new file mode 100644
index 0000000..e8725e3
--- /dev/null
+++ b/src/make-special-casing.py
@@ -0,0 +1,189 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+"""generate-special-casing.py --- generate special-casing.h file
+Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GNU Emacs.
+
+GNU Emacs is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or (at
+your option) any later version.
+
+GNU Emacs is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.
+"""
+
+import os
+import re
+import sys
+import tempfile
+import textwrap
+
+TEMPLATE = '''\
+/* Special case mapping rules.  Only unconditional entries are included.
+   This file is automatically generated from SpecialCasing.txt file
+   distributed with Unicode standard by %(generator)s.
+   Do not edit manually. */
+
+#include <stdint.h>
+
+struct special_casing_static_asserts {
+%(asserts)s
+};
+
+typedef %(code_point_type)s special_casing_char_t;
+
+/* Zero-terminated, sorted list. */
+static const special_casing_char_t special_casing_code_points[] = {
+%(code_points)s
+};
+
+/* If buf.len_chars has this bit set, the character maps to itself. */
+#define SPECIAL_CASING_NO_CHANGE_BIT 0x80
+
+static const struct casing_str_buf special_casing_entries[] = {
+%(entries)s
+};
+'''
+
+MAX_DATA_BYTES_LENGTH = 6
+
+ASSERTS = (
+    ('casing_str_buf_data_must_be_at_least_%d_chars' % MAX_DATA_BYTES_LENGTH,
+     'sizeof ((struct casing_str_buf*)0)->data >= %d' % MAX_DATA_BYTES_LENGTH),
+    ('CASE_UP_must_equal_0', 'CASE_UP == 0'),
+    ('CASE_DOWN_must_equal_1', 'CASE_DOWN == 1'),
+    ('CASE_CAPITALIZE_must_equal_2', 'CASE_CAPITALIZE == 2')
+)
+
+
+def encode(code, code_points):
+    """Convert a space-separated list of code points into UTF-8 C-string.
+
+    Args:
+        code: Code point this mapping is for.
+        code_points: A space-separated list of hexadecimal numbers representing
+            code points in desired representation.
+
+    Returns:
+        A (literal, len_chars, len_bytes) tuple.  len_chars may be zero if code
+        point maps to itself.
+    """
+    code_points = [int(cp, 16) for cp in code_points.split()]
+    len_chars = len(code_points)
+    if len_chars == 1 and code_points[0] == code:
+        len_chars = len_chars | 0x80
+    val = ''.join(unichr(cp) for cp in code_points).encode('utf-8')
+    ret = ''
+    for ch in val:
+        o = ord(ch)
+        if o < 32 or o >= 127:
+            ch = '\\x%02x' % o
+        ret += ch
+    return '"%s"' % ret, len_chars, len(val)
+
+
+def read_entries(fd):
+    """Read entries from SpecialCasing.txt file.
+
+    Conditional entries are ignored.
+
+    Args:
+        fd: File object to read data from.
+
+    Returns:
+        A list of [code, up_lit, up_len, down_lit, down_len, title_lit,
+        title_len, comment] lists.
+    """
+    idx_lower = 1
+    idx_title = 2
+    idx_upper = 3
+    # This order must match CASE_UP, CASE_DOWN and CASE_CAPITALIZE_UP from
+    # casefiddle.c.  This ordering (which is different than in 
SpecialCasing.txt
+    # file) is checked via static asserts included in the generated file.
+    indexes = (idx_upper, idx_lower, idx_title)
+
+    entries = []
+    for line in fd:
+        line = line.strip()
+        if not line or line[0] == '#':
+            continue
+
+        line = re.split(r';\s*', line)
+        if len(line) == 6:
+            # Conditional special casing don’t go into special-casing.h
+            #sys.stderr.write('make-special-casing: %s: conditions present '
+            #                 '(%s), ignoring\n' % (line[0], line[4]))
+            continue
+
+        code = int(line[0], 16)
+        entry = [code]
+
+        for i in indexes:
+            val = encode(code, line[i])
+            entry.append(val)
+            if val:
+                # The data structure we’re using assumes that all C strings are
+                # no more than six bytes (excluding NUL terminator).  Enforce
+                # that here.
+                assert val[2] <= MAX_DATA_BYTES_LENGTH, (code, i, val)
+
+        entry.append(line[4].strip(' #').capitalize())
+        entries.append(entry)
+
+    entries.sort()
+    return entries
+
+
+def format_output(entries):
+    # If all code points are 16-bit prefer using uint16_t since it makes the
+    # array smaller and more cache friendly.
+    if all(entry[0] <= 0xffff for entry in entries):
+        cp_type, cp_fmt = 'uint16_t', '%04X'
+    else:
+        cp_type, cp_fmt = 'uint32_t', '%06X'
+
+    fmt = '0x%sU' % cp_fmt;
+    code_points = ', '.join(fmt % entry[0] for entry in entries)
+
+    lines = []
+    for entry in entries:
+        lines.append('    /* U+%s %%s */' % cp_fmt % (entry[0], entry[4]))
+        for val in entry[1:4]:
+            lines.append('    { %s, %d, %d },' % val)
+    lines[-1] = lines[-1].rstrip(',')
+
+    return TEMPLATE % {
+        'generator': os.path.basename(__file__),
+        'asserts': '\n'.join('  char %s[%s ? 1 : -1];' % p for p in ASSERTS),
+        'code_point_type': cp_type,
+        'code_points': textwrap.fill(code_points + ', 0', width=80,
+                                     initial_indent='    ',
+                                     subsequent_indent='    '),
+        'entries': '\n'.join(lines)
+    }
+
+
+def main(argv):
+    if len(argv) != 3:
+        sys.stderr.write('usage: %s SpecialCasing.txt special-casing.h\n' %
+                         argv[0])
+        sys.exit(1)
+
+    with open(argv[1]) as fd:
+        entries = read_entries(fd)
+
+    data = format_output(entries)
+
+    with open(argv[2], 'w') as fd:
+        fd.write(data)
+
+
+if __name__ == '__main__':
+    main(sys.argv)
diff --git a/test/lisp/char-fold-tests.el b/test/lisp/char-fold-tests.el
index 485254a..821c701 100644
--- a/test/lisp/char-fold-tests.el
+++ b/test/lisp/char-fold-tests.el
@@ -54,6 +54,14 @@ char-fold--test-search-with-contents
        (concat w1 "\s\n\s\t\f\t\n\r\t" w2)
        (concat w1 (make-string 10 ?\s) w2)))))
 
+(defun char-fold--ascii-upcase (string)
+  "Like `upcase' but acts on ASCII characters only."
+  (replace-regexp-in-string "[a-z]+" 'upcase string))
+
+(defun char-fold--ascii-downcase (string)
+  "Like `downcase' but acts on ASCII characters only."
+  (replace-regexp-in-string "[a-z]+" 'downcase string))
+
 (defun char-fold--test-match-exactly (string &rest strings-to-match)
   (let ((re (concat "\\`" (char-fold-to-regexp string) "\\'")))
     (dolist (it strings-to-match)
@@ -61,8 +69,8 @@ char-fold--test-match-exactly
     ;; Case folding
     (let ((case-fold-search t))
       (dolist (it strings-to-match)
-        (should (string-match (upcase re) (downcase it)))
-        (should (string-match (downcase re) (upcase it)))))))
+        (should (string-match (char-fold--ascii-upcase re) (downcase it)))
+        (should (string-match (char-fold--ascii-downcase re) (upcase it)))))))
 
 (ert-deftest char-fold--test-some-defaults ()
   (dolist (it '(("ffl" . "ﬄ") ("ffi" . "ﬃ")
diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el
index def74a0..ae557d7 100644
--- a/test/src/casefiddle-tests.el
+++ b/test/src/casefiddle-tests.el
@@ -143,16 +143,13 @@ casefiddle-tests--characters
               ("ǄUNGLA" "ǄUNGLA" "ǆungla" "ǅungla" "ǅUNGLA")
               ("ǅungla" "ǄUNGLA" "ǆungla" "ǅungla" "ǅungla")
               ("ǆungla" "ǄUNGLA" "ǆungla" "ǅungla" "ǅungla")
+              ("deﬁne" "DEFINE" "deﬁne" "Deﬁne" "Deﬁne")
+              ("ﬁsh" "FISH" "ﬁsh" "Fish" "Fish")
+              ("Straße" "STRASSE" "straße" "Straße" "Straße")
               ;; FIXME: Everything below is broken at the moment.  Here’s what
               ;; should happen:
-              ;;("deﬁne" "DEFINE" "deﬁne" "Deﬁne" "Deﬁne")
-              ;;("ﬁsh" "FIsh" "ﬁsh" "Fish" "Fish")
-              ;;("Straße" "STRASSE" "straße" "Straße" "Straße")
               ;;("ΌΣΟΣ" "ΌΣΟΣ" "όσος" "Όσος" "Όσος")
               ;; And here’s what is actually happening:
-              ("deﬁne" "DEﬁNE" "deﬁne" "Deﬁne" "Deﬁne")
-              ("ﬁsh" "ﬁSH" "ﬁsh" "ﬁsh" "ﬁsh")
-              ("Straße" "STRAßE" "straße" "Straße" "Straße")
               ("ΌΣΟΣ" "ΌΣΟΣ" "όσοσ" "Όσοσ" "ΌΣΟΣ")
 
               ("όσος" "ΌΣΟΣ" "όσος" "Όσος" "Όσος"))
-- 
2.8.0.rc3.226.g39d4020
[Prev in Thread]
Current Thread
[Next in Thread]
bug#24603: [RFC 02/18] Generate upcase and downcase tables from Unicode data, (continued)
- bug#24603: [PATCH 0/3] Case table updates, Michal Nazarewicz, 2016/10/17
  - bug#24603: [PATCH 3/3] Don’t generate ‘X maps to X’ entries in case tables, Michal Nazarewicz, 2016/10/17
  - bug#24603: [PATCH 1/3] Add tests for casefiddle.c, Michal Nazarewicz, 2016/10/17
  - bug#24603: [PATCH 2/3] Generate upcase and downcase tables from Unicode data, Michal Nazarewicz, 2016/10/17
  - bug#24603: [PATCH 0/3] Case table updates, Eli Zaretskii, 2016/10/18
    - bug#24603: [PATCH 0/3] Case table updates, Michal Nazarewicz, 2016/10/24
    - bug#24603: [PATCH 0/3] Case table updates, Eli Zaretskii, 2016/10/24
Prev by Date: bug#24603: [RFC 10/18] Implement Turkic dotless and dotted i handling when casing strings
Next by Date: bug#24600: 25.2.50; Improper unicode display on MacOS
Previous by thread: bug#24603: [RFC 10/18] Implement Turkic dotless and dotted i handling when casing strings
Next by thread: bug#24603: [RFC 08/18] Support casing characters which map into multiple code points
Index(es):
- Date
- Thread