From: Lars Ingebrigtsen
Subject: bug#40794: 26.3; HTML entities ☆ and ★ (inter alia) are not parsed by libxml-parse-html-region
Date: Wed, 29 Jul 2020 07:35:51 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
I had a look at the libxml2 sources. The logic isn't really explained,
but apparently they include all the <255-value entities, and then a
selected number of the other entities (about 160 of them).
I have no idea what the logic behind this is... perhaps they've just
forgotten to add the new ones? Which makes me think that this is really
a libxml2 bug, and you should report it there instead.
Excerpt:
/************************************************************************
* *
* The list of HTML predefined entities *
* *
************************************************************************/
static const htmlEntityDesc html40EntitiesTable[] = {
/*
* the 4 absolute ones, plus apostrophe.
*/
{ 34, "quot", "quotation mark = APL quote, U+0022 ISOnum" },
{ 38, "amp", "ampersand, U+0026 ISOnum" },
{ 39, "apos", "single quote" },
{ 60, "lt", "less-than sign, U+003C ISOnum" },
{ 62, "gt", "greater-than sign, U+003E ISOnum" },
/*
* A bunch still in the 128-255 range
* Replacing them depend really on the charset used.
*/
{ 160, "nbsp", "no-break space = non-breaking space, U+00A0 ISOnum" },
{ 161, "iexcl","inverted exclamation mark, U+00A1 ISOnum" },
{ 162, "cent", "cent sign, U+00A2 ISOnum" },
[...]
{ 376, "Yuml", "latin capital letter Y with diaeresis, U+0178 ISOlat2" },
/*
* Anything below should really be kept as entities references
*/
{ 402, "fnof", "latin small f with hook = function = florin, U+0192 ISOtech" },
{ 710, "circ", "modifier letter circumflex accent, U+02C6 ISOpub" },
{ 732, "tilde","small tilde, U+02DC ISOdia" },
{ 913, "Alpha","greek capital letter alpha, U+0391" },
{ 914, "Beta", "greek capital letter beta, U+0392" },
{ 915, "Gamma","greek capital letter gamma, U+0393 ISOgrk3" },
{ 916, "Delta","greek capital letter delta, U+0394 ISOgrk3" },
[...]
{ 9824, "spades","black spade suit, U+2660 ISOpub" },
{ 9827, "clubs","black club suit = shamrock, U+2663 ISOpub" },
{ 9829, "hearts","black heart suit = valentine, U+2665 ISOpub" },
{ 9830, "diams","black diamond suit, U+2666 ISOpub" },
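To illustrate why a name that's missing from that table never gets resolved, here's a minimal sketch of a name-based lookup over a few rows copied from the excerpt. This is not libxml2's actual code: the struct, the table and the function (sketch_entity, sketch_table, sketch_lookup) are made-up names, the description field is dropped, and the plain linear search is an assumption for illustration only. In libxml2 itself, the public entry point for this kind of query is htmlEntityLookup().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for libxml2's htmlEntityDesc
 * (the description string is left out). */
typedef struct {
    unsigned int value;   /* Unicode code point */
    const char *name;     /* entity name without '&' and ';' */
} sketch_entity;

/* A handful of rows copied from the excerpt; the real table is longer. */
static const sketch_entity sketch_table[] = {
    { 34,   "quot"   },
    { 38,   "amp"    },
    { 160,  "nbsp"   },
    { 913,  "Alpha"  },
    { 9824, "spades" },
    { 9829, "hearts" },
};

/* Case-sensitive linear search by name.
 * Returns NULL when the name is not in the table. */
static const sketch_entity *sketch_lookup(const char *name)
{
    size_t n = sizeof(sketch_table) / sizeof(sketch_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(sketch_table[i].name, name) == 0)
            return &sketch_table[i];
    }
    return NULL;
}
```

With this table, sketch_lookup("hearts") returns the row for 9829, while a name that has no row (for instance "starf", assuming that is one of the entity names from the report) returns NULL, so a caller has nothing to substitute and the reference is left unparsed.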
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no