bug#37009: EWW Gets Confused on Invalid HTML
From: Lars Ingebrigtsen
Subject: bug#37009: EWW Gets Confused on Invalid HTML
Date: Tue, 13 Aug 2019 11:45:22 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

Eli Zaretskii <eliz@gnu.org> writes:
>> I'm not sure how feasible it will be to fix this at all. Eww relies on
>> libxml for parsing, and it's not as flexible as a typical web browser:
>>
>> (with-temp-buffer
>> (insert "<html>
>> <body>abc <- xyz<body>
>> </html>")
>> (libxml-parse-html-region (point-min) (point-max)))
>>
>> ;=> (html nil (body nil "abc\n"))
>
> Maybe we should report this to libxml developers and hear their
> opinion?

If libxml2 would add the standard work-arounds that most browsers use to
handle invalid HTML, that would be nice.

But it's not that difficult to add some pre-processing to handle the
most common cases ourselves.

For instance, if what follows the < isn't a letter (or an exclamation
point), then it should probably be &lt; instead.  That would have fixed
the problem in this case, and is something I think shr should do.
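
A minimal sketch of that kind of pre-processing, not the actual shr code.
The function name `my-escape-stray-lt' is hypothetical, and treating "/" as
a valid tag start (so that closing tags like </html> survive) is an
assumption added on top of the letter and exclamation point rule above.  It
makes no attempt to skip script elements, comments or CDATA sections.

```elisp
;; Hypothetical helper for illustration; not part of shr or eww.
(defun my-escape-stray-lt (start end)
  "Replace each `<' between START and END that cannot open a tag with `&lt;'.
A `<' is kept when the next character is an ASCII letter, `!' or `/'.
Allowing `/' is an assumption, made so that closing tags are not escaped."
  (save-excursion
    (save-restriction
      (narrow-to-region start end)
      (goto-char (point-min))
      (while (search-forward "<" nil t)
        ;; Point is now just after the `<'.  `looking-at-p' leaves the
        ;; match data alone, so `replace-match' still refers to the `<'.
        (unless (looking-at-p "[a-zA-Z!/]")
          (replace-match "&lt;" t t))))))
```

Run on the example from earlier in the thread, before handing the buffer
to libxml-parse-html-region:

```elisp
(with-temp-buffer
  (insert "<html>\n<body>abc <- xyz<body>\n</html>")
  (my-escape-stray-lt (point-min) (point-max))
  (buffer-string))
;; => "<html>\n<body>abc &lt;- xyz<body>\n</html>"
```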

But you can go pretty far down the rabbit hole in being lenient with
invalid HTML, and I think it's probably best not to go any further down
that road.

--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no