bug-gnu-emacs

bug#37009: EWW Gets Confused on Invalid HTML


From: Lars Ingebrigtsen
Subject: bug#37009: EWW Gets Confused on Invalid HTML
Date: Tue, 13 Aug 2019 11:45:22 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

Eli Zaretskii <eliz@gnu.org> writes:

>> I'm not sure how feasible it will be to fix this at all.  Eww relies on
>> libxml for parsing, and it's not as flexible as a typical web browser:
>> 
>>     (with-temp-buffer
>>       (insert "<html>
>>       <body>abc <- xyz<body>
>>     </html>")
>>       (libxml-parse-html-region (point-min) (point-max)))
>> 
>>     ;=> (html nil (body nil "abc\n"))
>
> Maybe we should report this to libxml developers and hear their
> opinion?

If libxml2 would add the standard work-arounds that most browsers use to
handle invalid HTML, that would be nice.

But it's not that difficult to add some pre-processing to handle the
most common cases ourselves.

For instance, if what follows the < isn't a letter (or an exclamation
point, or the slash of a closing tag), then it should probably be &lt;
instead.  That would have fixed the problem in this case, and is
something I think shr should do.
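
A minimal sketch of that kind of pre-processing, to make the idea
concrete.  The function name `my-escape-stray-lt' is hypothetical and
is not part of shr or eww.  It keeps a < only when the next character
is a letter, `!', `/' or `?' (so closing tags and processing
instructions survive) and rewrites every other < as &lt;.  It ignores
comments, CDATA and <script> bodies, so it is an illustration rather
than something ready to install.

```elisp
;; Hypothetical helper for illustration; not an existing shr/eww function.
(defun my-escape-stray-lt (start end)
  "Replace each `<' between START and END that cannot begin a tag with `&lt;'.
A `<' is kept only when followed by a letter, `!', `/' or `?'."
  (save-excursion
    (save-restriction
      ;; Narrowing keeps the search inside the region even as the
      ;; text grows while we insert the longer entity.
      (narrow-to-region start end)
      (goto-char (point-min))
      ;; Match `<' followed by a character that cannot start a tag,
      ;; or a `<' that is the last character of the region.
      (while (re-search-forward "<\\([^a-zA-Z!/?]\\|\\'\\)" nil t)
        (goto-char (match-beginning 0))
        (delete-char 1)
        (insert "&lt;")))))

;; Usage: clean the text first, then hand the region to the parser.
(with-temp-buffer
  (insert "<html>\n  <body>abc <- xyz</body>\n</html>")
  (my-escape-stray-lt (point-min) (point-max))
  (buffer-string))
;; => "<html>\n  <body>abc &lt;- xyz</body>\n</html>"
```

After this step the buffer can be passed to `libxml-parse-html-region'
as before; the stray < no longer looks like the start of a tag.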

But you can go pretty far down the rabbit hole in being lenient with
invalid HTML, and I think it's probably best not to go any further down
that road.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




