discuss-gnustep

Re: GNUstep Web browser (was Re: WebKit Bounty)


From: Robert Slover
Subject: Re: GNUstep Web browser (was Re: WebKit Bounty)
Date: Sun, 04 Mar 2007 20:49:59 -0500


On Mar 4, 2007, at 2:12 AM, Gregory John Casamento wrote:

Rogelio,

... [elided] ...

If html is so easy to do wrong and so hard to handle then we put a
bullet in the s*****'s head  and move on.

It's not that easy... it's nice to say that we will make a parser that will only handle "correct" HTML, but when you consider that this will make the browser virtually useless for navigating almost half of the web pages out there, the idea loses its appeal. If you write a from-scratch implementation, you will need to handle such pages if you want anyone to actually use it.

Later, GJC
... [elided] ...

I do not know if this helps or not, but I'll make the suggestion anyway. Several years ago I needed a parser for a project at work that could extract all of the links and URL references in a set of related HTML documents, then let me rewrite the documents. This had two purposes -- rewriting a set of HTML pages as a multi-part related MIME message, including all images and directly related documents, for emailing; and 'retargeting' -- moving a set of related HTML pages into an altered hierarchy simply by describing the relationships between two hierarchies (from the one used in our application to the one used by an arbitrary customer Intranet) and a starting point.

The real monkey wrench was that the HTML was often very sloppy, containing fragments of HTML customers had entered themselves to customize the output, as well as incorrect HTML produced by 3rd-party software modules (which we had source to, but no budgeted time to fix). We could do something about the latter, but not the former. My solution was to use HTML-Tidy, a W3C project by Dave Raggett (http://www.w3.org/People/Raggett/tidy/). There was a project underway at the time to turn Tidy into a library, but it still had a way to go -- so, instead, one of our developers took about 3 days and turned it into a library suited to our purpose that worked where we needed it to -- AIX and Solaris. He gave it an interface that was very much like SAX, on top of which we wrote our logic to rewrite pages on the fly. The Tidy code was very clean, easy-to-understand C, so this was a straightforward endeavor.

We were then able to handle broken pages, with the added advantage that pages externalized by the application in this way were also "correct" HTML, regardless of fragmentary or incorrect input. This has worked so well that we've not had to touch it since (5 or 6 years).
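
To give a rough idea of what a SAX-like layer over a forgiving parser looks like, here is a minimal sketch. It is not our code and it does not use Tidy: it substitutes Python's standard-library html.parser, which is also event-driven and tolerant of sloppy markup. The names LinkCollector, collect_links, retarget, and URL_ATTRS are made up for illustration, the attribute list is deliberately incomplete, and the prefix-swap in retarget is a much simpler stand-in for the hierarchy mapping described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly carry URL references (illustrative, not exhaustive).
URL_ATTRS = {"href", "src", "background", "action"}


class LinkCollector(HTMLParser):
    """Collect URL references through SAX-style start-tag callbacks."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative references against the document's base URL.
                self.links.append((tag, name, urljoin(self.base_url, value)))


def collect_links(html, base_url):
    """Return (tag, attribute, absolute URL) tuples found in possibly broken HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def retarget(url, old_prefix, new_prefix):
    """Move a URL from one hierarchy to another by swapping its prefix."""
    if url.startswith(old_prefix):
        return new_prefix + url[len(old_prefix):]
    return url


if __name__ == "__main__":
    # Unquoted attribute, uppercase tag, and unclosed elements on purpose.
    sloppy = '<p>Unclosed <A HREF=page2.html>link <img src="img/logo.gif"><b>bold'
    for tag, attr, url in collect_links(sloppy, "http://example.com/docs/index.html"):
        print(tag, attr, retarget(url, "http://example.com/docs/",
                                  "http://intranet.example/help/"))
```

Unlike Tidy, this stand-in only reports what it sees and does not emit repaired markup, so it covers the link-extraction half of the job but not the "output is always correct HTML" half.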

There, of course, now exists the official TidyLib, which I do not know a lot about, but it could be a useful tool in getting from the point of having a renderer that works with correct HTML/XML to one that can understand the bulk of the incorrect HTML that exists in the real world.

--Robert




