guile-user

Using shtml with htmlprag - output of shtml->html is different to some given HTML


From: Kenan Toker
Subject: Using shtml with htmlprag - output of shtml->html is different to some given HTML
Date: Wed, 4 Sep 2019 22:52:54 +1000

Hi guile-users,

Hope you're all very well! I have a question about using shtml with
htmlprag. As far as I know this module isn't actually part of Guile,
and it looks quite old now and may no longer be under active
development, but if anyone has any insights I'd be keen to get
another set of eyes on the issue I'm having.

I'm new to Guile, and to learn the language I'm building a web crawler.
As part of this, I'm using htmlprag and sxpath to convert some HTML to
shtml and pull some interesting data out of the shtml.

I have the following HTML (I wrote this up for example's sake):

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
               src="https://www.gnu.org/software/guile/static/base/img/branding.png">
          <div>
            <p id="labelName">A label for the header.</p>
          </div>
          <p id="labelDescription">Some description of the header.</p>
        </header>
        <div id="exampleDiv">
          <hr>
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

When I used html->shtml I got the following shtml:

    (*TOP* (*DECL* DOCTYPE html)
     (html
        (head
          (title Example)
       )
        (body
          (header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )) (p (@ (id labelName)) A label for the header.)
           
            (p (@ (id labelDescription)) Some description of the header.)
         
          (div (@ (id exampleDiv))
            (hr)
            (div (@ (id divMessage)) An example message.)
         )
          (footer (@ (id footer)))
       )
    )
    )

However, I would have expected something like (div (p (@ (id labelName))
A label for the header.)) under the header[@class="exampleHeader"] tag
(I haven't tested this exact s-expression, though). Instead, the p tag
sits outside the div tag.

When I do shtml->html over this shtml, I get the following html:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage"
               src="https://www.gnu.org/software/guile/static/base/img/branding.png" />
          <div>
            </div></header><p id="labelName">A label for the header.</p>
         
          <p id="labelDescription">Some description of the header.</p>
       
        <div id="exampleDiv">
          <hr />
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

The p[@id="labelName"] tag no longer sits under the div tag. This means
when I use an sxpath expression like '(// html body (header (@ (eq?
"exampleHeader")))), I get the img tag and an empty div tag, but no p
tag - like so:

    ((header (@ (class exampleHeader))
            (img (@ (id bannerImage) (src
    https://www.gnu.org/software/guile/static/base/img/branding.png)))
            (div
             )))
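
To make the query concrete, here is a minimal sketch that runs sxpath
over a hand-trimmed copy of the shtml above, so it needs only Guile's
bundled (sxml xpath) module. The trimmed tree, the variable names, and
the path with an explicit class step are assumptions for illustration,
not the exact code from the crawler.

```scheme
(use-modules (sxml xpath))

;; Hand-trimmed copy of the shtml above (an assumption for illustration):
;; whitespace-only strings are dropped and only the header, its children,
;; and the p element that ended up outside it are kept.
(define example-shtml
  '(*TOP*
    (html
     (body
      (header (@ (class "exampleHeader"))
              (img (@ (id "bannerImage")))
              (div))
      (p (@ (id "labelName")) "A label for the header.")))))

;; Headers whose class attribute equals "exampleHeader".
(define select-header
  (sxpath '(// (header (@ class (equal? "exampleHeader"))))))

;; p elements nested anywhere inside such a header.
(define select-header-ps
  (sxpath '(// (header (@ class (equal? "exampleHeader"))) // p)))

;; p elements anywhere in the document.
(define select-all-ps
  (sxpath '(// p)))

(define headers (select-header example-shtml))
(define header-ps (select-header-ps example-shtml))
(define all-ps (select-all-ps example-shtml))

(write headers) (newline)
(write header-ps) (newline)
(write all-ps) (newline)
```

With this trimmed tree, the header-scoped query should find no p
elements, while the document-wide query should still find the one that
was moved out of the div.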

I'm wondering if I've missed something, or if others get this kind of
behaviour. The upshot of this is that, for the HTML above, it looks like
(equal? example-html (shtml->html (html->shtml example-html))) is false,
which isn't what I'd expect. Is there something funny that happens with
`p`?
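
The round-trip check can be sketched as below. It assumes htmlprag is
loadable as the (htmlprag) module, as packaged in guile-lib, and uses a
shortened fragment rather than the full page; the names fragment,
parsed and round-tripped are only for illustration.

```scheme
;; Assumption: htmlprag is installed and loadable as (htmlprag),
;; for example via guile-lib.
(use-modules (htmlprag))

;; Shortened fragment: a p element nested inside a div.
(define fragment
  "<div><p id=\"labelName\">A label for the header.</p></div>")

;; Parse to shtml, then serialise back to an HTML string.
(define parsed (html->shtml fragment))
(define round-tripped (shtml->html parsed))

;; Print both forms, then whether the round trip reproduced the input.
(write parsed) (newline)
(display round-tripped) (newline)
(display (equal? fragment round-tripped)) (newline)
```

Comparing the printed shtml with the input shows directly whether the p
element is still a child of the div after parsing.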

Thanks a lot,
Kenan


NB. In the sxml examples above, none of the strings are surrounded by
double quotes, but I think this is an artefact of how I'm writing them
to files for testing purposes. See the extract of the sxml below, taken
when I use ,pretty-print in Geiser:

    (div (@ (id "exampleDiv"))
                "\n"
                "      "
                (hr)
                "\n"
                "      "
                (div (@ (id "divMessage")) "An example message.")
                "\n"
                "    ")
