Re: patch: set id attribute for part in DocBook
From: Patrice Dumas
Subject: Re: patch: set id attribute for part in DocBook
Date: Tue, 11 Nov 2014 10:31:41 +0100
User-agent: Mutt/1.5.20 (2009-12-10)
On Mon, Nov 10, 2014 at 10:29:21AM -0800, Per Bothner wrote:
> I think these are 3 different questions:
>
> (1) Should DocBook output contain id attributes for @part commands?
>
> IMO, yes.
Agreed.
> (2) Should those id attributes be "mangled" using the same algorithm that
> id attributes from nodes are?
>
> I don't know of any reason why not.
Filenames and section ids are not mangled the same way as nodes, since they
cannot be targeted from other manuals, and section names need not be unique,
which implies that disambiguation is required anyway. If we add an id for
@part, I think I'll use the same scheme as for sections.
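To make the disambiguation point concrete, here is a minimal sketch of how
repeated section names can still yield unique ids. The function name
`unique_ids` is hypothetical, and attaching the counter as a `-n` suffix is
an assumption of this sketch, not a description of what makeinfo does.

```python
def unique_ids(base_ids):
    """Return one unique id per input, numbering repeats (hypothetical helper)."""
    seen = set()
    result = []
    for base in base_ids:
        candidate = base
        n = 0
        # Keep bumping the counter until the candidate has not been used yet.
        while candidate in seen:
            n += 1
            candidate = f"{base}-{n}"  # suffix placement is an assumption
        seen.add(candidate)
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(unique_ids(["Intro", "Intro", "Usage", "Intro"]))
```

The first occurrence keeps the plain id, so adding a later section with the
same name does not change the ids of earlier ones.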
> (3) Should we be less restrictive in what we allow in id attributes?
>
> I think that would be reasonable - though it might break compatibility.
>
> It seems wrong to mangle perfectly-reasonable non-ascii letters.
> In principle the id attribute can be any valid XML Name. If we mangle
> 'à' we're both losing information and making the output uglier. If we
We do not lose information: the mangling scheme is a bijection with a
Unicode string. I don't think there is code to demangle it, but in theory
we could write some. With transliteration, for section ids, even if it is
not a bijection, -1 .. -n is prepended to make the ids unique, so there is
no loss of information.
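As an illustration of why such a scheme loses nothing, here is a simplified
sketch of a reversible mangling built on `_XXXX` escapes. The names `mangle`
and `demangle` are hypothetical, and the handling of spaces (mapped to `-`)
and of code points above U+FFFF (`__` plus six hex digits) is an assumption
of this sketch; it is not claimed to match makeinfo exactly.

```python
import string

# Only plain ASCII letters and digits pass through unchanged.
ASCII_ALNUM = set(string.ascii_letters + string.digits)


def mangle(text):
    """Encode text into an ASCII-only identifier (simplified sketch)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in ASCII_ALNUM:
            out.append(ch)
        elif ch == " ":
            out.append("-")
        elif cp <= 0xFFFF:
            out.append("_%04x" % cp)   # also covers literal '_' and '-'
        else:
            out.append("__%06x" % cp)  # assumption: long form for big code points
    return "".join(out)


def demangle(mangled):
    """Invert mangle(); valid only for strings produced by mangle()."""
    out = []
    i = 0
    while i < len(mangled):
        ch = mangled[i]
        if ch == "-":
            out.append(" ")
            i += 1
        elif mangled.startswith("__", i):
            out.append(chr(int(mangled[i + 2:i + 8], 16)))
            i += 8
        elif ch == "_":
            out.append(chr(int(mangled[i + 1:i + 5], 16)))
            i += 5
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    name = "à propos"
    print(mangle(name), demangle(mangle(name)) == name)
```

Because a literal `_` or `-` in the input is itself escaped, every `_` in the
output starts an escape and every `-` stands for a space, which is what makes
the decoding unambiguous.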
> want to restrict filenames, that should be done by the DocBook processor
> (or a transformation stage between makeinfo and DocBook). However, all
> valid XML Names are valid filenames on modern desktop and server
> systems, so such mangling is not needed. Likewise, web servers and
> browsers can transparently mangle and demangle non-ascii URLs,
> so we join the 21st century, and not deal with it. (Maybe I'm
> being overly optimistic ...)
That's not my philosophy. Especially for computer-generated identifiers
like those, backward compatibility is more important to me than having
human-readable identifiers in URLs. That being said, we could have
different levels of compatibility/mangling.
> Possible exceptions: XML Names allow '.' and ':' - it might be reasonable
> to convert those to '_'.
Not to _, but to a unique value such as _XXXX, or something like that.
> My conclusion: The goal should be to generate the simplest and most
> minimal mangling to produce a valid and human-readable XML Name.
For HTML, we use id/name for cross-manual references. These must thus be
as independent of the document encoding as possible. Using ASCII only
seems to me the best bet in that regard (though not foolproof, since
non-ASCII-compatible encodings exist). Maybe instead of _XXXX we could
use entities. The URL could become more human readable this way when
interpreted by a browser, but it would be even less human readable when
read in an editor. But if we do that, and change the URL scheme, then we
break all the cross-manual references, including references to manuals
that were generated a long time ago, when the cross-manual specification
was designed.
As a side note, we already transliterate in file names, so we are not as
certain for file names as we are for name/id that the result will be
identical. It would be pretty rare to get something different, though,
as we depend on a Perl module that is quite stable, and I am not aware of
another translator to HTML that would not depend on that same Perl
module.
> I'm a big believer in "clean URLs". GNU should aim for that.
>
> The same logic would apply to html and xml output, FWIW.
Currently, the id and name in HTML are designed to comply with XHTML 4.??,
which is very restrictive. For XML output, we decide what is permitted
in the DTD, but I think that it makes sense to use the mangled form,
which is also the internal representation of nodes.
I searched a bit on the web; the restriction seems to be:
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
followed by any number of letters, digits ([0-9]), hyphens ("-"),
underscores ("_"), colons (":"), and periods (".").
I don't remember why we didn't use ":" and "."; maybe simply because I did
not know that we could, maybe for compatibility with something older. In
any case I don't think it would be that relevant to keep ":" and "." as is
while everything else is mangled.
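The restriction quoted above translates directly into a regular expression.
The following is a small sketch for checking a candidate id against it; the
names `ID_TOKEN` and `is_valid_id` are hypothetical and exist only for this
illustration.

```python
import re

# A letter first, then any number of letters, digits, '-', '_', ':' and '.'.
ID_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9_:.\-]*")


def is_valid_id(token):
    """Return True if token satisfies the quoted ID/NAME restriction."""
    return ID_TOKEN.fullmatch(token) is not None


if __name__ == "__main__":
    for candidate in ["Top", "a:b.c", "_00e0", "à"]:
        print(candidate, is_valid_id(candidate))
```

Under this rule an id starting with `_` or with a non-ASCII letter is
rejected, while `:` and `.` are accepted anywhere after the first character.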
--
Pat