[Gnu-arch-users] Re: File-type plug-in architecture for Arch?


From: michael josenhans
Subject: [Gnu-arch-users] Re: File-type plug-in architecture for Arch?
Date: Tue, 23 Dec 2003 00:16:53 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031016

Tom Lord wrote:
    > From: michael josenhans <address@hidden>

    >> Pick your favorite generic XML-diff/patch tool and tell me under what
    >> conditions it is
    >> (a) guaranteed to produce an XML document valid under that DTD when
    >>     applying diffs between meta.xml versions A and B to a meta.xml
    >>     version C.

    > That is what I would expect from such tools. I think some tools
    > I have seen do this already.

    > An application must be able to cope with any valid meta.xml
    > file.

Let's start with three XML documents, A, B and C.

All three documents conform to a particular DTD -- we could be
specific and say "The DTD for meta.xml files in OpenOffice documents".

We run something like:

        % xmldiff A B > A-B.xmldiff

        % xmlpatch A-B.xmldiff C > D

Is D _guaranteed_ to be valid according to the DTD?

xml-diff
--------
I would expect any diff tool to complain if it is asked to diff two XML files that reference different DTDs.

<!DOCTYPE office:document-content PUBLIC "-//OpenOffice.org//DTD OfficeDocument1.0//EN" "office.dtd">

Usually, as in HTML for example, the DOCTYPE contains the URL where the DTD can be found on the Internet. Alternatively, it could point to a file name in the same directory.

You point out a real flaw in the OO format: the DTD is neither available at a public URL nor included in the OO file's zip archive.

This caused two of the diff tools I tried to complain. 'diffxml' refused to diff until I had copied the DTD file from the OO program directory into the same directory as the files to be diffed.

The 'deltaxml' web-based diff service could not accept an uploaded local DTD and asked for the DTD to be placed at a public URL. As I was not able to do that, I had to remove the '<!DOCTYPE ..>' declaration from the XML files.

Thus:
- A copy of the DTD belongs in the branch archive.
- When patches are added, the DTDs need to be diffed as well.
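
The DOCTYPE check I would expect from a diff tool can be sketched in a few lines of Python. This is only an illustration, not what 'diffxml' or 'deltaxml' actually do. The function names and the example documents are made up, and the standard-library parser used here only reads the DOCTYPE declaration; it does not load or compare the DTD files themselves.

```python
from xml.dom import minidom


def doctype_key(xml_text):
    """Return (name, publicId, systemId) of the DOCTYPE, or None if absent."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None:
        return None
    return (doctype.name, doctype.publicId, doctype.systemId)


def same_doctype(xml_a, xml_b):
    """True if both documents declare the same DOCTYPE (or both declare none)."""
    return doctype_key(xml_a) == doctype_key(xml_b)


# Hypothetical documents: A and B share a DTD, C references another one.
DOC_A = '<!DOCTYPE doc PUBLIC "-//Example//DTD Doc 1.0//EN" "doc.dtd"><doc/>'
DOC_B = '<!DOCTYPE doc PUBLIC "-//Example//DTD Doc 1.0//EN" "doc.dtd"><doc><p/></doc>'
DOC_C = '<!DOCTYPE doc PUBLIC "-//Example//DTD Doc 2.0//EN" "doc2.dtd"><doc/>'

print(same_doctype(DOC_A, DOC_B))  # True: a diff tool may proceed
print(same_doctype(DOC_A, DOC_C))  # False: a diff tool should complain
```

A real tool would additionally have to diff the DTD files, since two versions of a DTD can hide behind the same public identifier.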

Useful additional file-specific information not found in the DTD:

Ordering and sorting information from which an xml-diff tool could benefit.

Depending on the ordering type of the node,

1)
<blub>
        <bla> eins </bla>
        <bla> zwei </bla>
</blub>
and

2)
<blub>
        <bla> zwei </bla>
        <bla> eins </bla>
</blub>

might be identical or not. In many cases, the order does not matter. In those cases the diff of the two documents above should be empty.

If however the order matters, the diff must not be empty and must be able to reflect the change. Note that a traditional, line-based patch tool would never gain this capability.
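
To make the two cases concrete, here is a rough Python sketch of an order-aware comparison. It assumes that the ordering information is handed to the tool as a set of tag names whose children are unordered. That set, and the names tree_key and xml_equal, are my own invention for illustration. Text following a child element is ignored to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def tree_key(elem, unordered=frozenset()):
    """Build a comparable key for an element and its subtree.

    Children of tags listed in `unordered` are sorted, so their order
    does not influence the key. Tail text is ignored in this sketch.
    """
    children = [tree_key(child, unordered) for child in elem]
    if elem.tag in unordered:
        children.sort()
    text = (elem.text or "").strip()
    return (elem.tag, tuple(sorted(elem.attrib.items())), text, tuple(children))


def xml_equal(xml_a, xml_b, unordered=frozenset()):
    """True if both documents have the same key, i.e. an empty diff."""
    key_a = tree_key(ET.fromstring(xml_a), unordered)
    key_b = tree_key(ET.fromstring(xml_b), unordered)
    return key_a == key_b


DOC_1 = "<blub><bla> eins </bla><bla> zwei </bla></blub>"
DOC_2 = "<blub><bla> zwei </bla><bla> eins </bla></blub>"

print(xml_equal(DOC_1, DOC_2))                      # False: order matters
print(xml_equal(DOC_1, DOC_2, unordered={"blub"}))  # True: order is irrelevant
```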


xml-patch
---------
When applying patches, xml-patch tools should use the DTD in order to generate a valid XML document.

Example:

file a)

<color-list>
        <color-item color='blue' />
</color-list>

file b)

<color-list>
        <color-item color='red' />
</color-list>


If the DTD indicates that the color-list may either be empty or contain exactly one 'color-item' child, patch should report a merge conflict and ask the user to resolve it.

If the DTD allows more than one color-item in the list, patch could benefit from additional ordering information.

If the order is irrelevant, patch would choose one of the results below by itself. It would not report a merge conflict.

E.g.

<color-list>
        <color-item color='blue' />
        <color-item color='red' />
</color-list>

or

<color-list>
        <color-item color='red' />
        <color-item color='blue' />
</color-list>

If the order is relevant, the user would be asked to resolve the conflict with xml-diff3 or an equivalent tool.
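
The decision rules above could look roughly like this in Python. This is a sketch under strong assumptions: the cardinality limit (max_items) and the ordering flag (ordered) are passed in by hand instead of being read from a DTD, the items are plain strings, and merge_items and MergeConflict are hypothetical names, not part of any existing xml-patch tool.

```python
class MergeConflict(Exception):
    """Raised when the two sides cannot be merged automatically."""


def merge_items(ours, theirs, max_items=None, ordered=True):
    """Merge the child items contributed by two sides of a merge.

    max_items: upper bound the content model places on the children
               (None means unbounded).
    ordered:   whether the order of the children is significant.
    """
    only_ours = [item for item in ours if item not in theirs]
    only_theirs = [item for item in theirs if item not in ours]

    # One side already contains the other: take it as is.
    merged = list(theirs) if not only_ours else list(ours) + only_theirs

    if max_items is not None and len(merged) > max_items:
        raise MergeConflict("content model allows at most %d item(s)" % max_items)
    if ordered and only_ours and only_theirs:
        raise MergeConflict("child order is significant; ask the user")
    return merged


# DTD allows several items and the order is irrelevant: merge silently.
print(merge_items(["blue"], ["red"], ordered=False))  # ['blue', 'red']

# DTD allows at most one item: this has to be a conflict.
try:
    merge_items(["blue"], ["red"], max_items=1)
except MergeConflict as conflict:
    print("conflict:", conflict)
```

In the unordered case the sketch simply appends the other side's items; returning them in the opposite order would be equally correct.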

Summary:

Any good XML patch tool should take the DTD into account when merging. If it does not, you will have to do more manual merging than necessary.


Note:
XML validator tools are used to verify whether a document conforms to its DTD.

Why does this matter?  Because applications that process OpenOffice
files _as_ OpenOffice files are only required to read meta.xml files
that _do_ pass the DTD.  Even if they are tolerant of meta.xml files
that do not pass the DTD, it is impossible that they can do anything
generally useful with them (for a non-programmer or programmer who is
not expert in the meta.xml DTD).

It is _essentially_impossible_ that you have seen generic XML
diff/patch tools which guarantee that D will conform to the DTD (with
the exception of a small subset of possible DTD's) and simultaneously
that they actually provide diff/patch functionality.  There is not
enough information present in the DTD, A, B, and C to allow such tools
to work in general.

There are only two ways such a tool can work in this case:

(1) if it knows _more_ about the meta.xml format than is present
    in the files or the DTD --- if it is not an XML-diff/patch but
    is instead a meta.xml-diff/patch.  This is in fact what I propose
    that people interested in these matters set about to build -- and
    my advice is not to leap to the conclusion that it is a simple hack
    on top of a generic XML-diff/patch tool.

As said above, the DTD of e.g. meta.xml should be named in the 'DOCTYPE' declaration of the meta.xml file. Usually the DTD would then be called 'meta.dtd'. With OO this is not very human-readable, as they make excessive use of XML namespaces.

XML-diff and XML-patch would take meta.dtd into account.

(2) If the generic XML-diff/patch algorithm is standard, well
    described, and easy to reason about mathematically -- and the
    designers of the meta.xml format thought about it carefully while
    designing their DTD.

The irony here is that the meta.xml format itself is sufficiently
simple that quite possibly, simply by coincidence, the outcome is the
same as if (2) had taken place for some J. Random XML-diff/patch tool.
While that _might_ be true, it is unlikely that the coincidence also
applies to other OpenOffice document components such as style.xml and
content.xml.

Finally, and _again_, mere validity of the output is probably not a
useful level of functionality.   More likely, we'd want the diff/patch
tool to introduce _entirely_new_ markup to explain to a user what it
has done.   But you at least agree on that point:


    >> (c) guaranteed to produce a meta.xml output which is not only valid
    >>     but useful

    > Definitely not.


Regarding:

    >> (d) guaranteed to produce byte-wise identical-to-B output when
    >>     applying the diff between versions A and B to A.

    > It will generate an XML-equivalent file.

In other words, the guarantee is not provided.   This is actually a
fairly serious drawback.


    > xml-diff(patch( A, xml-diff(A,B)), B) = empty.

Yes, but (for example):

        % xml-diff A B | patch A | md5sum > ,x
        % md5sum < B > ,y
        % cmp ,x ,y || echo bzzzzt -- this blows
        bzzzzt -- this blows

Good point. Checksums are needed for trust and confidence.

We have moved here from the 'text-diff' space to the 'xml-diff' space.

What you need here for XML files is:

xml-md5sum

The result of 'xml-md5sum' would be invariant under any differences that are irrelevant to XML, and possibly also under reordering of unordered nodes. It would likely be some kind of md5 sum over the content of all nodes.

I cannot yet judge how difficult it would be to build such a tool. I did not find one on the web.
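
Just to show the idea, a very rough 'xml-md5sum' could look like the following Python sketch. It is not an existing tool. It hashes tag, attributes, stripped text and the digests of all children; which tags have unordered children must be supplied by the caller (the 'unordered' parameter is my assumption, as are the function names). Comments, processing instructions, text following a child element and namespace prefix details are simply ignored here.

```python
import hashlib
import xml.etree.ElementTree as ET


def element_digest(elem, unordered=frozenset()):
    """Return an MD5 hex digest over tag, attributes, text and children."""
    child_digests = [element_digest(child, unordered) for child in elem]
    if elem.tag in unordered:
        child_digests.sort()  # child order is declared irrelevant for this tag

    parts = [elem.tag, (elem.text or "").strip(), str(len(elem.attrib))]
    parts += ["%s=%s" % item for item in sorted(elem.attrib.items())]
    parts += child_digests

    md5 = hashlib.md5()
    for part in parts:
        data = part.encode("utf-8")
        md5.update(b"%d:" % len(data) + data)  # length prefix separates fields
    return md5.hexdigest()


def xml_md5sum(xml_text, unordered=frozenset()):
    """Checksum of the XML tree rather than of the byte stream."""
    return element_digest(ET.fromstring(xml_text), unordered)


# Same tree, different byte streams (quotes, whitespace, attribute order).
DOC_X = '<color-list><color-item color="blue" id="1"/></color-list>'
DOC_Y = "<color-list>\n  <color-item id='1' color='blue' />\n</color-list>"

byte_sums_equal = (hashlib.md5(DOC_X.encode("utf-8")).hexdigest()
                   == hashlib.md5(DOC_Y.encode("utf-8")).hexdigest())
print(byte_sums_equal)                         # False: plain md5sum differs
print(xml_md5sum(DOC_X) == xml_md5sum(DOC_Y))  # True: the tree checksum agrees
```

Stripping the text is itself a decision about what is "irrelevant to XML"; for formats where whitespace is significant the sketch would give wrong answers.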


Except for the false one regarding (a).  And the statement regarding
(d) is not exactly an endorsement of the approach of using a generic
XML-diff/patch.

    >> And at _second_best_ you are backtracking to say "well, we won't use
    >> the XML-diff/patch tools for _that_" but then why would we bother with
    >> them at all when ordinary diff or xdelta would do just as well for the
    >> more restricted purposes?

    > I do not see how traditional patch tools help with XML files,
    > especially those edited with various dedicated XML or other
    > dedicated editors.

You're missing context.  In this "second best" scenario, the only use
of the diff tool is to reduce the size of changesets and more generic
tools, which are not XML-specific, can do that roughly as well.

    > Likely every SVG editor will save an SVG file with a different layout
    > and representation.

Yes, and since these files are _both_ XML documents _and_ bytestreams,
it is important not to gloss over those differences.

Consider, for example, the new complexity you are proposing to create
regarding the otherwise simple problem "Are these two filesystem trees
identical in content?"

See above.

Short summary: a checksum tool for XML trees is needed.

Michael





