guix-devel

Re: Preservation of Guix (PoG) report 2023-03-13


From: Timothy Sample
Subject: Re: Preservation of Guix (PoG) report 2023-03-13
Date: Sat, 18 Mar 2023 14:35:40 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)

Hey,

Simon Tournier <zimon.toutoune@gmail.com> writes:

> Well, I do not remember if you consider also the ’origin’
> (fixed-outputs) as ’inputs’ or ’patches’.  Do you?

I’m quite confident I’m getting everything.  I’ll describe my approach,
because I’m happy with it.  :)

The Guix package graph exists twice, essentially.  There’s the
high-level representation made up of packages, origins, gexps, etc.
Then, there is the low-level representation which is just derivations.
The high-level representation has nice metadata and makes sense to
humans, while the low-level representation is easy to traverse.

AFAICT, there’s no generic way to traverse the high-level
representation.  Every lowerable object has complete control over how it
references other lowerable objects, and is not obliged to provide any means
of listing those references.  That is, there’s no ‘lowerable-inputs’
procedure or anything like that.  (We have ‘bag-node-edges’ in ‘(guix
scripts graph)’, but it doesn’t cover everything.)

What I do for the report is traverse (as best I can) the high-level
representation and construct a map from derivations to origin objects.
Then, I traverse the low-level representation to find all the
fixed-output derivations.  Finally, I use the map to look up origin
objects for each fixed-output derivation.  If I miss an origin object,
the fixed-output derivation still gets recorded.  It will show up in the
report as “unknown” until I investigate why it’s missing and correct it.
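
In rough pseudo-Python, the bookkeeping amounts to this (the names here
are illustrative, not Guix APIs; the real code walks actual derivation
and origin objects):

```python
def classify_sources(origin_map, fixed_output_drvs):
    """Map each fixed-output derivation to its origin, or "unknown".

    origin_map        -- dict: derivation path -> origin metadata,
                         built by walking the high-level representation
    fixed_output_drvs -- iterable of derivation paths found by walking
                         the low-level (derivation) graph
    """
    report = {}
    for drv in fixed_output_drvs:
        # A miss is still recorded, so nothing silently disappears; it
        # shows up in the report as "unknown" until investigated.
        report[drv] = origin_map.get(drv, "unknown")
    return report
```

The key property is that the low-level traversal is authoritative: an
origin the high-level walk failed to find can never cause a fixed-output
derivation to be dropped from the report.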

There are currently 56 (out of 54K) fixed-output derivations that are
missing metadata in my database.  A fair few of them have to do with
Telegram, Thunderbird, and UBlock Origin.  All it means is that those
packages have sneaky ways of referencing origins that my code can’t
handle.  It’s harmless and easy to fix as time permits.

>> Over the whole set, 77.1% are known to be safely tucked away in the
>> Software Heritage archive.  But it’s actually much better than that.  If
>> we only look at the most recent sampled commit (from Sunday the 5th),
>> that number becomes 87.4%, which is starting to look pretty good!
>
> Just to point out: the new nixguix loader [1] is still in SWH staging
> and not yet deployed, IIRC.  It will not change the coverage much on
> our side, but it should fix some corner cases.
>
> 1: <https://gitlab.softwareheritage.org/swh/meta/-/issues/4662>

Good to know!

>>      This is kinda like an automated version of Simon’s recent
>> investigation.
>
> Neat!  Note that I also wanted to check the SWH capacity for cooking,
> not only checking the endpoints.  For instance, it allowed me to
> discover a mismatch due to unhandled CR/LF normalization, now fixed
> with:
> 58f20fa8181bdcd4269671e1d3cef1268947af3a.

Maybe we need a “chaos monkey mode” for Guix.  It could randomly select
packages to build, randomly pick source code fallback methods, and also
test reproducibility (like “--check”).  You could have a blocklist for
browsers, etc., but otherwise it could pick the odd package to test
thoroughly.  Those of us with the time and inclination could crank up
that knob and get interesting feedback about reproducibility at the cost
of doing a few package builds here and there.
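
Sketched in shell (the selection policy and blocklist are my own
invention; `guix package -A` and `guix build --check --fallback` are
real subcommands):

```shell
# Hypothetical "chaos monkey" sketch: pick a random package, skipping a
# blocklist of heavyweight builds, and rebuild it with reproducibility
# checking.

BLOCKLIST='^(icecat|ungoogled-chromium)$'

pick_random_package() {
    # List all available packages, keep the name column, drop blocked
    # ones, and pick one at random.
    guix package -A | cut -f1 | grep -Ev "$BLOCKLIST" | shuf -n 1
}

chaos_build() {
    pkg=$(pick_random_package)
    echo "chaos monkey picked: $pkg"
    # Rebuild locally and compare against the existing store item
    # ("--check"), falling back to building from source when substitutes
    # are unavailable ("--fallback").
    guix build --check --fallback "$pkg"
}
```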

>> Here’s a rough road map for that based on a glance at the script’s
>> output:
>>
>>     • Subversion support (for TeX-based documentation stuff, I guess)
>
> For the interested reader, details for helping in the implementation:
>
>     https://issues.guix.gnu.org/issue/43442#9
>     https://issues.guix.gnu.org/issue/43442#11

Fantastic.  That looks very promising!

> However, it would ease all the dance if SWH would consider to store and
> expose NAR hashes on their side.  As discussed here:
>
>     https://gitlab.softwareheritage.org/swh/meta/-/issues/4538

This would be nice, yes.

>>              However, 42% of them are old Bioconductor packages.  They
>> seem to be lost.  It looks like Bioconductor now stores multiple package
>> versions per Bioconductor version [2], but before version 3.15 that was
>> not the case.  As an example, take “ggcyto” from Bioconductor 3.10 [3].
>> We packaged version 1.14.0, and then at some point Bioconductor 3.10
>> switched to version 1.14.1.  We packaged that, too, but now 1.14.0 is
>> gone.
>
> Well, I have not investigated much because it is between December 2019
> and March 2020, so “guix time-machine” is not smooth for commits that
> old.
>
> First question: do we have the source tarball in Berlin or Bordeaux or
> somewhere else?  If yes, there is hope. :-) Otherwise, it is probably
> gone forever.

Like I wrote, I picked up a handful from Bordeaux, but not many.

> The hope is: https://git.bioconductor.org/packages/ggcyto
>
> If we have the tarball with the correct checksum from commit
> f5f440312d848e12463f0c6f7510a86b623a9e27
>
> +    (version "1.14.0")
> +    (source
> +     (origin
> +       (method url-fetch)
> +       (uri (bioconductor-uri "ggcyto" version))
> +       (sha256
> +        (base32
> +         "165qszvy5z176h1l3dnjb5dcm279b6bjl5n5gzz8wfn4xpn8anc8"))))
>
> then we can disassemble it and, using the Git repository, try to
> reassemble the content from SWH with the metadata from the Disarchive
> DB.

I played around with this approach a bit, but it’s extremely tedious,
and I’m not hopeful it will work.  Even if it does, it will be hard to
automate.  I never fully tested the idea, just decided the effort was
too high for such a low probability of success.  I’m putting these in
the “low priority” bin for now.
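
For the record, the dance would look roughly like this.  Disarchive's
`disassemble` and `assemble` subcommands are real, but the exact
argument shapes below are assumptions, and the recovery step is the
untested part:

```shell
# Sketch only: reassemble a lost tarball from upstream content plus
# Disarchive metadata.  The spec would come from a disassembly done
# while we still had the tarball (or from the Disarchive DB).

recover_tarball() {
    spec="$1"   # Disarchive metadata for the lost tarball
    repo="$2"   # upstream Git repository holding the raw file content
    out="$3"    # where to write the reassembled tarball

    # Fetch the content (in practice this could also be an SWH vault
    # cooking), then reassemble and hope the checksum matches.
    git clone "$repo" content &&
    disarchive assemble "$spec" content > "$out"
}

# recover_tarball ggcyto-1.14.0.dis \
#     https://git.bioconductor.org/packages/ggcyto \
#     ggcyto_1.14.0.tar.gz
```

The tedium is in the middle step: lining up what is in the Git
repository with the exact file layout the metadata expects.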


-- Tim


