guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Preservation of Guix (PoG) report 2023-03-13


From: Timothy Sample
Subject: Preservation of Guix (PoG) report 2023-03-13
Date: Mon, 13 Mar 2023 19:37:23 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)

Hi Guix,

It’s been a while!  :)

Allow me to present to you a long-overdue update to the Preservation of
Guix (PoG) report: <https://ngyro.com/pog-reports/2023-03-13/>.  🎉

Note that you can link to the most recent version of the report using
<https://ngyro.com/pog-reports/latest/>.

What is this?  Well, I added a description to the report itself, but
here’s a brief teaser.  The PoG report shows what we know about the
archival status of the approximately 54K sources (and counting) Guix has
linked to since around the time of the 1.0 release.

For this edition, I took a bit of time to fix the contrast and colours
to be a bit more accessible.  They’re about half as garish as they used
to be, too.

Over the whole set, 77.1% are known to be safely tucked away in the
Software Heritage archive.  But it’s actually much better than that.  If
we only look at the most recent sampled commit (from Sunday the 5th),
that number becomes 87.4%, which is starting to look pretty good!

I have a few more notes on the report, but I want to put this near the
top of the message so that people will see it.  :)  I wrote a script
(see attached) that uses the PoG database to find missing sources on a
packge-by-package basis.  That is, you can run

    guix repl specification-to-swhids.scm pog.db bash

and it will print a table of all of the transitive sources needed to
build Bash, along with their preservation status.  Here’s a (heavily
edited and snipped to fit an email message) sample of its output:

[... many “stored” inputs]
sha256 0r5p. swh:1:dir:02f7. stored  /gnu/store/.-gmp-6.0.0a.tar.xz
sha256 0c3k. swh:1:dir:6027. stored  /gnu/store/.-mescc-....tar.xz
sha256 1r1z. swh:1:dir:6087. stored  /gnu/store/.-bash-2.05b.tar.gz
sha256 14l0. unknown         unknown /gnu/store/.-gcc-4.9.4.tar.bz2
sha256 0m2y. unknown         unknown /gnu/store/.-ed-1.17.tar.lz
[... more “unknown” inputs]

(I had to pipe the output to “sort -k 4” to have it sorted by status.)

The first two columns are the Guix hash.  The next two columns are the
SWHID (if known) and whether SWH has it (if known).  That last column is
the store filename (which is nice because it usually tells you what it
is we are looking at).  In this sample, you can see that GMP, MesCC
Tools, and Bash are all safe.  However, we don’t know about GCC 4 and
ed.  This is kinda like an automated version of Simon’s recent
investigation [1].  The “unknown” two are due to Disarchive’s lack of
support for those compression formats.  I just wrote this script today
(mind the rough edges), and I’ve learned a lot from trying it on a few
packages.  It’s a little like a terrifying robotic TODO list, since it
shows a lot of problems, but it’s also exiting because solving all the
problems for the Guix package, say, would be a massive leap forward.
Here’s a rough road map for that based on a glance at the script’s
output:

    • Subversion support (for TeX-based documentation stuff, I guess)
    • bzip2 support for Disarchive (there are 45 bzip2 tarballs)
    • ZIP support for Disarchive (for the 8 ZIP files)
    • lzip support for Disarchive (or a workaround for ed)
    • Fix some issues (gettext is .tar.gz, but something went wrong)
    • Do something with the static bootstrap binaries

[1] https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html

If you want to try it out for yourself, you’ll need to download the
database <https://ngyro.com/pog-reports/2023-03-13/pog.db>.  Heads up:
it’s just over 200M, and my server can be pretty slow.

One other stray thought: the script should work with the time machine,
so you can check on packages from the past.  I didn’t test it, but I bet
it’s fine.

Okay.  Here are the rest of my notes about the report itself.

One thing that jumps out at me is 189 Git sources that SWH does not
have.  Usually they have basically all of the non-recursive Git sources.
It’s something to look into.

I also took a quick peek at the 1.9K “unknown” tar-gz sources.  About
39% percent of them are old Rust crates.  It’s a known problem with
Disarchive.  However, 42% of them are old Bioconductor packages.  They
seem to be lost.  It looks like Bioconductor now stores multiple package
versions per Bioconductor version [2], but before version 3.15 that was
not the case.  As an example, take “ggcyto” from Bioconductor 3.10 [3].
We packaged version 1.14.0, and then at some point Bioconductor 3.10
switched to version 1.14.1.  We packaged that, too, but now 1.14.0 is
gone.  I know it’s been discussed before, but I can’t remember what the
conclusion was.  Are these just gone forever?  I’m doing another pass
through all of them and recovering a few from the bordeaux substitute
server, but only a handful.

[2] https://bioconductor.org/packages/3.15/bioc/src/contrib/Archive/DiffBind/
[3] https://bioconductor.org/packages/3.10/bioc/html/ggcyto.html

That’s all for now.  Enjoy the update and the script!


-- Tim

Attachment: specification-to-swhids.scm
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]