Re: Conda environments and reproducibility
From: Simon Tournier
Subject: Re: Conda environments and reproducibility
Date: Fri, 02 Dec 2022 14:59:42 +0100
Hi,
On Fri, 02 Dec 2022 at 12:05, Ludovic Courtès <ludo@gnu.org> wrote:
> Hugo Buddelmeijer <hugo@buddelmeijer.nl> skribis:
>
>> That is, "conda env export" should contain entries like
>> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
>> dependencies 'that matter', like which compiler is used. What goes into the
>> hash seems rather complicated, and grows over time.
>
> I think one source of many problems here is to think that there are
> dependencies that do not matter. Another one, which those hashes appear
> to address, is to think that a name/version pair is enough to
> unambiguously designate a software artifact.
>
> This hash is a hash of the build result, not a hash of the input, is
> that correct?
Well, the official Conda documentation explains this rather well, IMHO.
For instance,
https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html#matchspec-vs-packagerecord
From my understanding, if you go via MatchSpec then the SAT solver is
invoked. The SAT solver tries to satisfy all the constraints, and the
solution depends on the state of the index (the upstream repository).
Apart from the fact that the SAT solver can take very long, and can even
fail if the constraints are too hard, there is no guarantee that it will
find the exact same combination of packages to install. Having an
equality (numpy=1.23) or something else does not really change this
point.
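To illustrate the point about the index state, here is a minimal sketch
in Python. It is a toy resolver, not Conda's actual solver: the Record
class, the resolve() function and the build strings are all hypothetical
names made up for the illustration. It only assumes that a spec such as
numpy=1.23 matches any 1.23.x release, so the answer changes as soon as
the index gains a newer matching release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a concrete package in the index."""
    name: str
    version: tuple  # e.g. (1, 23, 4)
    build: str      # made-up build string


def resolve(spec, index):
    """Toy resolver: newest record in `index` matching 'name=X.Y'."""
    name, _, wanted = spec.partition("=")
    prefix = tuple(int(p) for p in wanted.split(".")) if wanted else ()
    candidates = [
        r for r in index
        if r.name == name and r.version[:len(prefix)] == prefix
    ]
    if not candidates:
        raise LookupError(f"no package satisfies {spec!r}")
    return max(candidates, key=lambda r: r.version)


# The same spec, resolved against two states of the index.
index_then = [
    Record("numpy", (1, 22, 0), "py39_0"),
    Record("numpy", (1, 23, 4), "py39_0"),
]
index_now = index_then + [Record("numpy", (1, 23, 5), "py39_0")]

print(resolve("numpy=1.23", index_then).version)  # (1, 23, 4)
print(resolve("numpy=1.23", index_now).version)   # (1, 23, 5)
```

The spec is identical in both calls; only the index differs, and so does
the result.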
Conda offers the option to be “explicit”. In that case, the solver
is not invoked. In a sense, it is a way to deal directly with
PackageRecord. However, the Conda documentation has these warnings:
  * Explicit package installs
    Since the solver is not involved, the dependencies of the
    explicit package(s) are not processed at all. This can leave the
    environment in an inconsistent state, which can be fixed by
    running conda update --all, for example.

  * Cloning an environment
    It essentially takes the source environment, generates the URLs
    for each installed packages (filtering conda, conda-env and
    their dependencies) and passes the list of URLs to
    explicit(). If the source tarballs are not in the cache anymore,
    it will query the index for the best possible match for the
    current channels. As such, there’s a slim chance that the copy
    is not exactly a clone of the original environment.
https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html#early-exit-tasks
Therefore, the official Conda documentation itself explains that there
is no guarantee of reproducing an environment exactly.
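One way to check whether a copy really is a clone is to compare the
name=version=build entries of both environments, in the form quoted
above for "conda env export". The following Python sketch does that. The
helper names parse_entry() and diff_envs() are hypothetical, and the
numpy build strings are made up; only the scipy entry comes from this
thread.

```python
def parse_entry(entry):
    """Split 'name=version=build' into (name, (version, build))."""
    name, version, build = entry.split("=")
    return name, (version, build)


def diff_envs(original, copy):
    """Return the packages whose (version, build) differ or are missing."""
    a = dict(parse_entry(e) for e in original)
    b = dict(parse_entry(e) for e in copy)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


original = ["scipy=1.8.0=py39hee8e79c_1", "numpy=1.22.3=py39h0000000_0"]
copy = ["scipy=1.8.0=py39hee8e79c_1", "numpy=1.22.4=py39h1111111_0"]

print(diff_envs(original, copy))
```

An empty result means both lists name the same builds; anything else
shows where the copy drifted from the original.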
> I think it would be great to have a blog post that walks through
> shortcomings and concrete issues one may encounter when trying to
> reproduce a software environment with Conda, contrasting it with how
> Guix does thing. This would probably make more sense for people who use
> Conda everyday than a high-level overview of Guix.
From my understanding, the main issue is that Conda works perfectly well
within a short temporal window (2-3 months, say!). In this range,
people can often reproduce an environment. It becomes more complicated
outside this range, which makes the problem hard to demo. :-)
For sure, a blog post by people fluent in both Conda and Guix
would be very welcome. Apart from the discussion about reproducibility,
even just a Rosetta Stone comparing how to do the same things using
Conda vs Guix would help. It would smooth the migration and at least
encourage people to give Guix a try. :-)
Cheers,
simon