guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Concerns/questions around Software Heritage Archive


From: Tomas Volf
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Sat, 16 Mar 2024 20:49:44 +0100

On 2024-03-16 12:06:27 -0700, Ian Eure wrote:
>
> Christopher Baines <mail@cbaines.net> writes:
>
> > [[PGP Signed Part:Undecided]]
> >
> > Ian Eure <ian@retrospec.tv> writes:
> >
> > > Hi Guixy people,
> > >
> > > I’d never heard of SWH before I started hacking on Guix last fall,
> > > and
> > > it struck me as rather a good idea.  However, I’ve seen some things
> > > lately which have soured me on them.
> > >
> > > They appear to be using the archive to build LLMs:
> > > https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
> > >
> > > I was also distressed to see how poorly they treated a developer who
> > > wished to update their name:
> > > https://cohost.org/arborelia/post/4968198-the-software-heritag
> > > https://cohost.org/arborelia/post/5052044-the-software-heritag
> > >
> > > GPL’d software I’ve created has been packaged for Guix, which I
> > > assume
> > > means it’s been included in SWH.  While I’m dealing with their (IMO:
> > > unethical) opt-out process, I likely also need to stop new copies
> > > from
> > > being uploaded again in the future.
> > >
> > > Is there a way to indicate, in a Guix package, that it should
> > > *never*
> > > be included in SWH?
> >
> > Not currently, and I don't really see the point in such a mechanism. If
> > you really never want them to store your code, then you need to license
> > it accordingly (and not make it free software).
> >
>
> I don’t want my code in SWH *because* it’s free.  A primary use of LLMs is
> laundering freely licensed software into proprietary, commercial projects
> through "AI" code completion and generation. Any Free software in an LLM
> training set can and will be used in violation of its license, without a
> clear path for the author to seek recourse.  I deleted my code off Github
> and abandoned it completely for this exact reason, and am deeply irked to be
> going through this nonsense again.
>
> A more salient question may be: Is there a process within Guix (either the
> program or the organization) which uploads source to SWH?  Or does it rely
> on SWH indepently?

`guix lint PKG-NAME' schedules SWH archival if possible.  No code is directly
uploaded (at least currently), so assuming you have a IP list of SWH, it should
be possible to block it.  At least AFAIK.

If you have the list, or know how to get it, could you share it?  I would be
interesting in blocking it as well from my git hosting.

>
> If the latter, my problem is likely solved by blocking SWH at my network
> edge and opting out of their archive (or trying to) and the downstream
> training models they’ve already put it in.  If the former, the only control
> I currently have to protect my license is removing packages from Guix which
> contain it.  I don’t want that outcome.
>
> Noting also that the path here seems to be SWH->huggingface->bigcode
> training set, and the opt-out process for the training set appears to be a
> complete sham.  To opt-out, you must create a Github Issue; only one opt-out
> has *ever* been processed, and there are 200+ sitting there, many with no
> response for nearly a year[1].  I want no part of any of this.
>
>
> > > Is there a way to tell Guix to never download source from SWH?
> >
> > Also no, and it's probably best to do this at the network level on your
> > systems/network if you want this to be the case.
> >
>
> I’ll investigate this, though I’d prefer if there was a way to configure
> source mirrors in the Guix daemon.
>
>
> > Skipping back to this though:
> >
> > > I was also distressed to see how poorly they treated a developer who
> > > wished to update their name:
> > > https://cohost.org/arborelia/post/4968198-the-software-heritag
> > > https://cohost.org/arborelia/post/5052044-the-software-heritag
> >
> > This is probably worth thinking about as Guix is in a similar situation
> > regarding publishing source code, and people potentially wanting to
> > change historical source code both in things Guix packages and Guix
> > itself.
> >
> > Like Software Heritage, there's cryptographical implications for
> > rewriting the Git history and modifying source tarballs or nars that
> > contain source code.
> >
> > We have 17TiB of compressed source code and built software stored for
> > bordeaux.guix.gnu.org now and we should probably work out how to handle
> > people asking for things to be removed or changed (for any and all
> > reasons).
> >
> > It's probably worth working out our position on this in advance of
> > someone asking.
> >
>
> Yes, I agree that Guix needs a better solution for this.
>
> Thanks,
>
>  — Ian
>
> [1]: https://github.com/bigcode-project/opt-out-v2/issues
>

T.

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]