Re: Architecture to reduce download time when pulling multiple packages
From: Josh Marshall
Subject: Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
Date: Sun, 15 Oct 2023 14:21:59 -0400
So it sounds like my first steps are to re-implement the downloads
using aria2c. This would affect the minimum base package, no? Can I
get some buy-in from maintainers that such changes are acceptable?
On Fri, Oct 13, 2023 at 2:06 PM James R. Haigh (+ML.GNU.Guix
subaddress) <JRHaigh+ML.GNU.Guix@runbox.com> wrote:
>
> Hi Josh,
>
> At Z-0400=2023-10-13Fri12:36:01, Josh Marshall sent:
> > This is to parallelize connections, which should never hurt downloading
> > but can help. Mirroring parallelizes the providing of packages; what I
> > want to implement is to parallelize the obtaining of packages. Server
> > side vs client side.
>
> Please, if you are going to do something like this, use a torrent
> architecture like BitTorrent or GNUnet. I suggest Aria2c as a very good
> CLI download backend that can be daemonised and sent instructions over a
> socket to add, pause, remove downloads, etc. It supports magnet URLs,
> which can also point at the existing nontorrent servers (via ‘as’
> parameters, iirc.).
>
> I actually implemented this in a local copy of APT Daemon many years
> ago (circa 2011), but the change was not accepted upstream to Launchpad
> (because I was not on bleeding-edge; I was too slow to keep up with the
> upstream development). My fork got forgotten about, because to get the full
> benefit the server would have had to have added a BitTorrent Info Hash (BTIH)
> to the metadata of each package, along with the MD5, SHA-256, etc. that it
> already did (not a big ask, really). That said, without the full benefit of
> having the metadata, it did provide immediate benefit and I used it for many
> years, not upgrading my Ubuntu 11.04 Natty Narwhal that I was using back then
> until I really had to.
>
> The immediate benefit that it provided was exactly as you described:
> It allowed parallelisation of nontorrent downloads, be it from the same
> server or from multiple mirrors. Iirc., I achieved this by simply passing
> the download list to Aria2c in daemon mode. I think I also converted all the
> HTTP URLs to ‘as’ parameters in magnet links, so that multiple mirrors could
> be passed using multiple ‘as’ parameters in each magnet link. Then I simply
> relied on Aria2c being amazing at parallelising everything that I had given
> it! I then also implemented progress updates such that APT Daemon could
> reflect where Aria2c was up to.
>
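A minimal sketch of the URL-to-magnet conversion described above, in Python
since APT Daemon is written in it. The function name `build_magnet` and the
mirror hosts are hypothetical. ‘dn’, ‘as’ and ‘xt=urn:btih:’ are commonly
documented magnet parameters, but whether a given Aria2c version actually
honours ‘as’ has not been verified here and should be treated as an
assumption.

```python
from urllib.parse import quote


def build_magnet(name, mirror_urls, btih=None):
    """Return a magnet URI listing each mirror as an 'as' parameter.

    name        -- file name, sent as the 'dn' (display name) parameter
    mirror_urls -- plain HTTP(S) URLs that all serve the same file
    btih        -- optional 40-character hex BitTorrent Info Hash
    """
    parts = []
    if btih is not None:
        # The info hash is only added when the repository metadata provides it.
        parts.append("xt=urn:btih:" + btih.lower())
    parts.append("dn=" + quote(name, safe=""))
    for url in mirror_urls:
        # Percent-encode the whole URL so its own '&', '?' and '/' cannot
        # be confused with the magnet link's separators.
        parts.append("as=" + quote(url, safe=""))
    return "magnet:?" + "&".join(parts)
```

Without a BTIH the link only names mirrors; once the metadata supplies one,
the same function produces a link that could also be used for swarming.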
> The way I implemented this using Aria2c and magnet URLs meant that
> any additional hashes that were known could be used as well. So if the
> server metadata simply added BTIHs, swarming could occur, which in turn
> would massively reduce load on the central servers and allow anyone who
> wants to be a mirror to become one simply by seeding indefinitely. A
> default share ratio of 1.0 means that no user is a burden on
> the network, unless they deliberately change that. Users can donate to the
> running costs of the project simply by increasing their share ratio, which
> adds another means of contribution that they may find easier than the others.
>
> Anyone keen to keep old packages online can simply seed them
> indefinitely, so this is also really great for archival purposes. Even if
> the central project loses interest in the old packages and deletes them,
> anyone else can keep them up. The hashes ensure that they have not been
> tampered with.
>
> There is also a really cool benefit that occurs, or can occur, on a
> LAN. An entire network of computers can all swarm locally with each other,
> so that each package only needs to be downloaded through the metered
> last-mile bottleneck from the WAN precisely once, provided that local
> broadcasting is supported. I think this requires Avahi, and I seem to
> remember that Aria2c supports this, but I am not certain. I don't remember
> ever getting this bit working, but I also did not try hard, because it
> would have required the metadata that I didn't have until after download. So
> even if I had got it working, it would not have been directly useful unless
> the APT repositories that I was using included the BTIHs.
>
> So yeah, loads of great benefits to this architecture, and I
> highly recommend it: convert all existing URLs to magnet links (can be done
> client-side as I did; or server-side); optionally add any additional mirrors
> as additional ‘as’ parameters (again client-side or server-side); add ‘btih’
> parameters to the magnet links (the BTIH must be included in the server
> metadata to get the full benefit of the swarming, but conversion to magnet
> link format can be done client-side or server-side); then simply pass all
> this to a really good parallelising backend such as Aria2c; then update any
> progress data and relay pause, resume, cancel, etc. to the backend.
>
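The "pass it to the backend, then relay pause, resume, cancel and progress"
part of the steps above could be sketched against aria2's JSON-RPC interface
as follows. The method names (`aria2.addUri`, `aria2.pause`, `aria2.unpause`,
`aria2.remove`, `aria2.tellStatus`) and the `seed-ratio` and `checksum`
options come from the aria2 manual; the Python function names are
hypothetical, and the code only builds request bodies rather than talking to
a running daemon. Note that, per the manual, `checksum` applies to
HTTP(S)/FTP downloads only, and a magnet URI has to be the only element of
the `uris` list, so plain mirror URLs are used here.

```python
import json
from itertools import count

_request_ids = count(1)


def rpc_request(method, *params, secret=None):
    """Build one JSON-RPC 2.0 request body for the aria2 daemon."""
    args = list(params)
    if secret is not None:
        # aria2 expects the --rpc-secret value as a leading "token:" parameter.
        args.insert(0, "token:" + secret)
    return {"jsonrpc": "2.0", "id": str(next(_request_ids)),
            "method": method, "params": args}


def add_download(mirror_urls, sha256=None, seed_ratio="1.0", secret=None):
    """Queue one file; every URL in mirror_urls must serve the same file."""
    options = {"seed-ratio": seed_ratio}  # option values are passed as strings
    if sha256 is not None:
        options["checksum"] = "sha-256=" + sha256
    return rpc_request("aria2.addUri", list(mirror_urls), options, secret=secret)


def pause(gid, secret=None):
    return rpc_request("aria2.pause", gid, secret=secret)


def resume(gid, secret=None):
    return rpc_request("aria2.unpause", gid, secret=secret)


def cancel(gid, secret=None):
    return rpc_request("aria2.remove", gid, secret=secret)


def status(gid, secret=None):
    # Ask only for the keys needed to show progress.
    keys = ["status", "totalLength", "completedLength"]
    return rpc_request("aria2.tellStatus", gid, keys, secret=secret)


def progress_percent(status_result):
    """Convert a tellStatus result (lengths arrive as strings) to a percentage."""
    total = int(status_result["totalLength"])
    done = int(status_result["completedLength"])
    return 0.0 if total == 0 else 100.0 * done / total
```

Each dict can be serialised with `json.dumps` and POSTed to the daemon's
JSON-RPC endpoint, which is http://localhost:6800/jsonrpc by default when
the daemon is started with `aria2c --enable-rpc`.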
> One final note, as I am sure that there are a lot of GNUnet fans on
> this list, is that I would try Aria2c first to see how well it can work, and
> then try GNUnet or whatever else once you have a standard to benchmark
> against. Both are Free Software, so no concern there. Aria2c is an
> all-round download manager CLI that works with or without swarming, i.e. it
> is just as good at HTTPS as it is BitTorrent, and can do both at the same
> time. GNUnet has the advantage of working from SHA-256 iirc., which is
> generally already included in the metadata of the repositories of various
> distributions, but I think it lacks a lot of other features and stability and
> ecosystem of alternative backends, compared to the BitTorrent network.
>
> Of course, there is no harm in including other hashes along with
> BTIH, to allow people to experiment with alternative backends, while always
> ensuring that what works works well. Another hash that may be useful to
> include is the Tiger Tree Hash, which is structurally very similar to BTIH,
> but stronger, iirc.
>
> The first thing that the Guix project can do to signal interest in
> this architecture is to simply include the BTIH of each package in the
> repository metadata. Be it in magnet URL form or not does not matter because
> the client can later convert that as needed. The important thing is an
> authoritative statement in metadata that this version of this package has
> this BTIH. Once that metadata is available, the game is on to implement
> swarming support, be it with Aria2c as a backend (as I recommend at least
> starting with) or otherwise.
>
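For reference, a sketch of how a BTIH for one package file could be computed
when generating the repository metadata. It assumes a BitTorrent v1,
single-file torrent whose info dictionary holds only `length`, `name`,
`piece length` and `pieces`; the function names and the 256 KiB default
piece length are illustrative choices, not anything Guix has agreed on.

```python
import hashlib


def bencode(value):
    """Bencode ints, bytes, str (as UTF-8) and dicts with str keys."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, dict):
        # Dictionary keys must appear sorted as raw byte strings.
        items = sorted(((k.encode("utf-8"), v) for k, v in value.items()),
                       key=lambda item: item[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def btih_v1(name, data, piece_length=262144):
    """Return the hex info hash of a single-file v1 torrent for `data`."""
    # 'pieces' is the concatenation of the SHA-1 digest of each piece.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {"length": len(data), "name": name,
            "piece length": piece_length, "pieces": pieces}
    # The BTIH is the SHA-1 of the bencoded info dictionary.
    return hashlib.sha1(bencode(info)).hexdigest()
```

Because the file name and piece length are part of the hashed dictionary, two
parties only arrive at the same BTIH if they agree on both, which is one
reason an authoritative value in the metadata matters.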
> I know from first-hand experience with APT Daemon, which is written in
> Python, that this architecture works well. The only failure I had with it was lack
> of upstream support. So I consider it important to first attain the upstream
> approval before really investing more time into this. I seem to remember
> suggesting this to the Nix project many years ago and didn't get anywhere,
> and now I don't have the energy to try to improve upstream projects if they
> reject my ideas, so I'll be interested to see whether you have any success
> with your attempt to do the same.
>
> Good luck! ;-)
>
> Kind regards,
> James.