Re: [Savannah-hackers-public] Emacs git repository clone limits


From: Philippe Vaucher
Subject: Re: [Savannah-hackers-public] Emacs git repository clone limits
Date: Sat, 30 May 2020 11:55:45 +0200

Thanks for the amazingly detailed answer!

> > > > Is there a limit and/or maintenance going on? Was I put in some sort of
> > > > throttle-list?
> > >
> > > We do have rate limits in place.  Because otherwise the general
> > > background radiation activity of the Internet will break things.
> > >
> > > However, nothing has changed with regard to the rate limits for a
> > > long time.  As I look at the logs, the last rate limit change was Sat
> > > Dec 7 06:31:50 2019 -0500, which seems long enough ago that it isn't
> > > anything recent.  Meaning that this is probably simply you
> > > competing with other users on the Internet for resources.
> >
> > Until recently I only did up to ~8 concurrent git clones, but with
> > recent infrastructure changes I'm able to do many more.
>
> For a high level of parallelism I would definitely try to fan out using
> local resources.  It would be much more reliable.

I agree.


> > > Cloning over the https transport uses git-http-backend on the server
> > > side.  We are using Nginx rate limiting; you can read about how the
> > > algorithm works in the Nginx documentation.  It is basically a
> > > smoothing process.
> >
> > Until recently I was cloning git://git.sv.gnu.org/emacs.git; the
> > switch to https is an attempt at working around the limitation I hit
> > recently.  My train of thought was that http:// is easier to scale than
> > git://; if you say otherwise I can go back to git:// clones.
>
> Well...  It's not simple!  And also things change with time and server
> resources and configuration.  So any answer I were to state today
> would mutate into a wrong answer at some different point in time.
>
> I understand your reasoning.  And if this were infrastructure at,
> say, Amazon AWS EC2, set up to be elastic and scaled out, then your
> reasoning would be totally on target.  Increased load would scale out
> to more parallel resources.  It is more typical to have a load
> balancer in front of http/https protocol servers, so those can be
> set up in a rather straightforward way.  Load balancers could be set
> up in front of git:// protocol servers too.
>
> However the GNU Project is dedicated and directed to using only Free
> Software, and is supported in this goal by the FSF.  Which means
> everything is self-hosted, funded by the annual fundraising from
> donors like all of us who contribute, and resources are
> limited.  Note that GitHub is not Free Software, therefore we cannot
> endorse its use.  Same thing with Amazon AWS.  Although regardless of
> this we know that many users do use them anyway.

Makes sense.

> Let's talk about the technical details of the differences between git
> protocol servers and http/https protocol servers.
>
> The abbreviated details of the git-daemon is that it is running as the
> 'nobody' user like this.
>
>   git daemon --init-timeout=10 --timeout=28800 --max-connections=15 --export-all --base-path=/srv/git --detach
>
> In this configuration git-daemon acts as the supervisor process
> managing its children.  The limit values were learned from experience:
> too many connections were causing failures, so the limits were tuned
> lower to prevent this.
>
> When connections exceed the max-connections limit they will queue up
> to the kernel limit /proc/sys/net/core/somaxconn (default 128) and
> be serviced as the daemon is able.  For any connection beyond that
> 128-entry queue the client will get a connection failure.  The
> behavior at that point is client dependent.  It might retry.

Interesting. So you're saying the timeouts I had when using git://
meant that the queue was too long.
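
If I understand it right, that backlog is just a kernel setting, so it can be
inspected (and raised, with root) along these lines. This is only a sketch:
the 512 is an arbitrary example value, and I'm assuming the daemon's own
listen backlog would also need to be large enough to benefit.

    # Current accept-queue (listen backlog) cap; the default you mention is 128
    cat /proc/sys/net/core/somaxconn

    # Raise it (needs root); 512 is just an example value
    sysctl -w net.core.somaxconn=512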

> The nginx configuration is this:
>
>         location /git/ {
>                 autoindex on;
>                 root /srv;
>                 location ~ ^/git(/.*/(info/refs|git-upload-pack)$) {
>                         gzip off;
>                         include fastcgi_params;
>                         fastcgi_pass unix:/var/run/fcgiwrap.socket;
>                         fastcgi_param SCRIPT_FILENAME /usr/local/sbin/git-http-backend;
>                         fastcgi_param PATH_INFO $1;
>                         fastcgi_param GIT_HTTP_EXPORT_ALL true;
>                         fastcgi_param GIT_PROJECT_ROOT /srv/git;
>                         client_max_body_size 0;
>                 }
>         }
>
> Looking at this now I see there is no rate limit being applied to this
> section.  Therefore what I mentioned previously applies to the cgit
> and gitweb sections which have been more problematic.  With no rate
> limits all clients will be attempted.  Hmm...  I think that may have
> been a mistake.  It is possible that adding a rate limit will smooth
> the resource use and actually improve the situation.  The cgit and
> gitweb sections use a "limit_req zone=one burst=15;" limit.  cgit in
> particular is resource intensive for various reasons.  I'll need to do
> some testing.

So, where did my 502/504 errors come from? Each job was retried 3
times, with a 5-second delay. I'd understand some of them failing but
not all of them.


> When you are seeing proxy gateway failures I think it most likely that
> the system is under resource stress and is unable to launch a
> git-http-backend process within the timeouts.  This resource stress
> can occur as a sum total of everything that is happening on the server
> at the same time.  It includes git://, http(s)://, and also svn and
> bzr and hg.  (Notably all of the CVS operations are on a different VM,
> though likely on the same host server.)  All of those are running on
> this system and when all of them coincidentally spike usage at the same
> time they will compete with each other for resources.  The system
> will run very slowly.  I/O is shared.  Memory is shared.

Ah, ignore my question above then :-) Interesting!

> Among other things the current VM has Linux memory overcommit
> enabled.  Which means that the OOM (Out of Memory) Killer is triggered
> at times.  And when that happens there is no longer any guarantee that
> the machine is in a happy state.  Pretty much it requires a reboot to
> ensure that everything is happy after the OOM Killer is invoked.  The
> new system has more resources and I will be disabling overcommit which
> avoids the OOM Killer.  I strongly feel the OOM killer is
> inappropriate for enterprise-level production servers.  (I would have
> sworn it was already disabled.  But looking a bit ago I saw that it
> was enabled.  Did someone else enable it?  Maybe.  That's the problem
> of cooking in a shared kitchen.  Things move around and it could have
> been any of the cooks.)

Good lead, maybe the parallel git clones use too much memory and
basically each one of them eventually gets killed.
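
For what it's worth, I believe checking and switching the overcommit policy
is also just a sysctl; a sketch, relying on my reading of the kernel
documentation (0 = heuristic overcommit, 1 = always overcommit, 2 = strict
accounting, which is what avoids the OOM killer):

    # Show the current overcommit policy
    cat /proc/sys/vm/overcommit_memory

    # Strict accounting: allocations fail instead of the OOM killer firing later
    sysctl -w vm.overcommit_memory=2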

> > > Whenever I have set up continuous integration builds I always set up a
> > > local mirror of all remote repositories.  I poll with a cronjob to
> > > refresh those repositories.  Since the update is incremental it is
> > > pretty efficient for keeping the local mirrors up to date.  Then all
> > > of the continuous integration builds pull from the local mirror.
> > >
> > > This has a pretty good result in that the LAN is very robust and only
> > > shows infrastructure failures when there is something really
> > > catastrophic happening on the local network.  Since 100% of everything
> > > is local in that case.

Yes, looks like I have to go that route. It also makes sense design-wise.
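
Probably a bare local mirror refreshed from cron, with the builds cloning
from that. A sketch, with placeholder paths and schedule:

    # One-time setup of a local bare mirror (destination path is a placeholder)
    git clone --mirror git://git.sv.gnu.org/emacs.git /srv/mirrors/emacs.git

    # Cron entry to refresh it incrementally, e.g. once per hour:
    # 0 * * * *  git -C /srv/mirrors/emacs.git remote update --prune

    # The builds then clone from the mirror instead of from Savannah
    git clone /srv/mirrors/emacs.git emacs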

> > It's what I actually used back in the day: the Dockerfile didn't
> > clone the repository but copied an already checked-out repository
> > into the image. That has all the advantages you cited, but cloning
> > straight from your repository makes my images more trustworthy because
> > the user sees that nothing fishy is going on.
>
> Since git commits are hash ids there should be no difference in the
> end.  A commit with a given hash id will be the same regardless of how
> it arrived there.  I don't see how anyone can say anything fishy is
> happening.  I might liken it to newsgroups.  It doesn't matter how an
> article arrives; it may have come from any of a number of routes.  It
> will be the same article regardless.  With git the hash id ensures
> that the object content is identical.
>
> > Also he can just take my Dockerfile and build it directly without
> > having to clone something locally first.
>
> I didn't quite follow the why of this being different.  Generally I
> would like to see cpu effort distributed so that it is amortized
> across all participants as much as possible.  As opposed to having it
> lumped.  However if something can be done once instead of done
> repeatedly then of course that is better for that reason.  Since I
> didn't quite follow the detail here I can only comment with a vague
> hand waving response that is without deep meaning.

I take it you are not really familiar with Docker and Dockerfiles, and
that's why you don't really understand why I'm making a point about
having the clone as "clean" as possible.

In the Docker world you have images, which are essentially an entire OS
plus, usually, one program. Then you can run these images and get
complete reproducibility no matter where you run them, as all the
dependencies are bundled together.

To build these images you use a Dockerfile, which contains the
instructions to build the image. Thus when you download one of these
images, it's common to go have a look at how it is built. If you see
one Dockerfile that simply clones a repository, versus another that
copies a local directory which you are told is a clone of the
repository, you tend to trust the first one more.
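
To make that concrete, here is roughly what I mean, written as a shell sketch
(the base image, paths and tag are placeholders; the apt-get line is only
there because the base image is assumed not to ship git):

    # Write a minimal Dockerfile that clones straight from Savannah, so anyone
    # reading it can see exactly where the source comes from
    printf '%s\n' \
        'FROM debian:stable' \
        'RUN apt-get update && apt-get install -y git' \
        'RUN git clone git://git.sv.gnu.org/emacs.git /opt/emacs' \
        > Dockerfile

    docker build -t emacs-from-savannah .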

> > To be honest I think my realistic alternative here is to find the
> > right clone limit (4? 8? 20? depending on the hour of the day) and use
> > one which is reasonable in terms of build time and abuse
> > of your servers. The images are usually only built once per day, and
> > because it's all cached they are only rebuilt when the base image
> > changes, which is like once per month. So most of the time I do *not*
> > clone anything from your repositories... that's when I'd like all the
> > images building in parallel, but when suddenly each of the images
> > requires a clone then that's where I'd like at most 2 images building
> > simultaneously to ensure it works.
>
> Another time honored technique is to wrap "the action" with a retry
> loop.  Try.  If failed then sleep for a bit and retry.  As long as the
> entire retry loop succeeds then report that as a success not a
> failure.  Too bad git clone does not include such functionality by
> default.  But it shouldn't be too hard to apply.
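
Indeed, a wrapper along those lines is short; a sketch, with placeholder
retry count, delay and URL:

    # Retry a command a few times with a pause between attempts
    retry() {
        local attempt
        for attempt in 1 2 3; do
            "$@" && return 0
            sleep 30
        done
        return 1
    }

    retry git clone git://git.sv.gnu.org/emacs.git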

Already done, it didn't change anything. My guess is that the parallel
git clones trigger the OOM killer and once one fails all the others fail
too. Because the retries also run in parallel, the number of concurrent
git clones stays too high and they fail again. The only thing I can do
here is limit the number of clones drastically or use a local repository
as you mentioned.


> > I just had this thought that maybe I could play man-in-the-middle with
> > /etc/hosts and make believe git.sv.gnu.org is a local repository, and
> > once per day I sync that local repo with the real one. That way the
> > Dockerfile would appear to clone the real repo yet caching would be
> > done.
>
> Clever.  But is it needed?  You could easily have multiple remotes.
> It doesn't matter which you clone from.  The hashes will be the same.
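
Right. For example, cloning from a local mirror and adding Savannah as a
second remote gives the same objects either way; a sketch, with a placeholder
mirror path:

    git clone /srv/mirrors/emacs.git emacs
    cd emacs
    git remote add savannah git://git.sv.gnu.org/emacs.git
    git fetch savannah

    # If both are up to date these print the same hash, i.e. the same objects
    git rev-parse origin/master savannah/master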


> Among the arguments between the git protocol and the https protocol,
> many people are worried about agents injecting malicious code into the
> unencrypted git stream.  I am not sure this is possible, due to git's
> hashing, but with https it is prevented.  Because of this there is an
> effort to use https everywhere.  I am not sure it is possible to
> successfully inject code into a git protocol stream.
>
> Using https everywhere has a nice appeal.  Until one is trying to sort
> things out on the server side, trying to differentiate attacks, abuse,
> and valid use.  It's all mixed together.  And it is all happening
> continuously.  It is like standing under a water fall trying to figure
> out where the broken water pipe is located.

Yes, that was also one of the reasons for me to switch to HTTPS.

> You might consider using ssh transfer protocol.  Since that is all
> authenticated member access.  It is encrypted and therefore avoids
> injection attacks.  It does require a valid member account to hold the
> ssh keys.

First of all I'd need to have member access, and using secrets in the
Dockerfile is tricky. Second of all it'd mean my Dockerfiles are only
buildable by me, which is kind of against the purpose of Dockerfiles.
Thanks for the idea though.

> I know this has been a long and rambling email.  I salute you for
> having reached the end of it. :-)
>
> Bob

Thanks a lot! It was a nice read.

Philippe


