Re: [Sks-devel] wserver_timeout value causing cascading failure?

From:	Jonathon Weiss
Subject:	Re: [Sks-devel] wserver_timeout value causing cascading failure?
Date:	Mon, 24 Apr 2017 15:33:03 -0400

Daniel,

I'm pulling your questions into this thread, which I started before
seeing your mail:

For reference, I can download this key without a problem.  While I'm
topologically closer to pgp.mit.edu than you are, I believe the 1s
timeout should only count the time passing the info to Apache, not all
the way back to you (but please correct me if you think I'm wrong here).
If it is, in fact, taking more than 1s to transfer extremely large keys
from SKS to Apache, then I'm somewhat between a rock and a hard place
here.  If you go back and try again now, are you still seeing the
problem?

As noted, I dropped this timeout from 4s to 1s last week to deal with
the cascading failure described below.

The reverse proxy is Apache, but it is SKS' wserver_timeout that is set
to 1s.  

Any thoughts and advice would be welcome here.  I have a couple, but
they are either of dubious effectiveness, or relatively drastic / much
slower to implement.

        Jonathon



> Hey all--
> 
> If you look at
> https://pgp.mit.edu/pks/lookup?op=get&search=0xF2AD85AC1E42B367
> 
> it appears that there's some sort of proxy failure.
> 
> the end of it looks like:
> 
> -----------
> VcZqMYLvC976Pel/3NSXRKBrgVVWvoiEvH/Zaxxy1RjpRBWomzGInAQQAQIABgUCVu5/IwAK
> CRA+Efo9IPZM80EQA/9xJ2QKliIrKvAnWejhEEGmJvph+XWbBkwEHHTEnhMqpeZx1OJYwqpp
> CmWVVaXxY4ch8nNOvs3F0qPCZ3FkM1Zr4ghxfL2ir+or+4N8j1MyX0lkEtsbyG0AumTjXz+4
> NKO9Sw+KsjBDhOlJsokKLQ3gpHTTP/1.0 408 Request Timeout
> Server: sks_www/1.1.5
> Cache-Control: no-cache
> Pragma: no-cache
> Expires: 0
> Content-length: 599
> Content-type: text/html; charset=UTF-8
> Access-Control-Allow-Origin: *
> 
> Time Out
> Error handling request (GET /pks/lookup?op=get&search=0xF2AD85AC1E42B367): 
> Timed out after 1 seconds
> -----------
> 
> 
> I don't know what kind of reverse proxy is in place here, or why it's
> behaving in this way, but it doesn't seem like this is a healthy
> keyserver (at least for larger keys like this one).
> 
> Any thoughts on how we should check for or fix this kind of failure ?
> should we pull pgp.mit.edu from the pool until it's resolved?
> 
>           --dkg
> 
> 
> 
> 
> 
> Jonathon Weiss <address@hidden> wrote:
> 
> > 
> > Hi All,
> > 
> > As the maintainer of what is probably the most heavily used key-server
> > on the net, I've run into a problem that I wanted to discuss here.
> > 
> > An important note here is that I'm using Apache as a proxy for SKS (on
> > 80, 443, and 11371).
> > 
> > If I understand how SKS works, it can accept and hold onto multiple
> > client connections at once, but only processes them serially.
> > 
> > I think what's going on is something like the following:
> > 
> > 1) multiple client connections come in and are passed from Apache to
> >    SKS (possibly while SKS is working on a previous query).
> > 
> > 2) SKS works on the first query and returns the answer
> > 
> > 3) for some reason the owner of the second query has disappeared (I
> >    assume this is because the client gives up, and maybe hits reload or
> >    something, and Apache notices that the client is gone and drops all
> >    connection state)
> > 
> > 4) SKS waits 'wserver_timeout' (default 60) seconds, and gives up and
> >    goes on to the next connection.
> > 
> > 5) The next client gave up during the timeout, and the problem expands
> >    out of control.
> > 
> > One obvious way to break out of this cycle is if you have a long
> > enough period of time where no requests come in, that all of the
> > timeouts for existing connections can be resolved.  On a mostly idle
> > server, this may be fairly easy to achieve (especially if queries
> > normally arrive at a rate of less than one per minute).
> > 
> > I have no idea what the average request rate is for a pool member, but
> > pgp.mit.edu handles 125k-175k /pks/lookup queries a day (or in round
> > numbers, roughly 1.5 - 2 queries per second).  Obviously, that doesn't
> > leave a lot of windows for long timeouts.
> > 
> > My solution has been to set wserver_timeout=1 (and some less effective
> > timeout tuning on the Apache side), on the theory that Apache running
> > on the same server ought to be able to hand off the query really
> > quickly.  It will take a few more problem free days for me to be fully
> > confident, but wserver_timeout=1 very much looks like it has solved
> > the problem.  For a while I was running with wserver_timeout=4, but
> > that proved insufficient.
> > 
> > 
> > This all leaves me with several questions:
> > 
> > 1) Does anyone see any flaws in my analysis or work-around?
> > 
> > 2) Has anyone else encountered anything like this?
> > 
> > 3) Any suggestions on what to do if/when wserver_timeout=1 becomes
> >    insufficient?
> > 
> > 4) Any chance of detecting this sort of problem in sksd and skipping
> >    the timeout altogether?
> > 
> > 
> >     Jonathon
> > 
> >     Jonathon Weiss <address@hidden>
> >     MIT/IS&T/Infrastructure Design & Engineering
> >     Cloud Platforms (Server Operations)
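
The cascade described in steps 1-5 of the quoted message can be
illustrated with a toy model.  This is only a sketch under stated
assumptions, not a description of how sksd is actually implemented: one
strictly serial server, requests arriving every 0.5s (the "2 queries per
second" figure above), and a single client that vanishes before it is
served.  The 0.05s service time, the 3s client patience, and the function
name simulate are all hypothetical values chosen for illustration.

```python
def simulate(timeout, n_requests=200, interval=0.5, service=0.05,
             patience=3.0, vanished=(10,)):
    """Toy model of a strictly serial server; returns requests answered.

    All parameters except `timeout` are illustrative assumptions:
    interval -- seconds between arrivals (0.5 = 2 requests/second)
    service  -- seconds to answer a request whose client is still there
    patience -- seconds a client waits in the queue before giving up
    vanished -- indices of clients that are gone regardless of wait time
    """
    clock = 0.0   # time at which the server is next free
    served = 0
    for i in range(n_requests):
        arrival = i * interval
        start = max(clock, arrival)   # wait until the server is free
        waited = start - arrival
        if i in vanished or waited > patience:
            # Client is gone: the server blocks for the full timeout.
            clock = start + timeout
        else:
            clock = start + service
            served += 1
    return served


if __name__ == "__main__":
    for t in (60, 4, 1):
        print(f"timeout={t:>2}s: {simulate(t)} of 200 requests answered")
    # timeout=60s: 10 of 200 requests answered
    # timeout= 4s: 10 of 200 requests answered
    # timeout= 1s: 199 of 200 requests answered
```

In this model a stall only snowballs when the timeout is longer than the
clients' patience: with a 60s or 4s timeout, every client queued behind
the vanished one has also given up by the time the server reaches it,
while with a 1s timeout the backlog drains.  The real thresholds depend on
actual client behaviour and request rates, which this sketch does not
measure.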
