bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#48468: substitute server connection timeout


From: Christopher Baines
Subject: bug#48468: substitute server connection timeout
Date: Sun, 16 May 2021 19:26:22 +0100
User-agent: mu4e 1.4.15; emacs 27.1

Mathieu Othacehe <othacehe@gnu.org> writes:

> Hello,
>
> We recently have a lot of those errors on Cuirass:
>
> --8<---------------cut here---------------start------------->8---
> guix substitute: warning: while fetching 
> http://141.80.167.131:5557/nar/g7ka09613k5v1vlznh87yg35905ggw51-python2-scipy-1.2.2-guile-builder:
>  server is somewhat slow
> guix substitute: warning: try `--no-substitutes' if the problem persists
> guix substitute: error: connect*: Connection timed out
> --8<---------------cut here---------------end--------------->8---
>
> which means that the workers are failing to connect to the Cuirass
> remote-server publish server on berlin at 141.80.167.131:5557.
>
> Stracing this publish server shows that connection reuse seems to be
> broken:
>
> --8<---------------cut here---------------start------------->8---
> accept4(9, {sa_family=AF_INET, sin_port=htons(41742), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41744), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41746), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41748), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41750), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41752), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41754), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41756), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41758), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 26
> accept4(9, {sa_family=AF_INET, sin_port=htons(41760), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41762), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41764), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41766), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41768), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 22
> accept4(9, {sa_family=AF_INET, sin_port=htons(41770), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41772), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41774), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41776), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41778), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41780), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41782), 
> sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> --8<---------------cut here---------------end--------------->8---
>
> Investigating it, I found that the connection is closed and opened
> multiple times in the call-with-cached-connection procedure of the (guix
> script substitute) module.
>
> It looks like its because a 'bad-headers exception is raised when trying
> to parse an eof object:
>
> --8<---------------cut here---------------start------------->8---
> ;;; (error bad-header (read-header-line #<eof>))
> --8<---------------cut here---------------end--------------->8---
>
> I'm not sure where this eof comes from. There is this comment in the
> http-multiple-get procedure in (guix http-client):
>
> --8<---------------cut here---------------start------------->8---
> ;; Swallow networking errors that could occur due to connection reuse
> ;; and the like; they will be handled down the road when trying to
> ;; read responses.
> (false-if-networking-error
>  (begin
>    (for-each (cut write-request <> buffer) batch)
>    (put-bytevector p (get))
>    (force-output p))))
> --8<---------------cut here---------------end--------------->8---
>
> which would suggest that connection reuse could cause networking errors?
>
> What also puzzles me it that the main guix publish server on berlin does
> not seem to present this issue. That would indicate that this error is
> caused by how the Cuirass remote-server publish server is started or
> configured.
>
> Ludo, Chris, any idea?

While I've been working in this area, I've actually been trying to pick
apart the connection caching, since the single thread assumption doesn't
hold in the Guix Build Coordinator.

Anyway, I do have a theory. Assuming I'm correct in saying that there's
no nginx between the client and publish server here, I think that's your
configuration difference.

For ci.guix.gnu.org, as well as data.guix.gnu.org, it's NGinx which is
keeping connections alive. I'm not sure the Guile code for the publish
server does similarly, so talking to it directly might be different.

That's on the server side, the actual problem is probably on the client
side, as I guess there are possibly places where closed connections
aren't handled properly. This reminds me I sent some patches relating to
closing connections, this could well be related [1].

1: https://issues.guix.gnu.org/47174

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]