[no subject]
From: Mathieu Othacehe
Date: Sun, 20 Nov 2022 12:32:59 -0500 (EST)
branch: master
commit fc1641381d2a8a0472a71ef5ad2b64361faaaab4
Author: Mathieu Othacehe <othacehe@gnu.org>
AuthorDate: Sun Nov 20 18:21:42 2022 +0100
remote-worker: Prevent a dead-hang on server disconnection.
This is a follow-up to 1fb4b0ac1297e9bd680d0f4a356ce3050b27f913, which tried to
work around the remote-worker hangs by introducing a non-blocking read.
That solution was problematic: while the server is unresponsive, the
request-work requests are queued on the worker, and when the server comes back
online, all of them are sent to the server.
Instead, use the ZMQ_PROBE_ROUTER option, which causes the server to send an
empty bootstrap message to the worker when a connection is established. This
empty message unblocks the workers that were hanging on the request-work
response.
* src/cuirass/scripts/remote-server.scm (zmq-start-proxy): Set the
ZMQ_PROBE_ROUTER option on the build socket.
* src/cuirass/scripts/remote-worker.scm (start-worker): Ignore the bootstrap
message when reading server info.  When a bootstrap message is received
while waiting for a request-work response, keep going.
---
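The worker-side dispatch described above can be sketched in isolation as
follows.  This is only an illustration under stated assumptions, not the
Cuirass code: classify-message and empty-frame are hypothetical names, and
plain lists of bytevectors stand in for the parts that
zmq-get-msg-parts-bytevector would return.

```scheme
(use-modules (ice-9 match)
             (rnrs bytevectors))

;; Hypothetical helper, not part of Cuirass: classify the list of message
;; parts that a worker reads from its socket.
(define (classify-message parts)
  (match parts
    ;; One part only: treated as the bootstrap message sent because of
    ;; ZMQ_PROBE_ROUTER.  The pattern checks the number of parts, not
    ;; their content.
    ((empty) 'bootstrap)
    ;; Two parts: a delimiter followed by a payload, i.e. a regular reply.
    ((empty command) (list 'command (utf8->string command)))
    ;; Anything else is unexpected in this sketch.
    (_ 'unknown)))

;; Stand-in for the empty frame of a ZeroMQ message.
(define empty-frame (make-bytevector 0))

(display (classify-message (list empty-frame)))
(newline)
(display (classify-message (list empty-frame (string->utf8 "build"))))
(newline)
```

Because a one-part message matches its own clause, a worker waiting for a
request-work response can log the bootstrap message and loop again instead
of blocking.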
src/cuirass/scripts/remote-server.scm | 4 ++++
src/cuirass/scripts/remote-worker.scm | 7 +++++++
2 files changed, 11 insertions(+)
diff --git a/src/cuirass/scripts/remote-server.scm b/src/cuirass/scripts/remote-server.scm
index 8843a95..c168318 100644
--- a/src/cuirass/scripts/remote-server.scm
+++ b/src/cuirass/scripts/remote-server.scm
@@ -469,6 +469,10 @@ frontend to the workers connected through the TCP backend."
(poll-items (list
(poll-item build-socket ZMQ_POLLIN))))
+ ;; Send bootstrap messages on worker connection to wake up the workers
+ ;; that were hanging waiting for request-work responses.
+ (zmq-set-socket-option build-socket ZMQ_PROBE_ROUTER 1)
+
(zmq-bind-socket build-socket (zmq-backend-endpoint backend-port))
(zmq-bind-socket fetch-socket (zmq-fetch-workers-endpoint))
diff --git a/src/cuirass/scripts/remote-worker.scm b/src/cuirass/scripts/remote-worker.scm
index af1eb2d..37c8afe 100644
--- a/src/cuirass/scripts/remote-worker.scm
+++ b/src/cuirass/scripts/remote-worker.scm
@@ -329,6 +329,10 @@ and executing them. The worker can reply on the same socket."
(string->bv (zmq-worker-request-info-message)))))
(define (read-server-info socket)
+ ;; Ignore the bootstrap message sent due to the ZMQ_PROBE_ROUTER option.
+ (match (zmq-get-msg-parts-bytevector socket '())
+ ((empty) #f))
+
(request-info socket)
(match (zmq-get-msg-parts-bytevector socket '())
((empty info)
@@ -379,6 +383,9 @@ and executing them. The worker can reply on the same socket."
(log-info (G_ "~a: request work.") (worker-name wrk))
(request-work socket worker)
(match (zmq-get-msg-parts-bytevector socket '())
+ ((empty)
+ (log-info (G_ "~a: received a bootstrap message.")
+ (worker-name wrk)))
((empty command)
(run-command (bv->string command) server
#:reply (reply socket)