Instability with a large number of workers
Short description
Whenever one requests a large number of workers (>100), the Dask cluster becomes unstable: workers fail to spawn, spawn and are killed instantly, or take a very long time to come up. This makes it very hard to utilize hpc05 efficiently.
What I think goes wrong
This is pure speculation, but since the dask-worker logs (see below) complain about TLS and SSL certificates, I believe it is a communication error that arises when too many workers are launched. Incidentally, this happens with the default option worker_cores=1, and I suspect it would not happen with worker_cores=2 (confirmed by Antonio).
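For reference, a minimal sketch of the suspected workaround, assuming Cluster forwards a worker_cores keyword the way dask-gateway cluster options do (the exact keyword name is an assumption, not verified against the dask_quantumtinkerer API):

from dask_quantumtinkerer import Cluster

# Assumption: Cluster accepts worker_cores like a dask-gateway cluster
# option; with 2 cores per worker the instability reportedly does not
# occur (per Antonio).
cluster = Cluster(worker_cores=2)
cluster.scale(100)  # fewer, larger workers for the same total core count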
Minimal code to reproduce the error
from dask_quantumtinkerer import Cluster
import numpy as np
import os

def test(x):
    return x**2

x = np.linspace(0, 1, 1000)

cluster = Cluster()
# the [17:] slice drops the leading part of dashboard_link so the
# dashboard is reachable through the JupyterHub proxy
print(
    "http://io.quantumtinkerer.tudelft.nl/user/"
    + os.environ.get("JUPYTERHUB_USER")
    + "/proxy/"
    + cluster.dashboard_link[17:]
)
cluster.scale(200)  # large number causes instabilities
client = cluster.get_client()
result_ungathered = client.map(test, x)
result = client.gather(result_ungathered)
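To make the failure surface quickly instead of client.gather blocking while workers churn, one could wait for the requested worker count with an explicit timeout. A sketch using distributed's Client.wait_for_workers (the 300-second timeout is an arbitrary choice):

client = cluster.get_client()
# raises a timeout error if 200 workers are not up within 300 s; in the
# unstable regime described above this is expected to trigger
client.wait_for_workers(n_workers=200, timeout=300)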
Error logs
These are the relevant dask-scheduler logs:
2022-10-18 22:14:09,780 - distributed.scheduler - INFO - -----------------------------------------------
2022-10-18 22:14:09,780 - distributed.preloading - INFO - Creating preload: dask_gateway.scheduler_preload
2022-10-18 22:14:09,780 - distributed.preloading - INFO - Import preload module: dask_gateway.scheduler_preload
2022-10-18 22:14:12,650 - distributed.scheduler - INFO - State start
2022-10-18 22:14:12,657 - distributed.scheduler - INFO - -----------------------------------------------
2022-10-18 22:14:12,659 - distributed.scheduler - INFO - Scheduler at: tls://192.168.3.220:40093
2022-10-18 22:14:12,659 - distributed.scheduler - INFO - dashboard at: :36907
2022-10-18 22:14:12,659 - distributed.preloading - INFO - Run preload setup: dask_gateway.scheduler_preload
2022-10-18 22:14:13,174 - dask_gateway.scheduler_preload - INFO - Requesting scale to 200 workers from 0
2022-10-18 22:14:14,052 - distributed.scheduler - INFO - Receive client connection: Client-6eece6f5-4f21-11ed-8404-0242ac12000a
2022-10-18 22:14:14,060 - distributed.core - INFO - Starting established connection
2022-10-18 22:15:30,716 - distributed.comm.tcp - WARNING - Listener on 'tls://0.0.0.0:40093': TLS handshake failed with remote 'tls://192.168.4.122:57954': EOF occurred in violation of protocol (_ssl.c:997)
2022-10-18 22:15:31,275 - distributed.comm.tcp - WARNING - Listener on 'tls://0.0.0.0:40093': TLS handshake failed with remote 'tls://192.168.4.127:37216': [SSL: SSLV3_ALERT_BAD_CERTIFICATE] sslv3 alert bad certificate (_ssl.c:997)
2022-10-18 22:22:31,350 - distributed.comm.tcp - WARNING - Listener on 'tls://0.0.0.0:40093': TLS handshake failed with remote 'tls://192.168.3.216:50426': [SSL: SSLV3_ALERT_BAD_CERTIFICATE] sslv3 alert bad certificate (_ssl.c:997)
2022-10-18 22:22:31,357 - distributed.comm.tcp - WARNING - Listener on 'tls://0.0.0.0:40093': TLS handshake failed with remote 'tls://192.168.3.216:50428': [SSL: SSLV3_ALERT_BAD_CERTIFICATE] sslv3 alert bad certificate (_ssl.c:997)
2022-10-18 22:22:35,360 - distributed.scheduler - INFO - Register worker <WorkerState 'tls://192.168.3.219:33142', name: ad8079d937044431b6c4b5f65e08c7c8, status: init, memory: 0, processing: 0>
2022-10-18 22:22:35,362 - distributed.scheduler - INFO - Starting worker compute stream, tls://192.168.3.219:33142
2022-10-18 22:22:35,362 - distributed.core - INFO - Starting established connection
2022-10-18 22:22:35,542 - distributed.scheduler - INFO - Register worker <WorkerState 'tls://192.168.3.219:40012', name: 1be1baebb7ad423aba2a803ab89659a4, status: init, memory: 0, processing: 0>
2022-10-18 22:22:35,542 - distributed.scheduler - INFO - Starting worker compute stream, tls://192.168.3.219:40012
2022-10-18 22:22:35,542 - distributed.core - INFO - Starting established connection
2022-10-18 22:22:36,023 - distributed.scheduler - INFO - Register worker <WorkerState 'tls://192.168.3.217:34098', name: c1880fb7841c4182af6a0603a099d1d6, status: init, memory: 0, processing: 0>
... (this repeats for all workers)...
2022-10-18 22:22:55,416 - distributed.scheduler - INFO - Remove worker <WorkerState 'tls://192.168.3.202:38863', name: d075a272d25f4871b88ebb14da3b416e, status: running, memory: 0, processing: 0>
2022-10-18 22:22:55,416 - distributed.core - INFO - Removing comms to tls://192.168.3.202:38863
2022-10-18 22:22:55,456 - distributed.scheduler - INFO - Remove worker <WorkerState 'tls://192.168.3.209:38817', name: 0727ce4bc337495b8bfa26b159b1385e, status: running, memory: 0, processing: 0>
2022-10-18 22:22:55,456 - distributed.core - INFO - Removing comms to tls://192.168.3.209:38817
2022-10-18 22:22:55,533 - distributed.scheduler - INFO - Remove worker <WorkerState 'tls://192.168.3.206:39483', name: 334d39b07b25426593fe4cd201ba044c,
...(this also repeats for all workers)...
...(it then repeats the cycle with starting workers and removing them)...
2022-10-18 22:26:13,959 - distributed.scheduler - INFO - Lost all workers
2022-10-18 22:26:22,303 - distributed._signals - INFO - Received signal SIGTERM (15)
2022-10-18 22:26:22,304 - distributed.scheduler - INFO - Scheduler closing...
2022-10-18 22:26:22,304 - distributed.scheduler - INFO - Scheduler closing all comms
2022-10-18 22:26:22,308 - distributed.scheduler - INFO - Stopped scheduler at 'tls://192.168.3.220:40093'
2022-10-18 22:26:22,309 - distributed.scheduler - INFO - End scheduler
Similarly, here are the relevant dask-gateway-worker logs. Note that not all workers show these errors; some look fairly normal and simply get killed.
2022-10-18 22:15:30,720 - distributed.nanny - INFO - Start Nanny at: 'tls://192.168.4.126:33558'
2022-10-18 22:15:31,313 - distributed._signals - INFO - Received signal SIGTERM (15)
2022-10-18 22:15:31,313 - distributed.nanny - INFO - Closing Nanny at 'tls://192.168.4.126:33558'.
2022-10-18 22:15:31,314 - distributed.nanny - INFO - Nanny asking worker to close
2022-10-18 22:15:31,316 - distributed.nanny - INFO - Worker process 2101 was killed by signal 15
2022-10-18 22:15:31,319 - distributed.dask_worker - INFO - End worker
/opt/conda/lib/python3.10/site-packages/distributed/cli/dask_worker.py:266: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
warnings.warn(
2022-10-18 22:13:39,770 - distributed.nanny - INFO - Closing Nanny at 'tls://192.168.4.127:43970'.
2022-10-18 22:13:39,773 - distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 498, in connect
stream = await self.client.connect(
File "/opt/conda/lib/python3.10/site-packages/tornado/tcpclient.py", line 288, in connect
stream = await stream.start_tls(
File "/opt/conda/lib/python3.10/site-packages/tornado/iostream.py", line 1391, in _do_ssl_handshake
self.socket.do_handshake()
File "/opt/conda/lib/python3.10/ssl.py", line 1342, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate is not yet valid (_ssl.c:997)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 481, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 359, in start_unsafe
msg = await self.scheduler.register_nanny()
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1151, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1372, in connect
return await connect_attempt
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1308, in _connect
comm = await connect(
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 291, in connect
comm = await asyncio.wait_for(
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 513, in connect
raise FatalCommClosedError(
distributed.comm.core.FatalCommClosedError: TLS certificate does not match. Check your security settings. More info at https://distributed.dask.org/en/latest/tls.html
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/dask-worker", line 10, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 447, in main
asyncio.run(run())
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 444, in run
[task.result() for task in done]
File "/opt/conda/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 444, in <listcomp>
[task.result() for task in done]
File "/opt/conda/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 419, in wait_for_nannies_to_finish
await asyncio.gather(*nannies)
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 489, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
/opt/conda/lib/python3.10/site-packages/distributed/cli/dask_worker.py:266: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
warnings.warn(