Error in launching worker
On a recent build of research-docker (https://gitlab.kwant-project.org/qt/research-docker/-/pipelines/39135), Dask workers fail to start with a cryptic error message:
```
[sostroukh@hpc05:~]$ cat dask-gateway-worker.e158914
distributed.nanny - INFO - Start Nanny at: 'tls://192.168.3.215:35493'
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 728, in _run
    worker = Worker(**worker_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 489, in __init__
    os.makedirs(local_directory)
  File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
distributed.nanny - INFO - Worker process 46997 exited with status 1
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x2adceb939ca0>>, <Task finished name='Task-3' coro=<Nanny._on_exit() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/nanny.py:440> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/nanny.py", line 443, in _on_exit
    await self.scheduler.unregister(address=self.worker_address)
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 861, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 660, in send_recv
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 513, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.8/site-packages/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/opt/conda/lib/python3.8/site-packages/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.nanny - INFO - Closing Nanny at 'tls://192.168.3.215:35493'
```
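The proximate failure is easy to reproduce in isolation: `os.makedirs` on an empty string raises exactly this error, so the `Worker` was evidently constructed with `local_directory == ''`. (The second traceback is most likely just a follow-on error: the worker never registered, so the nanny's cleanup passes `worker_address=None` to the scheduler.)

```python
import os

# os.makedirs("") raises FileNotFoundError (Errno 2), matching the log above:
# the Worker evidently received an empty local_directory.
try:
    os.makedirs("")
except FileNotFoundError as exc:
    print(exc)  # [Errno 2] No such file or directory: ''
```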
The PBS script used to launch the worker is:
```sh
#!/bin/sh
#PBS -N dask-gateway-worker
#PBS -v DASK_GATEWAY_API_TOKEN,DASK_GATEWAY_API_URL,DASK_GATEWAY_CLUSTER_NAME,DASK_GATEWAY_TLS_CERT,DASK_GATEWAY_TLS_KEY,DASK_GATEWAY_WORKER_NAME
#PBS -l nodes=1:ppn=1,mem=2048MB
singularity exec research-docker.simg dask-gateway-worker --memory-limit 2147483648
```
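If the empty directory ultimately comes from dask's `temporary-directory` config resolving to an empty string inside the container, a workaround worth trying (untested; `DASK_TEMPORARY_DIRECTORY` is my reading of how dask maps the `temporary-directory` config key onto an environment variable) would be to set it explicitly before launching the worker:

```sh
# Added line: point dask's temporary-directory at the node-local scratch dir
export DASK_TEMPORARY_DIRECTORY="${TMPDIR:-/tmp}"
singularity exec research-docker.simg dask-gateway-worker --memory-limit 2147483648
```

Whether the variable actually propagates into the container may depend on the singularity configuration (a `SINGULARITYENV_` prefix may be needed).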
I don't see anything suspicious here. I suspect the problem was introduced by the update of distributed from 2.18.0 to 2.20.0.
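My guess at the mechanism, sketched below (a hypothetical reconstruction, not the actual distributed code; `resolve_local_directory` is a made-up name): if the 2.20 code substitutes a default temporary directory only when `local_directory` is `None`, then an empty string supplied by dask-gateway would bypass the fallback and reach `os.makedirs('')`.

```python
import tempfile

def resolve_local_directory(local_directory=None):
    # Hypothetical reconstruction of the suspected fallback logic:
    # the default kicks in only for None, so an explicit "" (e.g. an
    # unset config value rendered as an empty string) slips through.
    if local_directory is None:
        local_directory = tempfile.gettempdir()
    return local_directory

print(resolve_local_directory(None))  # a usable temp dir, e.g. /tmp
print(resolve_local_directory(""))    # "" -- exactly what os.makedirs chokes on
```

If this is what happens, the fix on the distributed side would be to treat any falsy `local_directory` as unset.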