Need help! I am able to train models on our local machines and log everything to the ClearML server without any issues, but the same training gets stuck when I use remote training. The logs do not provide any useful information, and the last lines in the logs are as follows:

"Installing collected packages: Cython
Successfully installed Cython-0.29.34". This happens only when I launch the remote agent with the --gpu 0 parameter. The GPU on the machine is an NVIDIA A40, and I can run nvidia-smi and see its output. When I run a script that just prints "Hello" without --gpu, it works, and the logs on ClearML also show the connected GPU device. I am running the agent on Ubuntu Server 22.04 LTS without Docker. The machine is completely disconnected from the internet, and we have a proxy for apt and pip packages. Let me know if you need more details.

Posted one month ago

Answers 13

Hi @<1562973095227035648:profile|ThoughtfulOctopus83> , what version of ClearML server are you running? Also, which versions of clearml and clearml-agent?

Posted one month ago

@<1562973095227035648:profile|ThoughtfulOctopus83> I assume this machine is also connected to the clearml server?

Posted one month ago

Yes, the machine is connected to the on-prem ClearML server.

Posted one month ago

Can you share the complete logs?

Posted one month ago

@<1523701087100473344:profile|SuccessfulKoala55> Yes, this is the end of the logs and nothing happens after it. I launch the agent with this command: clearml-agent daemon --detached --gpu 0 --queue A40

Posted 17 days ago

@<1562973095227035648:profile|ThoughtfulOctopus83> is this the end of the log? Nothing after it? How exactly are you launching the agent?

Posted 17 days ago

Also, run it with --foreground instead of the --detached option, to debug it.

Posted 17 days ago

@<1523701087100473344:profile|SuccessfulKoala55> Any idea why it goes out to the internet (download.pytorch.org) only when I run training with the PyTorch framework?

Posted 15 days ago

@<1523701087100473344:profile|SuccessfulKoala55> it works once I allow traffic to download.pytorch.org through the proxy. 🙂
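
To confirm a fix like this before re-running the task, a small reachability check through the proxy can help. This is only a sketch using Python's standard urllib; the helper names make_proxy_opener and is_reachable are made up for illustration, and the proxy address in the note below is a placeholder, not a value from this thread.

```python
import urllib.error
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def is_reachable(url, opener, timeout=10):
    """Return True if url produces any HTTP response, False if the connection fails."""
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status (e.g. 403 or 404) still means a response came back.
        # HTTPError must be caught before URLError because it is a subclass of it.
        return True
    except (urllib.error.URLError, OSError):
        # Refused connections, failed proxy tunnels and timeouts end up here.
        return False
```

For example, is_reachable("https://download.pytorch.org", make_proxy_opener("http://proxy.example:3128")) would be expected to return True once the proxy allows that host. Note that for plain http:// URLs a proxy's own error page also counts as a response, so the check is most meaningful for https:// targets.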

Posted 12 days ago

Can you add --debug?

Posted 17 days ago

@<1523701087100473344:profile|SuccessfulKoala55> Sorry for the delayed reply. I have attached the logs; the issue only happens when I do ML training with PyTorch. Training with other frameworks, such as TensorFlow and sklearn, works fine.

Posted 17 days ago

@<1523701087100473344:profile|SuccessfulKoala55> after enabling debug mode, the logs are below. Just to let you know, this agent does not have internet access and pip packages are installed via the proxy, which is working, but for PyTorch it seems to be going to the internet:
"DEBUG:urllib3.connectionpool: http://api.clearml.domain.com:80 "GET /v2.5/tasks.started HTTP/1.1" 200 353
Executing task id [d3807deae2644e00824e774ff8997eaa]:
repository =
branch =
version_num =
tag =
docker_cmd =
entry_point = pytorch.py
working_dir = .

DEBUG:clearml_agent.commands.worker:Searching for python3.7
DEBUG:clearml_agent.commands.worker:Searching for python3
DEBUG:clearml_agent.commands.worker:Searching for python
WARNING:clearml_agent.commands.worker:Python executable with version '3.7' requested by the Task, not found in path, using '/usr/bin/python3' (v3.10.6) instead
NoneType: None
created virtual environment CPython3.10.6.final.0-64 in 134ms
creator CPython3Posix(dest=/home/adminvj/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/adminvj/.local/share/virtualenv)
added seed packages: pip==23.1, setuptools==67.6.1, wheel==0.40.0
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

INFO:clearml_agent.commands.worker:found literal script in script.diff
DEBUG:clearml_agent.commands.worker:selected execution directory: /home/adminvj/.clearml/venvs-builds/3.10/code

Looking in indexes: https://artifacts.domain.com/repository/pypi/simple
Ignoring pip: markers 'python_version < "3.10"' don't match your environment
Collecting pip<22.3
Using cached https://artifacts.domain.com/repository/pypi/packages/pip/22.2.2/pip-22.2.2-py3-none-any.whl (2.0 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 23.1
Uninstalling pip-23.1:
Successfully uninstalled pip-23.1
Successfully installed pip-22.2.2
Looking in indexes: https://artifacts.domain.com/repository/pypi/simple
Collecting Cython
Using cached https://artifacts.domain.com/repository/pypi/packages/cython/0.29.34/Cython-0.29.34-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.34
INFO:clearml_agent.commands.worker:Found task requirements section, trying to install
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151695770 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151735969 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151776174 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151816358 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443"
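
The log above stalls on repeated connection attempts to a single host. A minimal Python sketch, assuming the agent log has been saved as text, that counts the hosts urllib3 tries to reach can make such a stall easy to spot. The function name count_connection_attempts and the regular expression are illustrative and not part of clearml-agent; the pattern is modelled on the "Starting new HTTPS connection" lines shown in this thread.

```python
import re
from collections import Counter

# Matches urllib3 debug lines such as:
#   DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
_CONNECTION_RE = re.compile(
    r"Starting new HTTPS? connection \(\d+\): ([^\s:]+):(\d+)"
)


def count_connection_attempts(log_text):
    """Count how often each (host, port) pair appears in urllib3 connection lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _CONNECTION_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the log in this thread.
    sample = (
        "DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443\n"
        "1684151695770 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443\n"
    )
    for (host, port), attempts in count_connection_attempts(sample).most_common():
        print(f"{host}:{port} attempted {attempts} time(s)")
```

A host that shows up many times with nothing else happening in between, as download.pytorch.org does here, is a candidate for a destination the proxy does not allow.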

Posted 15 days ago

Also, can you try sending a GET request to the server using curl (something like curl None) and share the result?

Posted one month ago
