Answered
Need Help ! I Am Able To Train Models From Our Local Machines And Log Everything On The Clearml Server Without Any Issues, The Same Training Gets Stuck When I Use Remote Training. The Logs Do Not Provide Any Useful Information, And The Last Line In The L

Need help! I am able to train models from our local machines and log everything on the ClearML server without any issues, but the same training gets stuck when I use remote training. The logs do not provide any useful information, and the last lines in the logs are as follows:

"Installing collected packages: Cython
Successfully installed Cython-0.29.34"

This happens only when I launch the remote agent with the --gpu 0 parameter. The GPU connected to the machine is an NVIDIA A40, and I can run nvidia-smi and see its output. When I run a script that prints "Hello" without --gpu, it works, and the logs on ClearML also show the connected GPU device. I am running the agent on Ubuntu 22.04 LTS Server without Docker. The machine is completely disconnected from the internet, and we have a proxy for apt and pip packages. Let me know if you need more details.

  
  
Posted one year ago

Answers 13


Hi @<1562973095227035648:profile|ThoughtfulOctopus83> , what version of ClearML server are you running? Also, what versions of clearml and clearml-agent?

  
  
Posted one year ago

@<1562973095227035648:profile|ThoughtfulOctopus83> I assume this machine is also connected to the clearml server?

  
  
Posted one year ago

Yes, the machine is connected to the on-prem ClearML server.

  
  
Posted one year ago

Can you share the complete logs?

  
  
Posted one year ago

Also, can you try sending a GET request to the server using curl? Something like curl None, and share the result?

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> Sorry for the delayed reply. I have attached the logs. The issue only happens when I do ML training with PyTorch. Training with other frameworks, such as TensorFlow and sklearn, works fine.

  
  
Posted one year ago

@<1562973095227035648:profile|ThoughtfulOctopus83> is this the end of the log? Nothing after it? How exactly are you launching the agent?

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> Yes, this is the end of the logs and nothing happens after it. I am using this command to launch the agent: clearml-agent daemon --detached --gpu 0 --queue A40

  
  
Posted one year ago

Can you add --debug ?

  
  
Posted one year ago

Also use --foreground instead of the --detached option, to debug it

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> After enabling debug mode, the logs are below. Just to let you know, this agent does not have internet access and pip packages are installed via a proxy, which works, but for PyTorch it seems to be going to the internet:
"DEBUG:urllib3.connectionpool: http://api.clearml.domain.com:80 "GET /v2.5/tasks.started HTTP/1.1" 200 353
Executing task id [d3807deae2644e00824e774ff8997eaa]:
repository =
branch =
version_num =
tag =
docker_cmd =
entry_point = pytorch.py
working_dir = .

DEBUG:clearml_agent.commands.worker:Searching for python3.7
DEBUG:clearml_agent.commands.worker:Searching for python3
DEBUG:clearml_agent.commands.worker:Searching for python
WARNING:clearml_agent.commands.worker:Python executable with version '3.7' requested by the Task, not found in path, using '/usr/bin/python3' (v3.10.6) instead
NoneType: None
created virtual environment CPython3.10.6.final.0-64 in 134ms
creator CPython3Posix(dest=/home/adminvj/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/adminvj/.local/share/virtualenv)
added seed packages: pip==23.1, setuptools==67.6.1, wheel==0.40.0
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

INFO:clearml_agent.commands.worker:found literal script in script.diff
DEBUG:clearml_agent.commands.worker:selected execution directory: /home/adminvj/.clearml/venvs-builds/3.10/code

Looking in indexes: https://artifacts.domain.com/repository/pypi/simple
Ignoring pip: markers 'python_version < "3.10"' don't match your environment
Collecting pip<22.3
Using cached https://artifacts.domain.com/repository/pypi/packages/pip/22.2.2/pip-22.2.2-py3-none-any.whl (2.0 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 23.1
Uninstalling pip-23.1:
Successfully uninstalled pip-23.1
Successfully installed pip-22.2.2
Looking in indexes: https://artifacts.domain.com/repository/pypi/simple
Collecting Cython
Using cached https://artifacts.domain.com/repository/pypi/packages/cython/0.29.34/Cython-0.29.34-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.34
INFO:clearml_agent.commands.worker:Found task requirements section, trying to install
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151695770 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151735969 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151776174 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443

1684151816358 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443"

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> Any idea why it goes to the internet (download.pytorch.org) only when I run training with the PyTorch framework?

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> It works once I allow traffic to download.pytorch.org through the proxy. 🙂
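
For anyone hitting a similar hang on an offline machine: the repeated "Starting new HTTPS connection" lines in the debug log above were the clue. Below is a minimal Python sketch, not part of ClearML, that scans an agent debug log for hosts the agent keeps retrying, so you know what to allow through the proxy. The function names and the min_attempts threshold are hypothetical, and it assumes the log contains urllib3 debug lines in the format shown in this thread.

```python
import re
from collections import Counter

# Matches urllib3 debug lines such as:
#   DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
_CONN_RE = re.compile(r"Starting new HTTPS? connection \(\d+\): ([\w.-]+):(\d+)")


def count_connection_attempts(log_text):
    """Count 'Starting new HTTP(S) connection' lines per (host, port)."""
    return Counter((host, int(port)) for host, port in _CONN_RE.findall(log_text))


def repeatedly_contacted_hosts(log_text, min_attempts=3):
    """Return hosts with at least min_attempts new-connection lines.

    min_attempts is an arbitrary, hypothetical threshold: many new
    connections to the same host with no progress in between suggests
    the host is being retried, for example because a proxy blocks it.
    """
    counts = count_connection_attempts(log_text)
    return sorted({host for (host, _port), n in counts.items() if n >= min_attempts})


if __name__ == "__main__":
    # Excerpt from the log posted in this thread.
    sample = (
        "DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443\n"
        "1684151695770 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443\n"
        "1684151735969 worker:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443\n"
    )
    print(repeatedly_contacted_hosts(sample))
```

Run against the full log in this thread, it should flag download.pytorch.org, the host that had to be allowed through the proxy.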

  
  
Posted one year ago