Hi, I started my agent using clearml-agent daemon --gpus 0 --queue gpu --docker --foreground, with the following parameters in clearml.conf.

Answered

Hi,

I started my agent using
clearml-agent daemon --gpus 0 --queue gpu --docker --foreground
with the following parameters in clearml.conf:
` default_docker: {
    # default docker image to use when running in docker mode
    image: "dockerrepo/mydocker:custom"

    # optional arguments to pass to docker image
    # arguments: ["--ipc=host", ]
    arguments: ["--env GIT_SSL_NO_VERIFY=true",]
} `

Then this is shown while waiting for tasks.
` Worker "master-node:gpu0" - Listening to queues:
+----------------------------------+------+-------+
| id                               | name | tags  |
+----------------------------------+------+-------+
| 943fce37803044ef89f6d9af0cd5279c | gpu  |       |
+----------------------------------+------+-------+

Running in Docker mode (v19.03 and above) - using default docker image: dockerrepo/mydocker:custom running python3 `

So far so good, except that when a task is pulled, I get the output below. Notice that the docker image is first reverted to nvidia/cuda:10.1-runtime-ubuntu18.04, and there is no indication that the --env argument is passed on.
` task 228caa5d25d94ac5aa10fa7e1d02f03c pulled from 943fce37803044ef89f6d9af0cd5279c by worker master-node:gpu0
Running task '228caa5d25d94ac5aa10fa7e1d02f03c'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.xmqr15w5.txt', '/tmp/.clearml_agent_out.xmqr15w5.txt'
Running Task 228caa5d25d94ac5aa10fa7e1d02f03c inside docker: nvidia/cuda:10.1-runtime-ubuntu18.04
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'CLEARML_WORKER_ID=master-node:gpu0', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04', '-v', '/home/jax/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.txivbuei.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.04t66_qn:/root/.ssh', '-v', '/home/jax/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/jax/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/jax/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/jax/.clearml/cache:/clearml_agent_cache', '-v', '/home/jax/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 228caa5d25d94ac5aa10fa7e1d02f03c'] `

  				
Posted 4 years ago by SubstantialElk6


Answers 29

Running git diff in my terminal in this repo gave nothing. Nothing at all.

  				
Posted 4 years ago by SubstantialElk6

In the task you cloned, do you have torch as part of the requirements?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

You need to run it, but not actually execute it locally. You can execute it on the ClearML agent with task.execute_remotely(queue_name='YOUR QUEUE NAME', exit_process=True).

With this, the task won't actually run on your local machine; it is just registered in the ClearML app and will run on the ClearML agent listening to 'YOUR QUEUE NAME'.
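
A minimal sketch of that flow, assuming the clearml package is installed and configured on the machine that submits the task. The helper name run_on_agent and its task_cls parameter are hypothetical (task_cls only exists so the sketch can be exercised without a ClearML server); Task.init and execute_remotely are used as described in this thread.

```python
def run_on_agent(queue_name, project_name, task_name, task_cls=None):
    """Register the task from the local machine, then hand execution to an agent."""
    if task_cls is None:
        # Imported lazily so the sketch can be read without a ClearML install.
        from clearml import Task as task_cls

    # Task.init registers the task (repository, uncommitted changes,
    # detected package requirements) in the ClearML app.
    task = task_cls.init(project_name=project_name, task_name=task_name)

    # Enqueue the task for an agent. With exit_process=True the local
    # process stops here, so code after this call only runs on the agent.
    task.execute_remotely(queue_name=queue_name, exit_process=True)
    return task


# Example call at the top of a training script (values are placeholders):
# run_on_agent(queue_name="YOUR QUEUE NAME",
#              project_name="YOUR PROJECT NAME",
#              task_name="YOUR TASK NAME")
```

Placing the call before any heavy imports or training code keeps the local run short, which matters when the local machine has no GPU.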

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Hi,
It did, nvidia/cuda:10.1-runtime-ubuntu18.04.

So if I need to set this every time, what is the following config for? And how do I pass in new env parameters?
` default_docker: {
    # default docker image to use when running in docker mode
    image: "dockerrepo/mydocker:custom"

    # optional arguments to pass to docker image
    # arguments: ["--ipc=host", ]
    arguments: ["--env GIT_SSL_NO_VERIFY=true",]
} `

  				
Posted 4 years ago by SubstantialElk6

Note that the latest code has been pushed to the GitLab repo.

  				
Posted 4 years ago by SubstantialElk6

Yes, as listed in the snippet. The torch library is torchvision.

  				
Posted 4 years ago by SubstantialElk6

So according to it, you are using the repo requirements, and you have torch there?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Yes, of course. It's a long one.

  				
Posted 4 years ago by SubstantialElk6

After the task is cloned, it is in a draft state. In this state every field is editable, so you can just double-click the BASE DOCKER IMAGE section and change it to your image. If you just delete the value from this section, the ClearML agent will use the docker image you configured in the clearml.conf file (dockerrepo/mydocker:custom).
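
The same clone, edit, and enqueue steps can probably be scripted instead of done in the UI. The sketch below assumes the clearml package is installed and that Task.get_task, Task.clone, Task.enqueue and set_base_docker behave as in the ClearML SDK; the helper name clone_and_enqueue and the task_cls parameter are hypothetical and exist only for illustration and testing.

```python
def clone_and_enqueue(source_task_id, queue_name, docker_cmd=None, task_cls=None):
    """Clone an existing task, optionally override its base docker image, and enqueue it."""
    if task_cls is None:
        # Imported lazily so the sketch can be read without a ClearML install.
        from clearml import Task as task_cls

    source = task_cls.get_task(task_id=source_task_id)

    # The clone starts in draft state, so its fields are still editable.
    cloned = task_cls.clone(source_task=source, name=f"{source.name} (clone)")

    # Leaving docker_cmd empty keeps whatever the cloned task already has.
    if docker_cmd:
        cloned.set_base_docker(docker_cmd)

    task_cls.enqueue(cloned, queue_name=queue_name)
    return cloned


# Example (the task ID is a placeholder):
# clone_and_enqueue("<TASK ID>", "gpu",
#                   "dockerrepo/mydocker:custom --env GIT_SSL_NO_VERIFY=true")
```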

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

And how do i pass in new env parameters?

If you don’t set a value in the task for BASE DOCKER IMAGE, it will use the default. If you are setting the BASE DOCKER IMAGE, add the env vars to it too:

dockerrepo/mydocker:custom --env GIT_SSL_NO_VERIFY=true

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Thanks. That's easy to miss, as it's not quite apparent in the main docs. How should I pass in env variables with Task?

  				
Posted 4 years ago by SubstantialElk6

where task is the value returned from your Task.init call:

task = Task.init(project_name=<YOUR PROJECT NAME>, task_name=<YOUR TASK NAME>)

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

They should be copied, I just want to verify they are.

If so, can you send the logs of the failed task?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Thanks. Going to try that out. But I hit another snag. Strangely, the agent is not creating the right venv. This is what the agent created:
` pip:

asn1crypto==0.24.0
attrs==20.3.0
certifi==2020.12.5
chardet==4.0.0
cryptography==2.1.4
Cython==0.29.22
furl==2.1.0
future==0.18.2
humanfriendly==9.1
idna==2.6
importlib-metadata==3.7.0
jsonschema==3.2.0
keyring==10.6.0
keyrings.alt==3.0
orderedmultidict==1.0.1
pathlib2==2.3.5
psutil==5.8.0
pycrypto==2.6.1
pygobject==3.26.1
pyhocon==0.3.57
PyJWT==1.7.1
pyparsing==2.4.7
pyrsistent==0.17.3
python-dateutil==2.8.1
pyxdg==0.25
PyYAML==5.3.1
requests==2.25.1
requests-file==1.5.1
SecretStorage==2.3.1
six==1.11.0
tqdm==4.54.1
typing==3.7.4.3
typing-extensions==3.7.4.3
urllib3==1.26.3
virtualenv==16.7.10
zipp==3.4.0 `

But this is my requirements.txt:
` attrs==20.3.0
boto3==1.17.17
botocore==1.20.17
certifi==2020.12.5
chardet==4.0.0
clearml==0.17.4
furl==2.1.0
future==0.18.2
humanfriendly==9.1
idna==2.10
jmespath==0.10.0
jsonschema==3.2.0
numpy
orderedmultidict==1.0.1
pathlib2==2.3.5
Pillow==8.1.0
psutil==5.8.0
PyJWT==2.0.1
pyparsing==2.4.7
pyrsistent==0.17.3
python-dateutil==2.8.1
PyYAML==5.4.1
requests==2.25.1
requests-file==1.5.1
s3transfer==0.3.4
six==1.15.0
torch==1.7.1
torchvision==0.8.2
typing-extensions==3.7.4.3
urllib3==1.26.3 `

In particular, I am getting an error as follows:
` Environment setup completed successfully

Starting Task Execution:

Traceback (most recent call last):
  File "pytorch_mnist.py", line 8, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
DONE: Running task '3a90802d1dfa4ec09fbccba0beffbaa8', exit status 1 `
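
To spot such a mismatch quickly, the two package lists can be diffed by name. This is a small standalone sketch, not part of ClearML; the function names are made up for illustration, and the sample strings are short excerpts of the two lists above.

```python
import re


def package_names(freeze_text):
    """Extract lower-cased package names from pip-freeze / requirements.txt style text."""
    names = set()
    for line in freeze_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and section headers such as "pip:".
        if not line or line.startswith("#") or line.endswith(":"):
            continue
        # The name is everything before the first version specifier or marker.
        name = re.split(r"[=<>!~\[; ]", line, maxsplit=1)[0]
        names.add(name.lower().replace("_", "-"))
    return names


def missing_packages(required_text, installed_text):
    """Return the required package names that are absent from the installed list."""
    return sorted(package_names(required_text) - package_names(installed_text))


required = "attrs==20.3.0\nnumpy\ntorch==1.7.1\ntorchvision==0.8.2\n"
installed = "pip:\n\nattrs==20.3.0\nsix==1.11.0\n"
print(missing_packages(required, installed))  # ['numpy', 'torch', 'torchvision']
```

The comparison is by name only, so version differences such as six==1.11.0 versus six==1.15.0 are not reported.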

  				
Posted 4 years ago by SubstantialElk6

Sorry, I don't quite understand this. The task itself was submitted when I ran the code on the client. I assumed the dependency requirements would be copied over when the experiment is cloned?

  				
Posted 4 years ago by SubstantialElk6

Hi, the problem is the same.

I noticed that it's not checking out the latest version in GitLab. This latest version would contain the requirements.txt.
Using cached repository in "/root/.clearml/vcs-cache/pytorchmnist.f220373e7227ec760b28c7f4cd99b534/pytorchmnist"
warning: redirecting to
Note: checking out 'cfb833bcc70f3e10d3b6a96cfad3225ed682382b'.

But I'm guessing the block below applied the diff. Does it include the requirements.txt though?
HEAD is now at cfb833b Upload New File
type: git
url:
branch: HEAD
commit: cfb833bcc70f3e10d3b6a96cfad3225ed682382b
root: /root/.clearml/venvs-builds/3.6/task_repository/pytorchmnist
Applying uncommitted changes
Executing: ('git', 'apply', '--unidiff-zero'): b"<stdin>:11: trailing whitespace.\n task = Task.init(project_name='MNIST', \n<stdin>:12: trailing whitespace.\n task_name='Pytorch Standard', \nwarning: 2 lines add whitespace errors.\n"

  				
Posted 4 years ago by SubstantialElk6

OK, I think I missed something along the way then.

You must have had some diffs, because of this part of the log:

Applying uncommitted changes
Executing: ('git', 'apply', '--unidiff-zero'): b"<stdin>:11: trailing whitespace.\n task = Task.init(project_name='MNIST', \n<stdin>:12: trailing whitespace.\n task_name='Pytorch Standard', \nwarning: 2 lines add whitespace errors.\n"
Can you re-run this task from your local machine again? You shouldn’t have anything under UNCOMMITTED CHANGES this time (as we just saw with the empty git diff from bash). But first, please verify that the repo has torch in its requirements.txt file.

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

OK, that worked. So every time I change the code, I will have to rerun the experiment on my own machine, which doesn't have any GPUs?

That kind of defeats the purpose of using ClearML Agent.

  				
Posted 4 years ago by SubstantialElk6

According to this part:

Applying uncommitted changes
Executing: ('git', 'apply', '--unidiff-zero'): b"<stdin>:11: trailing whitespace.\n task = Task.init(project_name='MNIST', \n<stdin>:12: trailing whitespace.\n task_name='Pytorch Standard', \nwarning: 2 lines add whitespace errors.\n"
I don’t see the requirements change. Let’s try without the cache. Can you clear it (the ClearML cache dir is located at ~/.clearml)?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

The next step is to figure out whether I can do all that in Python code instead of the UI.

  				
Posted 4 years ago by SubstantialElk6

Ok thanks, that worked.

  				
Posted 4 years ago by SubstantialElk6

Checking that

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Hi SubstantialElk6 , does the task have a docker image too (you can check it in the UI)?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Hi SubstantialElk6, can you update your ClearML agent to the latest (0.17.2rc4) and try with it?
pip install clearml-agent==0.17.2rc4

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

When you run git diff in your terminal on this git repo, do you get the requirements changes too? Or the same as in the following?
Applying uncommitted changes
Executing: ('git', 'apply', '--unidiff-zero'): b"<stdin>:11: trailing whitespace.\n task = Task.init(project_name='MNIST', \n<stdin>:12: trailing whitespace.\n task_name='Pytorch Standard', \nwarning: 2 lines add whitespace errors.\n"

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Are you referring to the docker image? The same as before, with task.set_base_docker("dockerrepo/mydocker:custom --env GIT_SSL_NO_VERIFY=true")

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

That didn't work either...

  				
Posted 4 years ago by SubstantialElk6

I can help you with that 🙂

task.set_base_docker("dockerrepo/mydocker:custom --env GIT_SSL_NO_VERIFY=true")
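
When several environment variables are involved, building that string by hand gets error-prone. A small helper along these lines may help; base_docker_cmd is a made-up name, and the only assumption about ClearML is that set_base_docker accepts the image followed by docker arguments in one string, as shown above.

```python
def base_docker_cmd(image, env=None, extra_args=()):
    """Build the '<image> <docker args>' string passed to task.set_base_docker."""
    parts = [image]
    # One --env KEY=VALUE pair per environment variable, in insertion order.
    for key, value in (env or {}).items():
        parts.append(f"--env {key}={value}")
    # Any other docker run arguments, e.g. "--ipc=host".
    parts.extend(extra_args)
    return " ".join(parts)


cmd = base_docker_cmd("dockerrepo/mydocker:custom", {"GIT_SSL_NO_VERIFY": "true"})
print(cmd)  # dockerrepo/mydocker:custom --env GIT_SSL_NO_VERIFY=true

# With a task returned from Task.init:
# task.set_base_docker(cmd)
```

Values containing spaces would need quoting, which this sketch does not handle.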

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Ok that works. thanks.

  				
Posted 4 years ago by SubstantialElk6
