Using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
` if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run...
I'm running the code shown above now and will let you know if there is still an issue
` if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args) `
I added it to this script and use it as a starting point: https://github.com/facebookresearch/fastMRI/bl...
When using clearml-agent daemon --queue default --docker it is running. In this case I always had issues when adding the --gpus flag.
` 2021-05-06 13:46:34.032391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:a1:00.0 name: NVIDIA Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-05-06 13:46:34.032496: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: ...
tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
Yes, this one is running in a venv and not in Docker, because I had some issues with CUDA and Docker. virtualenv==20.4.6 is in the requirements.txt. I think it broke after installing clearml-serving in the same env.
clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 results in the GPUs not being used because of missing libs.
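To check whether those libs are actually inside the image, something like this can be run in the container (a quick sketch, assuming python3 is available in the image; the library names are the ones from the log above):
` # run inside the container, e.g.
#   docker run --gpus device=0 --rm -it <image> python3 check_cuda_libs.py
import ctypes

# the two libraries TensorFlow complained about above
for lib in ("libcudart.so.11.0", "libcusolver.so.10"):
    try:
        ctypes.CDLL(lib)  # dlopen the shared object, same as TF does
        print(lib, "OK")
    except OSError as err:
        print(lib, "MISSING:", err) `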
I like this approach more, but it still requires resolved environment variables inside the clearml.conf
SuccessfulKoala55 Hey, for us the artifact download URLs, model download URLs, images in plots and debug image URLs are broken. In the linked example I can see a solution for the debug images and potentially the plot images, but I can't find the artifact and model URLs inside ES. Are those URLs maybe stored inside MongoDB? Any idea where to find them?
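In case it helps, this is roughly how I would grep for the stored URLs in MongoDB (a hypothetical sketch using pymongo; the backend database name and the model / task collections with a uri field are assumptions about the deployment, not a confirmed schema):
` from pymongo import MongoClient

# assumption: mongo reachable on localhost:27017, ClearML data in "backend"
client = MongoClient("mongodb://localhost:27017")
db = client["backend"]

# models supposedly keep their download URL in a "uri" field (assumption)
for model in db["model"].find({"uri": {"$regex": "old-fileserver"}}).limit(5):
    print(model["_id"], model["uri"])

# task artifacts supposedly live inside the task documents (assumption)
for task in db["task"].find({"execution.artifacts.uri": {"$regex": "old-fileserver"}}).limit(5):
    print(task["_id"]) `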
Hey AgitatedDove14, I fixed my code issue and am now able to train on multiple GPUs using https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main thread, I now can't see any training plots and probably also won't get the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging. I really enjoy the automatic detection.
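For context, the manual fallback I would like to avoid looks roughly like this (a minimal sketch; it assumes Task.current_task() resolves inside the spawned worker, otherwise the task id would have to be passed in explicitly):
` from clearml import Task

def worker(rank: int, steps: int = 10) -> None:
    # each spawned process picks up the task created in the main process
    task = Task.current_task()
    logger = task.get_logger()
    for step in range(steps):
        loss = 1.0 / (step + 1)  # stand-in for a real training loss
        # manual scalar reporting instead of the automatic framework hooks
        logger.report_scalar(title="loss", series="rank_%d" % rank,
                             value=loss, iteration=step) `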
My code now produces an error inside one of the threads, but that should be an issue on my side. Still, this error inside a child thread was not detected as a failure and the training task ended as "completed". The error now happens with Task.init inside the if __name__ == "__main__": as seen in the code snippet above.
This happens inside the agent, I guess since I use task.execute_remotely(). The agent runs on Ubuntu 18.04 and not in docker mode.
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
Actually I saw that the RuntimeError: context has already been set
appears when the task is initialised outside if __name__ == "__main__":
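A minimal sketch of why the placement matters: with the spawn start method every child re-imports the module, so module-level code runs again in each worker, while the guarded block runs only once in the parent:
` import torch.multiprocessing as mp

def worker(rank: int) -> None:
    print("worker", rank, "started")

if __name__ == "__main__":
    # set_start_method may only be called once per process; if this line
    # (or Task.init) lived at module level, the re-import in each spawned
    # child would execute it again and raise
    # "RuntimeError: context has already been set"
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join() `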
` Process failed, exit code 1
task ab1a90dacb9042eea8e4a6a16640d7f4 pulled from 8f06b6b160c14a3591d791c1885b309e by worker test:gpu1
Running task 'ab1a90dacb9042eea8e4a6a16640d7f4'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.kbkz1n40.txt', '/tmp/.clearml_agent_out.kbkz1n40.txt'
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.3e6l7juj.cfg):
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes ...
AgitatedDove14 I created a new clean venv and freshly installed the clearml-agent
under python / pip 3.8 and now it is working again. Still don't know what caused this issue. Thank you very much for helping!
docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash worked; I could run nvidia-smi in it and see GPU 0.
the error you are citing happens when running clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda
Hi AgitatedDove14, I get an error when running a task on my worker. I have looked into /home/user/.clearml/venvs-builds but it is empty. Any idea why this happens? I actually don’t know what I changed to cause this issue… I’m running clearml-agent v1.0.0
clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.
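To see what is actually failing there, the exact command from the error can be re-run by hand (a sketch; the interpreter and path are copied from the error above):
` import subprocess

# re-run the agent's failing command to capture its full output
result = subprocess.run(
    ["python3.6", "-m", "virtualenv", "/home/user/.clearml/venvs-builds/3.6"],
    capture_output=True, text=True,
)
print("exit code:", result.returncode)
print(result.stdout)
print(result.stderr) `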
One more thing: the dockerized version is still not working as I want it to. If I use any specific docker image like docker: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 on a host machine with NVIDIA-SMI 465.19.01, Driver Version: 465.19.01, CUDA Version: 11.3, I always get a similar error as above where a lib is missing. If I use the example from http://clear.ml, clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda, I always get this error: ` docker: Error...
I think Anna means that if artifacts and models are stored on the clearml fileserver, their paths will contain the IP or domain of the fileserver. If you then move the fileserver to a different host, all the URLs are broken since the host changed.
I already increased the memory to 8GB after reading about similar issues here on the Slack.
Just making sure, how exactly did you do that?
docker-compose down
` elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms8g -Xmx8g `
docker-compose up -d
Yes, this happened when the disk got filled up to 100%
Solving the replica issue now allowed me to get better insights into why the one index is red.
` {
"index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
"shard" : 0,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2021-11-09T22:30:47.018Z",
"last_allocation_status" : "no_valid_shard_copy"
},
"can_allocate" : "no_valid_shard_copy",
"allocate_explanation" : "cannot allocate because a...
Did you wait for all the other indices to reach yellow status?
yes I waited until everything was yellow
Using top inside the elasticsearch container shows elastic+ 20 0 17.0g 8.7g 187584 S 2.3 27.2 1:09.18 java, i.e. the 8g are reserved. So setting ES_JAVA_OPTS: -Xms8g -Xmx8g should work.
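Instead of eyeballing top, the heap ES actually picked up can also be read back via the cat API (a sketch, again assuming localhost:9200):
` import requests

# heap.max should show ~8gb if ES_JAVA_OPTS was applied
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"h": "name,heap.current,heap.max", "v": "true"},
)
print(resp.text) `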