`
apiserver:
  command:
  - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  restart: unless-stopped
  volumes:
  - /opt/clearml/logs:/var/log/clearml
  - /opt/clearml/config:/opt/clearml/config
  - /opt/clearml/data/fileserver:/mnt/fileserver
  depends_on:
  - redis
  - mongo
  - elasticsearch
  - fileserver
  - fileserver_datasets
  environment:
    CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
    CLEARML_...
I was wondering whether some solution is built into ClearML, so I do not have to configure each server manually. However, from your answer I gather that this is not the case.
Yea, something like this seems to be the best solution.
When the task is aborted, the logs will show up, but the scalar logs never appear. The scalar logs only appear when the task finishes.
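A possible workaround (only a sketch, not confirmed to fix the buffering described above) is to flush the logger periodically so scalars reach the server before the task ends or is aborted; the project/task names, metric, and flush interval below are placeholders:

```python
from clearml import Task

# sketch: report a scalar each step and flush the logger periodically,
# so buffered scalars are pushed to the server even if the task is later aborted
task = Task.init(project_name="examples", task_name="scalar-flush-sketch")
logger = task.get_logger()

for step in range(1000):
    loss = 1.0 / (step + 1)  # placeholder metric
    logger.report_scalar(title="loss", series="train", value=loss, iteration=step)
    if step % 100 == 0:
        logger.flush()  # push any buffered reports now instead of waiting for task end
```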
An upload of 11 GB took around 20 hours, which cannot be right. Do you have any idea whether ClearML could have something to do with this slow upload speed? If not, I am going to start debugging the hardware/network.
Yea, and the script ends with `clearml.Task - INFO - Waiting to finish uploads`
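One way to tell whether ClearML's uploader or the network is the bottleneck is to time a single large upload through StorageManager directly, outside of the dataset flow. This is only a sketch; the local path and destination URL are placeholders for your own fileserver or bucket:

```python
import time
from clearml import StorageManager

local_file = "/path/to/some_large_file.bin"  # placeholder: any large local file
remote_url = "s3://my-bucket/debug/"         # placeholder: your fileserver or bucket URL

# time one isolated upload to measure raw throughput
start = time.time()
uploaded = StorageManager.upload_file(local_file=local_file, remote_url=remote_url)
elapsed = time.time() - start
print(f"uploaded to {uploaded} in {elapsed:.1f} seconds")
```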
So clearml 1.0.1, clearml-agent 1.0.0, and clearml-server from master.
What exactly does this mean? The environment is set after the script is started?
I forgot to add this:
Here is my error:
`
Traceback (most recent call last):
  File "src/run_gym.py", line 25, in <module>
    print(os.environ["MUJOCO_GL"])
  File "/home/tim/.clearml/venvs-builds/3.7/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'MUJOCO_GL'
`
This is at the top of my script.
So the environment variables are not set by the clearml-agent, but by clearml itself
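A defensive workaround on the script side is to provide a fallback before anything reads the variable, so the task still runs when the agent does not inject it. A minimal sketch; the "egl" default is an assumption about the desired MuJoCo backend:

```python
import os

# sketch: fall back to a default renderer if MUJOCO_GL was not injected into the environment
os.environ.setdefault("MUJOCO_GL", "egl")  # "egl" is an assumed default, adjust as needed
print(os.environ["MUJOCO_GL"])
```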
Do you know how I can make sure I do not have CUDA installed, or a broken installation of it?
The problem is that ClearML installs `cudatoolkit=11.0`, but `cudatoolkit=11.1` is needed. By setting `agent.cuda_version=11.1` in `clearml.conf` it uses the correct version and installs fine. With version 11.0, conda will resolve the conflicts by installing the CPU version of PyTorch.
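For reference, the setting lives in the agent section of clearml.conf; a sketch only (the exact value format, e.g. 11.1 vs 111, may depend on the agent version):

```
agent {
    # force the package resolver to target CUDA 11.1 instead of the auto-detected version
    cuda_version: "11.1"
}
```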
@<1576381444509405184:profile|ManiacalLizard2> I'll check again 🙂 thanks
For now I can tell you that with `conda_freeze: true` it fails, but with `conda_freeze: false` it works!
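For reference, the corresponding entry sits under the package manager section of clearml.conf; a sketch (section nesting may vary between agent versions):

```
agent {
    package_manager: {
        # true: pin the full conda environment; false: let the agent resolve requirements itself
        conda_freeze: false
    }
}
```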
I have no idea myself, but what the serverfault thread says about man-in-the-middle attacks makes sense. However, this also rules out an automatic solution, except for a shared known_hosts file, I guess.
I created a GitHub issue because the problem with the slow deletion still exists: https://github.com/allegroai/clearml/issues/586#issue-1142916619
Hi @<1523701070390366208:profile|CostlyOstrich36>
Thanks a lot. It seems there is no such option in the clearml-data CLI, right?
Also, is `max_workers` about compression threads, upload threads, or both?
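For context, this is roughly what the equivalent SDK call looks like; a sketch only, and whether `max_workers` is accepted (and what exactly it controls) depends on the clearml version:

```python
from clearml import Dataset

# sketch: create and upload a dataset via the SDK instead of the clearml-data CLI
ds = Dataset.create(dataset_project="examples", dataset_name="my-dataset")
ds.add_files(path="/path/to/data")            # placeholder path
ds.upload(show_progress=True, max_workers=8)  # max_workers assumed available in this SDK version
ds.finalize()
```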
I have to correct myself: I do not even have CUDA installed. Only the driver is installed, and everything CUDA-related is provided by the docker container. This works with a container that has CUDA 11.4, but now I have one with 11.6 (the latest NVIDIA PyTorch docker image).
However, even after changing the clearml.conf and overriding with CUDA_VERSION, the clearml-agent still prints `agent.cuda_version = 114` inside the docker container! (Other changes to the clearml.conf on the agent are reflected in the docker, so only...
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'