clearml will register conda packages that cannot be installed if clearml-agent is configured to use pip. So although it is nice that a complete package list is tracked, it makes it cumbersome to rerun the experiment.
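The agent-side setting this refers to lives in clearml.conf; switching it from the default pip to conda is one way around the mismatch (just a sketch of the relevant key, not a recommendation from this thread):

```
# clearml.conf on the agent machine
agent {
    package_manager {
        type: pip   # or "conda" if the recorded packages come from a conda env
    }
}
```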
torch.utils.tensorboard is essentially the same as tensorboardX: https://github.com/pytorch/pytorch/blob/6d45d7a6c331ddb856ac34a76bcd3613aa05185b/torch/utils/tensorboard/summary.py#L461
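So scalars written with the built-in writer should be picked up by ClearML's automatic TensorBoard capture just like tensorboardX ones. A minimal sketch (project/task names are placeholders):

```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Task.init enables ClearML's automatic TensorBoard logging
task = Task.init(project_name="examples", task_name="tb-test")

writer = SummaryWriter()  # torch's built-in writer, API-compatible with tensorboardX
writer.add_scalar("loss", 0.42, global_step=0)
writer.close()
```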
To summarize: the scheduler should assign tasks to the agent from the highest-priority queue first.
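With the open-source agent, this priority is expressed by the order of the queues passed on the command line (queue names here are just placeholders):

```bash
# The agent polls "urgent" first and only falls back to "default" when "urgent" is empty
clearml-agent daemon --queue urgent default
```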
I used the wrong docker container. The container I used came with CUDA 11.4. Interestingly, the override from clearml.conf and the CUDA_VERSION environment variable did not work there.
With the correct docker container everything works fine. Shame on me.
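For reference, this is roughly what such an override looks like (a sketch, not the exact config from my setup; the version string is just an example):

```
# clearml.conf (agent section)
agent {
    cuda_version: "11.1"   # force the CUDA version instead of auto-detection
}

# or as an environment variable for the agent process
export CUDA_VERSION=11.1
```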
```yaml
  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
      - fileserver_datasets
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_...
```
I am not sure what happened, but my experiments are gone. However, the data directory is still populated.
I got the error again. It seems to happen only when I try to delete "large" experiments.
I guess it started when I began using the cleanup_service.
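For context, the cleanup service essentially iterates over old tasks and deletes them in bulk; a rough sketch of that kind of loop (project name and filter are placeholders, not the actual service code):

```python
from clearml import Task

# Fetch completed tasks from a project (filter key is an assumption on my part)
tasks = Task.get_tasks(
    project_name="examples",
    task_filter={"status": ["completed"]},
)

for task in tasks:
    # delete() also removes the task's artifacts and models
    task.delete(delete_artifacts_and_models=True)
```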
So I just tried again, but this time deleting manually via the Web UI.