An upload of 11 GB took around 20 hours, which cannot be right. Do you have any idea whether ClearML could have something to do with this slow upload speed? If not, I am going to start debugging the hardware/network.
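If it helps to narrow this down, here is a rough sketch of timing a raw upload through ClearML's StorageManager so it can be compared against a plain scp/rsync of the same file (the file path and destination URL below are placeholders, not values from this thread):
` # Sketch: time a single large upload through ClearML's storage layer only,
# to separate ClearML overhead from raw network/disk throughput.
import time
from clearml import StorageManager

local_file = "/data/big_blob.bin"                 # placeholder for the 11 GB file
remote_url = "s3://my-bucket/debug/big_blob.bin"  # placeholder destination (or fileserver URL)

start = time.time()
StorageManager.upload_file(local_file, remote_url)
print(f"upload took {time.time() - start:.0f} s") `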
My agent shows the same as before:
` ...
Environment setup completed successfully
Starting Task Execution:
DONE: Running task 'aff7c6605b7243d38968f95b4351b127', exit status 0 `
It could be that the clearml-server misbehaves either while cleanup is ongoing or even afterwards.
Yes, I did not change this part of the config.
Hey, that is unfortunately not possible as there are multiple projects from different users.
pytorch.tensorboard is the same as tensorboardX: https://github.com/pytorch/pytorch/blob/6d45d7a6c331ddb856ac34a76bcd3613aa05185b/torch/utils/tensorboard/summary.py#L461
Yea, tensorboardX is using moviepy.
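For context, a minimal sketch of what pulls in that dependency: logging a video via torch.utils.tensorboard goes through its video summary helper, which imports moviepy just like tensorboardX does (the log dir and tensor shape below are only examples):
` # Minimal sketch: add_video() in torch.utils.tensorboard needs moviepy installed.
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/video_demo")  # example log dir
# (N, T, C, H, W) uint8 tensor: 1 clip, 16 frames, 3 channels, 64x64 pixels
frames = torch.randint(0, 255, (1, 16, 3, 64, 64), dtype=torch.uint8)
writer.add_video("rollout", frames, global_step=0, fps=4)
writer.close() `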
Interesting. It will probably only matter for very small experiments or experiments where validation is run very infrequently.
# Python 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
aiostream==0.4.2
attrs==20.3.0
clearml==0.17.4
dm-control==0.0.355168290
dm-env==1.4
furl==2.1.0
future==0.18.2
glfw==2.1.0
gym==0.18.0
humanfriendly==9.1
imageio-ffmpeg==0.4.3
jsonschema==3.2.0
labmaze==1.0.3
lxml==4.6.2
moviepy==1.0.3
orderedmultidict==1.0.1
pathlib2==2.3.5
pillow==7.2.0
proglog==0.1.9
psutil==5.8.0
pybullet==3.0.9
pygame==2.0.1
pyglet==1.5.0
pyjwt==2.0.1
pyrsistent==0.17.3
requests-file==1.5.1
tensorboard...
How can I get the agent log?
What exactly do you mean by docker run permissions?
Thanks for the answer. So currently the cleanup is done based on the number of experiments that are cached? If I have a few big experiments, this could make my agents' cache overflow?
In the WebUI it just shows that an error happened after the loading bar has been running for a while.
I tried to delete the same tasks again, and this time it instantly confirmed the deletion and the tasks are gone.
Alright, thanks. Would be a nice feature 🙂
Here is a part of the cleanup service log. Unfortunately, I cannot even download the full log currently, because the clearml-server will just throw errors for everything.
I was wrong: I think it uses the agent.cuda_version, not the local env CUDA version.
Now the pip packages seem to ship with CUDA, so this does not seem to be a problem anymore.
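A quick way to confirm that (just a sketch): the CUDA runtime reported by torch comes from the pip wheel itself, independent of whatever is installed locally:
` # Sketch: check which CUDA runtime the installed torch wheel bundles.
import torch

print(torch.__version__)         # e.g. "1.8.1+cu111" - the +cuXXX suffix is the bundled CUDA
print(torch.version.cuda)        # CUDA version the wheel was built against
print(torch.cuda.is_available()) `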
Yea, is there a guarantee that the clearml-agent will not crash because it did not clean the cache in time?
For everyone who had the patience to read through everything, here is my solution to make clearml work with ssh-agent forwarding in the current version:
1. Start an ssh-agent.
2. Add the ssh keys to the agent with ssh-add.
3. echo $SSH_AUTH_SOCK and paste the value into clearml.conf as described here: https://github.com/allegroai/clearml-agent/issues/45#issuecomment-779302144 (replace $SSH_AUTH_SOCKET with the actual value).
4. Move all the files except known_hosts out of ~/.ssh on the clearml-agent workstation.
5. Start the...
Okay, I will increase it and try again.
Seems like some experiments cannot be deleted
Or better, some cache option. Otherwise the cron job is what I will use 🙂 Thanks again
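For example, a cron-driven check could look roughly like this (a sketch: the default ~/.clearml cache location, the 50 GB threshold, and the purged subdirectory are assumptions):
` # Sketch of a cron-driven cache check; paths and threshold are assumptions.
import shutil
from pathlib import Path

CACHE_DIR = Path.home() / ".clearml"   # assumed default clearml-agent cache location
LIMIT_GB = 50                          # made-up threshold

size_gb = sum(f.stat().st_size for f in CACHE_DIR.rglob("*") if f.is_file()) / 1e9
print(f"clearml agent cache: {size_gb:.1f} GB")
if size_gb > LIMIT_GB:
    # e.g. drop the pip download cache first; adjust to whichever subdirs grow largest
    shutil.rmtree(CACHE_DIR / "pip-download-cache", ignore_errors=True) `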
` apiserver:
command:
- apiserver
container_name: clearml-apiserver
image: allegroai/clearml:latest
restart: unless-stopped
volumes:
- /opt/clearml/logs:/var/log/clearml
- /opt/clearml/config:/opt/clearml/config
- /opt/clearml/data/fileserver:/mnt/fileserver
depends_on:
- redis
- mongo
- elasticsearch
- fileserver
- fileserver_datasets
environment:
CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
CLEARML_...
I ran `docker run -it -v /home/hostuser/.ssh/:/root/.ssh ubuntu:18.04` but cloning does not work, and this is what `ls -lah /root/.ssh` gives inside the docker container:
` -rw------- 1 1001 1001 1.5K Apr 8 12:28 authorized_keys
-rw-rw-r-- 1 1001 1001 208 Apr 29 09:15 config
-rw------- 1 1001 1001 432 Apr 8 12:53 id_ed25519
-rw-r--r-- 1 1001 1001 119 Apr 8 12:53 id_ed25519.pub
-rw------- 1 1001 1001 432 Apr 29 09:16 id_gitlab
-rw-r--r-- 1 1001 1001 119 Apr 29 09:25 id_gitlab.pub
-...
Alright, thank you. I will try to debug further