I got some warnings about broken packages. I cleaned the conda cache with `conda clean -a` and now it installed fine!
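For reference, the cleanup that did the trick, as a minimal sketch (the final install line is just a placeholder for whatever failed before):
```bash
# clear conda's index cache, package tarballs and unused packages
conda clean --all --yes
# then retry the install that previously reported broken packages
conda install <package-name>
```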
Thank you very much, good to know!
For example, I get the following error if I simply clone and rerun:
ERROR: Could not find a version that satisfies the requirement ruamel_yaml_conda>=0.11.14 (from conda==4.10.1->-r /tmp/cached-reqs6wtc73be.txt (line 28)) (from versions: none)
ERROR: No matching distribution found for ruamel_yaml_conda>=0.11.14 (from conda==4.10.1->-r /tmp/cached-reqs6wtc73be.txt (line 28))
I am referring to the UI. The default cleanup service should work with S3 with a correctly configured clearml service agent if I understand the workings correctly.
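To be clear what I mean by "correctly configured": the agent running the cleanup service needs S3 credentials in its clearml.conf, roughly like this sketch (all values are placeholders; adjust to your bucket setup):
```
sdk {
    aws {
        s3 {
            # placeholder credentials for the bucket that holds the artifacts
            key: "<access-key>"
            secret: "<secret-key>"
            region: "<region>"
        }
    }
}
```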
I usually also experience no problems with restarting the clearml-server. It seems like it has to do with the OOM (or whatever issue I have).
Tested with clearml-agent 1.0.1rc4/1.2.2 and clearml 1.3.2
It shows some logs, but nothing of relevance, I think. Only infos and warnings about deprecated stuff that is still used ;D ...
Thank you. The reports feature is super cool! Greetings to the team. One of the best features for educational use!
Could be that the log is clean because of the restart. Unfortunately, I restarted the server right away 😞 I'll post the appropriate logs if it happens again.
But yeah, I see the point of enterprise having this feature and basic not 🙂
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so the server only slows down but does not kill anything. Unfortunately, the kernel logs also do not show much (maybe I have my server logs misconfigured, I am no expert).
What is interesting though is that docker only showed my nginx, minio and docker-registry containers as exited, while all the clearml containers were still running. I restarted ...
For example: I run a task remotely. Now I decide I want to rerun it, but with a slightly changed parameter. So I clone the task, edit the parameter in the WebUI, and submit the clone to a queue. When the clearml-agent pulls the task and tries to install the requirements, it fails, because the task requirements now contain packages that were preinstalled in the environment (e.g. in the nvidia docker image). These packages may not be available via pip, so the run fails.
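For reference, this is the clone-and-edit flow I mean, done via the SDK instead of the WebUI; just a sketch of the workflow, not a fix (task ID, parameter name and queue name are placeholders):
```python
from clearml import Task

# get the original (remotely executed) task - ID is a placeholder
original = Task.get_task(task_id="<original-task-id>")

# clone it and tweak one hyperparameter before enqueueing
cloned = Task.clone(source_task=original, name="clone with changed parameter")
cloned.set_parameter("General/learning_rate", 0.001)  # example parameter name

# submit the clone; the agent will then try to reinstall the captured requirements
Task.enqueue(cloned, queue_name="default")
```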
I see, so it is actually not related to clearml 🎉
I see, I just checked the logs and it shows urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f246f0d6c18>: Failed to establish a new connection: [Errno 111] Connection refused [2022-04-29 08:45:55,018] [9] [WARNING] [elasticsearch] POST [status:N/A request:0.000s]
Unfortunately, there are no logs in /usr/share/elasticsearch/logs to see what elastic was up to.
AnxiousSeal95 Thanks a lot. Seems to be working fine for me. I see the clearml-agent version that pip installs in the docker is now fixed to the host version 🙂 PyTorch Nightly is also installed correctly now!
Now the pip packages seem to ship with CUDA, so this does not seem to be a problem anymore.
Nvm. I think I understood: when a file has never been added to the repository, it is not tracked.
With clearml==1.4.1 it works, but with the current version it aborts. Here is a log with the latest clearml.
Maybe this is something that is only possible with the vault of the enterprise version?
First one is the original, second one the clone
I was wrong: I think it uses the agent.cuda_version, not the local env CUDA version.
For everyone who had the patience to read through everything, here is my solution to make clearml work with ssh-agent forwarding in the current version:
1. Start an ssh-agent.
2. Add the ssh keys to the agent with ssh-add.
3. echo $SSH_AUTH_SOCK and paste the value into clearml.conf as shown here: https://github.com/allegroai/clearml-agent/issues/45#issuecomment-779302144 (replace $SSH_AUTH_SOCKET with the actual value).
4. Move all the files except known_hosts out of ~/.ssh of the clearml-agent workstation (a shell sketch of steps 1-4 follows below).
5. Start the...
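A rough shell sketch of those steps, assuming the agent runs on that workstation (key path and backup directory are placeholders; the clearml.conf part is described in the linked GitHub comment):
```bash
# 1. start an ssh-agent in the current shell
eval "$(ssh-agent -s)"

# 2. add the ssh key(s) your repositories need (placeholder key path)
ssh-add ~/.ssh/id_ed25519

# 3. print the socket path and paste it into clearml.conf as in the linked comment
echo "$SSH_AUTH_SOCK"

# 4. move everything except known_hosts out of ~/.ssh on the agent workstation
mkdir -p ~/.ssh_backup
find ~/.ssh -maxdepth 1 -type f ! -name known_hosts -exec mv {} ~/.ssh_backup/ \;
```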
The problem is that clearml installs cudatoolkit=11.0, but cudatoolkit=11.1 is needed. By setting agent.cuda_version=11.1 in clearml.conf it uses the correct version and installs fine. With version 11.0, conda will resolve the conflicts by installing the pytorch cpu-version.
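In case it helps anyone, the override in clearml.conf is just this one key; a minimal sketch (the exact value format may depend on your clearml-agent version):
```
agent {
    # force the CUDA version the agent assumes when resolving packages
    cuda_version: 11.1
}
```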
clearml==0.17.4
task dca2e3ded7fc4c28b342f912395ab9bc pulled from a238067927d04283842bc14cbdebdd86 by worker redacted-desktop:0
Running task 'dca2e3ded7fc4c28b342f912395ab9bc'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.vjg4k7cj.txt', '/tmp/.clearml_agent_out.vjg4k7cj.txt'
Current configuration (clearml_agent v0.17.1, location: /tmp/.clearml_agent.us8pq3jj.cfg):
agent.worker_id = redacted-desktop:0
agent.worker_name = redacted-desktop
agent.force_git_ssh...
MortifiedDove27 Sure did, but I do not understand it very well. Else I would not be asking here for an intuitive explanation 🙂 Maybe you can explain it to me?
My clearml-server crashed for some reason, so I won't be able to verify until tomorrow.
However, I have not yet found a flexible solution other than ssh-agent forwarding.