name: core
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.12.5
- certifi=2020.12.5
- cudatoolkit=11.1.1
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- jpeg=9b
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff...
Thank you very much, gonna try that!
So I just updated the env that clearml-agent created (and where pytorch cpu is installed) with my local environment.yml, and now the correct version is installed, so most probably the `/tmp/conda_envaz1ne897.yml` is the problem here
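For anyone else hitting this: updating an agent-created env from a local spec is roughly the following (a sketch; the generated env name will differ on your machine):

# update the agent-created conda env in place from the local spec file
conda env update --name <agent_env_name> --file environment.yml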
Thanks! I am fascinated by what you guys offer with clearml 🙂
Exactly. I don't want people to circumvent the queue 🙂
The problem is that clearml installs cudatoolkit=11.0 but cudatoolkit=11.1 is needed. By setting agent.cuda_version=11.1 in clearml.conf it uses the correct version and installs fine. With version 11.0, conda will resolve the conflicts by installing the pytorch cpu-version.
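For anyone hitting the same thing, the relevant bit of clearml.conf looks roughly like this (a sketch; check the comments in your own clearml.conf for the exact key):

agent {
    # pin the CUDA version the agent resolves packages against
    cuda_version: 11.1
}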
btw: why is agent.package_manager an agent attribute? Imo it does not make sense, because conda can install pip packages, but pip cannot install conda packages, which can lead to install failures, right?
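(In clearml.conf it currently sits under the agent section, something like this sketch, with key names as in the shipped example config:

agent {
    package_manager {
        # "pip" or "conda"
        type: conda
    }
}
)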
Now I get:
Collecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
...
I don't think so. It is related to the issue with the clearml-server I posted in the other thread. Essentially the clearml-server hangs; then I restart it with docker-compose down && docker-compose up -d and the experiments sometimes show as running, but on the clearml-agents I see that actually nothing is running, or they show as aborted.
I know that usually clearml-agents do not abort on server restart and just continue.
Btw: Is it intended that the folder structures in the fileserver directories are not deleted?
Hi SuccessfulKoala55
I meant that in the WebUI deletion should only be allowed for artifacts for which deletion actually works.
For example I now have a lot of lingering artifacts that exist on the fileservers, but not on the clearml-api-server (I think).
Another example: I delete a task via the WebUI. The clearml-server tries to delete the task and the artifacts belonging to the task. However, it will show that the task has been successfully deleted even though some artifacts have not been. Now there is no way...
Well, I guess no hurdles vs. safety is inherently not solvable. I am all for hurdles, if it is clear how to overcome them. And in my opinion, referring to clearml-init is something which makes sense from a developer and a user perspective.
Ah, okay, that's weird 🙂 Thank you for answering!
Btw: It is weird that the fileservers are directly exposed, so no authentication through the webserver is needed. Is this something that is different in the paid version or why is it like that in the open-source version?
Ah, thanks a lot. So for example the CleanUp Service ( https://github.com/allegroai/clearml/blob/master/examples/services/cleanup/cleanup_service.py ) should have no trouble deleting the artifacts.
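For context, a minimal sketch of deleting a task together with its artifacts from code; this assumes the Task.delete() helper available in newer clearml versions (the linked cleanup service itself goes through the lower-level APIClient):

from clearml import Task

# placeholder id; assumes Task.delete() exists in your clearml version
task = Task.get_task(task_id="<task-id>")
# remove the task plus its artifacts/models from the fileserver
task.delete(delete_artifacts_and_models=True, raise_on_error=False)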
One thing I want to add: maybe you should disable deleting the artifact entries if the file-server deletion fails. It doesn't make sense that we lose the ability to track files that still exist just because something went wrong.
Ah, I see. Any way to make the UI recognize it as a file server?
So the service is something that I can write and that intercepts the addition of a task to a queue?
Same here with most recent clearml-server 😞
The docker run command of the agent includes '-v', '/tmp/clearml_agent.ssh.8owl7uf2:/root/.ssh' and the file permissions are exactly the same.
Yeah. Not using the config file does not seem like a good long-term solution to me. However, I still have no idea why this error happens. But enough for today. Thanks a lot for your help!
Maybe the problem is that I do not start my docker containers from the root user, so 1001 is a mapping inside the docker to my actual user. Could it be that on the host the owner of your .ssh files is root?
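A quick way to verify is to compare the numeric owner on the host and inside the container (a sketch; replace the container name with your agent's):

# on the host: numeric uid/gid of the mounted key files (e.g. 1001 vs 0 for root)
ls -ln /tmp/clearml_agent.ssh.8owl7uf2
# inside the agent container
docker exec <agent_container> ls -ln /root/.ssh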
However, I have not yet found a flexible solution other than ssh-agent forwarding.
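For reference, ssh-agent forwarding into a container typically looks like this (a common Linux pattern, not clearml-specific):

# mount the host agent socket and point SSH_AUTH_SOCK at it inside the container
docker run -v $SSH_AUTH_SOCK:/ssh-agent -e SSH_AUTH_SOCK=/ssh-agent <image>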
In the beginning my config file was not empty 😕