Thank you. I am still having the issue. I verified that `output_uri` of `Task.init` works and that `clearml-data` with MinIO storage works, but the logger still throws errors.
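For reference, this is roughly the pattern that works for me (a minimal sketch; the MinIO endpoint and bucket are placeholders, and the credentials for the MinIO host live in the `sdk.aws.s3` section of clearml.conf):

```python
from clearml import Task

# Point task outputs (models, artifacts) at a MinIO bucket.
# Endpoint and bucket below are hypothetical placeholders;
# credentials go in sdk.aws.s3.credentials in clearml.conf.
task = Task.init(
    project_name="examples",
    task_name="minio-output",
    output_uri="s3://minio.example.com:9000/clearml",
)
```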
btw: I am pretty sure this used to work, but then stopped working some time ago.
I am going to try it again and send you the relevant part of the logs in a minute. Maybe I am interpreting something wrong.
```
=============
== PyTorch ==
NVIDIA Release 22.03 (build 33569136)
PyTorch Version 1.12.0a0+2c916ef ...
Looking in indexes: ,
Requirement already satisfied: pip in /root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (22.0.4)
2022-04-07 16:40:57
Looking in indexes: ,
Requirement already satisfied: Cython in /opt/conda/lib/python3.8/site-packages (0.29.28)
Looking in indexes: ,
Requirement already satisfied: numpy==1.22.3 in /opt/conda/...
```
Nvm, that does not seem to be a problem. I added a part to the logs in the post above. It shows that some packages are found from conda.
Maybe the difference is that I am using pip now and I used to use conda! The NVIDIA PyTorch container uses conda. Could that be the reason?
I just manually went into the docker container, ran `python -m venv env --system-site-packages`, and activated the virtual env.
When I run `pip list` there, it correctly shows the preinstalled packages, including `torch 1.12.0a0+2c916ef`.
Here it is
That I understand. But I think (old) pip versions will sometimes fail to resolve a package. That's probably not the case the other way around.
Oh, interesting!
So a pip version on a per-task basis makes sense ;D?
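A sketch of what I mean, assuming the agent's clearml.conf `agent.package_manager.pip_version` setting (which pins pip per agent rather than per task):

```
# clearml.conf on the agent machine (sketch)
agent {
    package_manager {
        # pin the pip version the agent installs into task venvs
        pip_version: "==22.0.4"
    }
}
```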
The one I posted on top: `22.03-py3`
😄
Thank you very much! 😃
And how do I specify this fileserver as the `output_uri`? The default file server is selected by passing `True`. How would I specify using the second one?
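Something like this is what I would try, a sketch assuming the second fileserver is reachable at a hypothetical `http://my-server:8082`:

```python
from clearml import Task

# Instead of output_uri=True (the default file server), pass the URL
# of the second fileserver explicitly. Host and port are hypothetical.
task = Task.init(
    project_name="examples",
    task_name="second-fileserver",
    output_uri="http://my-server:8082",
)
```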
Any idea why deletion of artifacts on my second fileserver does not work?
```
fileserver_datasets:
  networks:
    - backend
    - frontend
  command:
    - fileserver
  container_name: clearml-fileserver-datasets
  image: allegroai/clearml:latest
  restart: unless-stopped
  volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver-datasets:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
  ports:
    - "8082:8081"
```
ClearML successfu...
Btw: It is weird that the fileservers are directly exposed, so no authentication through the webserver is needed. Is this different in the paid version, or why is it like that in the open-source version?
Ah, I see. Any way to make the UI recognize it as a file server?
I guess the supported storage mediums (e.g. S3, Ceph, etc.) don't have this issue, right?
Ah, okay, that's weird 🙂 Thank you for answering!
Btw: Is it intended that the folder structures in the fileserver directories are not deleted?
Hi SuccessfulKoala55
I meant that in the WebUI, deletion should only be allowed for artifacts for which deletion actually works.
For example, I now have a lot of lingering artifacts that exist on the fileservers but not on the clearml-api-server (I think).
Another example: I delete a task via the WebUI. The ClearML server tries to delete the task and the artifacts belonging to it. However, it will show that the task has been deleted successfully even though some artifacts have not. Now there is no way...
Yea, when the server handles the deletes, everything's fine, and imo that is how it should always have been.
I don't think it is a viable option. You are looking at the best case, but I think you should expect the worst from the users 🙂 Also, I would rather know there is a problem and have some clutter than hide it and never be able to fix it, because I cannot identify which artifacts are still in use without spending a lot of time comparing artifact IDs.
Perfect and thank you for your efforts! :)
No problem. Sounds like a good solution, no need to implement something that has already been implemented somewhere else 🙂
I don't think so. It is related to the issue with the clearml-server I posted in the other thread. Essentially, the clearml-server hangs, then I restart it with `docker-compose down && docker-compose up -d`,
and the experiments sometimes show as running, but on the clearml-agents I see that nothing is actually running, or they show as aborted.
I know that usually clearml-agents do not abort on server restart and just continue.
I don't know, actually. But the PyTorch documentation says it can make a difference: https://pytorch.org/docs/stable/distributions.html#torch.distributions.distribution.Distribution.set_default_validate_args
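For reference, the toggle from the linked docs, in a minimal sketch:

```python
import torch
from torch.distributions import Distribution, Normal

# Globally disable argument validation for all distributions.
# Validation catches invalid parameters (e.g. a negative scale)
# but adds some overhead to every distribution construction.
Distribution.set_default_validate_args(False)

dist = Normal(loc=0.0, scale=1.0)
print(dist.sample())
```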