I do not have a global CUDA install on this machine. Everything except for the driver is installed via conda.
I tried to run the task with detect_with_conda_freeze: false instead of true and got:
Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: Unsati...
ca-certificates 2021.1.19 h06a4308_1
certifi 2020.12.5 py38h06a4308_0
cudatoolkit 11.0.221 h6bb024c_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses ...
clearml will register preinstalled conda packages as requirements.
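For reference, this is the switch I was toggling — a sketch of the relevant part of my clearml.conf (I am assuming it sits under sdk.development, as it does in my config):

sdk {
  development {
    # when true, the task's requirements are taken from a full conda freeze
    # of the local environment instead of the detected imports
    detect_with_conda_freeze: false
  }
}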
So it seems to definitely be a problem with docker and not with clearml. However, I do not get why it works for you but on none of my machines (all Ubuntu 20.04 with Docker 20.10).
In the beginning my config file was not empty 😕
Related to this: how does the local cache/agent cache work? Are the sdk.storage.cache parameters for the agent? When are datasets deleted from the cache? When are datasets deleted if I run local execution?
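For context, the only cache knob I can find is in my clearml.conf — a sketch (default_base_dir is the key I see there; whether the agent honors the same section is exactly what I am unsure about):

sdk {
  storage {
    cache {
      # local directory where downloaded artifacts/datasets are cached
      default_base_dir: "~/.clearml/cache"
    }
  }
}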
I only added

# Python 3.8.2 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0]
--extra-index-url
clearml
torch == 1.14.0.dev20221205+cu117
torchvision == 0.15.0.dev20221205+cpu

and I used an amd64/ubuntu:20.04 docker image with python3.8. Same error. If it is not too much to ask, could you try to run it with this docker image?
However, I cloned the experiment again via the web UI and then enqueued it.
Hey, thank you for answering.
I know this issue and I have it sometimes, but my current issue is a direct result of me trying to make SSL work. So I am not asking for help in solving my problem, but only for help with how to debug it: finding out which step leads to the artifact not being deleted (e.g. the fileserver cannot be reached from wherever the delete request is sent).
At least when you use docker containers, the agent will reuse the existing Python environment.
I just checked and my user is part of the docker group.
Sounds like a good hack, but not like a good solution 😄 But thank you anyways! 🙂
No reason in particular. How many people work at http://allegro.ai?
Thanks a lot. But even for a user, I cannot set a default for all projects, right?
Here is a part of the cleanup service log. Unfortunately, I cannot even download the full log currently, because the clearml-server will just throw errors for everything.
I see, I just checked the logs and it shows:

urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f246f0d6c18>: Failed to establish a new connection: [Errno 111] Connection refused
[2022-04-29 08:45:55,018] [9] [WARNING] [elasticsearch] POST [status:N/A request:0.000s]
Unfortunately, there are no logs in /usr/share/elasticsearch/logs to see what elastic was up to.
Or maybe a different question: what is not "Artifacts and Models, debug samples (or anything else the Logger class creates)"?
Also, is it not possible to use multiple file servers? E.g. log tasks to different S3 buckets without changing clearml.conf?
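As far as I can tell, there is only one global fileserver key — a sketch of my clearml.conf (MinIO endpoint as above, bucket name made up):

api {
  # single, global default destination for uploads
  files_server: "s3://my_minio_instance:9000/bucket_a"
}

The only per-task escape hatch I know of is the output_uri argument of Task.init, but that still means touching code rather than configuration.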
Thank you. I am not trying to use this option to speed up the setup. I have a package (the carla simulator PythonAPI) that has no pip support (only easy_install). So I am thinking about just installing it manually on the worker, so that tasks can assume that carla is provided by the system.
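If I go that route, I guess the relevant agent switch would be something like this (a sketch; I am assuming system_site_packages is the right key for letting tasks see packages installed on the worker itself):

agent {
  package_manager {
    # build the task's virtualenv on top of the system site-packages,
    # so a manually installed carla would be visible to tasks
    system_site_packages: true
  }
}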
I guess then it is hard to solve, and it is probably not worth it for me to make suggestions without any knowledge about the internals 😕 Seems like a small weakness in the design of the open-source version, but not much of an issue 🙂
When I go into the GUI there are no artifacts displayed.
Yea, I am still trying to get docker to work with clearml. I do not have much experience with docker besides creating Dockerfiles, and it seems like the ~/.ssh/config ownership is broken when it is mounted into the container on my workstations.
I can put anything there: s3://my_minio_instance:9000/bucket_that_does_not_exist and it will work.
I have set default_output_uri to s3://my_minio_instance:9000/clearml.
If I set files_server to s3://my_minio_instance:9000/bucket_that_does_not_exist, it fails at uploading metrics, but model upload still works:
WARNING - Failed uploading to s3://my_minio_instance:9000/bucket_that_does_not_exist ('NoneType' object has no attribute 'upload')
clearml.Task - INFO - Completed model upload to s3://my_minio_instance:9000/clearml
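To make the two settings I am comparing explicit, this is roughly how my clearml.conf looks (a sketch; values as above, and I am assuming default_output_uri sits under sdk.development as in my config):

api {
  # fails at metric upload when pointed at a non-existing bucket
  files_server: "s3://my_minio_instance:9000/bucket_that_does_not_exist"
}
sdk {
  development {
    # model upload still works, going here instead
    default_output_uri: "s3://my_minio_instance:9000/clearml"
  }
}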
What is ` default_out...