I got the error again. Seems to happen only when I try to delete "large" experiments.
Very nice!
Maybe in the long term you could look into making better use of vertical space. Currently, there are 7 (5 in fullscreen mode) different sections between the top of the page and the content. Maybe a compact mode would be nice, or less space for content headlines.
Now I get:
Collecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
...
I installed my local conda environment from an environment.yml without issues, so maybe clearml makes some changes that lead to conflicts, which finally lead to the CPU-version install.
Thank you very much, didn't know about that 🙂
@<1523701087100473344:profile|SuccessfulKoala55> I just did the following (everything locally, not with clearml-agent)
- Set my credentials and S3 endpoint to A
- Run a task with Task.init() and save a debug sample to S3
- Abort the task
- Change my credentials and S3 endpoint to B
- Restart the task
The result is lingering files in A that do not seem to be associated with the task. I would expect ClearML to instead fail the task or to track the lingering files somewhere, so they can ma...
I think in the paid version there is this configuration vault, so that the user can pass their own credentials securely to the agent.
I use fixed users!
Perfect, just what I always wanted. Looking forward to the MinIO version. Thank you :)
In my case I use the conda freeze option and do not even have CUDA installed on the agents.
So I just updated the env that clearml-agent created (and where the PyTorch CPU version is installed) with my local environment.yml and now the correct version is installed, so most probably the `/tmp/conda_envaz1ne897.yml` is the problem here
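A minimal sketch for comparing the two environment files side by side, assuming both are plain conda environment files with a `dependencies:` list. The helper names `parse_dependencies` and `diff_envs` and the sample file contents are made up for illustration; the parsing is deliberately line-based and does not cover every YAML feature.

```python
import re


def parse_dependencies(yml_text):
    """Map package name -> full spec for entries under `dependencies:`.

    Line-based parsing only (no YAML library); nested `- pip:` keys are skipped,
    but the entries listed below them are still collected.
    """
    deps = {}
    in_deps = False
    for raw in yml_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue
        if not line.startswith((" ", "-")):
            # Top-level key such as "name:", "channels:" or "dependencies:"
            in_deps = line.strip() == "dependencies:"
            continue
        if not in_deps:
            continue
        stripped = line.strip()
        if not stripped.startswith("- "):
            continue
        spec = stripped[2:].strip()
        if spec.endswith(":"):
            continue  # nested key such as "pip:"
        name = re.split(r"[=<>!~ ]", spec, maxsplit=1)[0].lower()
        deps[name] = spec
    return deps


def diff_envs(a_text, b_text):
    """Return {name: (spec_in_a, spec_in_b)} for every package that differs."""
    a, b = parse_dependencies(a_text), parse_dependencies(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up contents, NOT the real files from this thread.
local_yml = """name: demo
dependencies:
  - pytorch=1.8.1
  - cudatoolkit=11.1
  - numpy=1.20
"""
generated_yml = """name: demo
dependencies:
  - pytorch=1.8.1=py3.8_cpu_0
  - numpy=1.20
"""

differences = diff_envs(local_yml, generated_yml)
for name, (local_spec, generated_spec) in differences.items():
    print(f"{name}: local={local_spec} generated={generated_spec}")
```

In practice the two strings would be read from the local environment.yml and from the temporary file the agent writes, which would show whether a pinned build string or a missing CUDA package is what pulls in the CPU build.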
[2021-05-07 10:52:00,282] [9] [WARNING] [elasticsearch] POST ` [status:N/A request:60.058s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", lin...
@<1523701087100473344:profile|SuccessfulKoala55> Only when I delete on self-hosted.
@<1523712723274174464:profile|LazyFish41> WebApp: 1.10.0-357 • Server: 1.10.0-357 • API: 2.24
This has been happening with every version of clearml-server ever. Most probably there should be a queue in front of ES, so it does not process too many requests at the same time?
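To illustrate what I mean by a queue in front of ES, here is a minimal sketch in plain Python. It is not how clearml-server is implemented; `ThrottledDispatcher` and `fake_delete_request` are hypothetical names, and the sleep merely stands in for a slow Elasticsearch call. Requests are put on a queue and a fixed number of workers drain it, so no more than `max_concurrent` requests are in flight at once.

```python
import queue
import threading
import time


class ThrottledDispatcher:
    """Run submitted callables on a fixed number of worker threads."""

    def __init__(self, max_concurrent):
        self._jobs = queue.Queue()
        self.errors = []  # exceptions raised by jobs, kept for inspection
        for _ in range(max_concurrent):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, fn, *args):
        """Enqueue a call; it runs as soon as a worker is free."""
        self._jobs.put((fn, args))

    def wait(self):
        """Block until every submitted job has finished."""
        self._jobs.join()

    def _run(self):
        while True:
            fn, args = self._jobs.get()
            try:
                fn(*args)
            except Exception as exc:  # keep the worker alive on failures
                self.errors.append(exc)
            finally:
                self._jobs.task_done()


# Bookkeeping to observe how many fake requests overlap.
stats = {"in_flight": 0, "peak": 0, "done": []}
lock = threading.Lock()


def fake_delete_request(doc_id):
    """Stand-in for one slow request to the search backend."""
    with lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.01)
    with lock:
        stats["in_flight"] -= 1
        stats["done"].append(doc_id)


dispatcher = ThrottledDispatcher(max_concurrent=2)
for i in range(10):
    dispatcher.submit(fake_delete_request, i)
dispatcher.wait()
print("peak concurrent requests:", stats["peak"])
```

With something like this, deleting a "large" experiment would take longer overall, but the backend would not see all requests at once.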
Thanks for researching this issue. If you have time, you can create the issue since you are way more knowledgeable, but I can also open it if you do not have time 🙂
Hi CostlyOstrich36, thank you for answering so quickly. I think that's not how it works, because if this were true, one would always have to match the local machine to the servers. Afaik clearml finds the correct PyTorch version, but I was not sure how (custom logic, or does pip do it?)
So my network seems to be fine. Downloading artifacts from the server to the agents is around 100 MB/s, while uploading from the agent to the server is slow.
Tested with clearml-agent 1.0.1rc4/1.2.2 and clearml 1.3.2
I am asking because in docker mode, the docker container may have a CUDA version that is different from the host version. However, ClearML seems to use the host version instead of the docker container's version, which is sometimes a problem.
Nvm, I think it's my mistake. I will investigate.
I used the wrong docker container. The docker container I used had version 11.4. Interestingly, the override from clearml.conf and the CUDA_VERSION env variable did not work there.
With the correct docker container everything works fine. Shame on me.
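For anyone who wants to check this before blaming the agent, here is a rough sketch of comparing the CUDA version that `nvidia-smi` prints in its header with the `CUDA_VERSION` environment variable inside the container. It assumes the image sets `CUDA_VERSION` (as far as I know the official nvidia/cuda images do) and that the `nvidia-smi` header contains a `CUDA Version: X.Y` field. The helper names and the sample strings are made up, and this is not ClearML's own detection logic.

```python
import re


def cuda_from_nvidia_smi(text):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_from_env(value):
    """Extract (major, minor) from a CUDA_VERSION-style string such as '11.1.1'."""
    match = re.match(r"\s*(\d+)\.(\d+)", value or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def describe(driver_version, container_version):
    """Summarise whether the two reported versions agree."""
    if driver_version is None or container_version is None:
        return "could not determine one of the versions"
    if driver_version == container_version:
        return "versions match: %d.%d" % driver_version
    return "mismatch: nvidia-smi reports %d.%d, container reports %d.%d" % (
        driver_version + container_version
    )


# Illustrative inputs only, not captured from a real machine.
sample_smi_header = "| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |"
sample_env_value = "11.1.1"

summary = describe(cuda_from_nvidia_smi(sample_smi_header), cuda_from_env(sample_env_value))
print(summary)
```

On a real machine the inputs would come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and `os.environ.get("CUDA_VERSION")`, both evaluated inside the container the task actually runs in.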
Ok. I just wanted to make sure I have configured my agent properly, and to confirm that I have to set it on all agents.
But yeah, I see the point of enterprise having this feature and basic not 🙂
@<1523701205467926528:profile|AgitatedDove14> Thank you very much for your guidance. Setting these manually works for me!