Reputation
Badges 1
611 × Eureka![2021-05-07 10:52:00,282] [9] [WARNING] [elasticsearch] POST ` [status:N/A request:60.058s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", lin...
@<1523701087100473344:profile|SuccessfulKoala55> Only when I delete on self-hosted.
@<1523712723274174464:profile|LazyFish41> WebApp: 1.10.0-357 • Server: 1.10.0-357 • API: 2.24
This has been happening with every version of clearml-server ever. Most probably there should be a queue in front of ES, so it does not process to many request at the same time?
Hi CostlyOstrich36 , thank you for answering so quick. I think that s not how it works because if this was true, one would have to always match local machine to servers. Afaik clearml finds the correct PyTorch Version, but I was not sure how (custom vs pip does it)
So my network seems to be fine. Downloading artifacts from the server to the agents is around 100 MB/s, while uploading from the agent to the server is slow.
Tested with clearml-agent 1.0.1rc4/1.2.2 and clearml 1.3.2
I am wondering cause when used in docker mode, the docker container may have a CUDA Version that is different from the host version. However, ClearML seems to use the host version instead of the docker container's version, which is a problem sometimes.
Nvm, I think its my mistake. I will investigate.
I used the wrong docker container. The docker container I used had version 11.4. Interestingly, the override from clearml.conf and CUDA_VERSION Env variable did not work there.
With the correct docker container everything works fine. Shame on me.
But yeah, I see the point of enterprise having this feature and basic not 🙂
Mhhm, then maybe it is not clear 😂 to me how clearml.Task is meant to be used. I thought of it as being a container for all the information regarding a single experiment that is reflected on the server-side and by this in the WebUI. Now I init() a Task and it will show in the WebUI. I thought after initialization I can still update the task to my liking, i.e. it being a documentation of my experiment.
AgitatedDove14 fyi I think this is the issue I have: https://stackoverflow.com/a/65526944/3038183
I have to correct myself, I do not even have CUDA installed. Only the driver and everything CUDA-related is provided by the docker container. This works with a container that has CUDA 11.4, but now I have one with 11.6 (latest nvidia pytorch docker).
However, even after changing the clearml.conf and overriding with CUDA_VERSION, the clearml-agent prints on the docker container agent.cuda_version = 114 ! (Other changes to the clearml.conf on the agent are reflected in the docker, so only...
- solves it. I did not know this is possible.
The problem is that clearml installs cudatoolkit=11.0 but cudatoolkit=11.1 is needed. By setting agent.cuda_version=11.1 in clearml.conf it uses the correct version and installs fine. With version 11.0 conda will resolve conflicts by installing pytorch cpu-version.
clearml==0.17.4
` task dca2e3ded7fc4c28b342f912395ab9bc pulled from a238067927d04283842bc14cbdebdd86 by worker redacted-desktop:0
Running task 'dca2e3ded7fc4c28b342f912395ab9bc'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.vjg4k7cj.txt', '/tmp/.clearml_agent_out.vjg4k7cj.txt'
Current configuration (clearml_agent v0.17.1, location: /tmp/.clearml_agent.us8pq3jj.cfg):
agent.worker_id = redacted-desktop:0
agent.worker_name = redacted-desktop
agent.force_git_ssh...
Thu Mar 11 17:52:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | ...
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=conda_forge
_openmp_mutex=4.5=1_llvm
absl-py=0.12.0=pypi_0
aiostream=0.4.2=pypi_0
attrs=20.3.0=pypi_0
blas=1.0=mkl
bzip2=1.0.8=h7b6447c_0
ca-certificates=2020.10.14=0
cached-property=1.5.2=pypi_0
cachetools=4.2.1=pypi_0
certifi=2020.6.20=py37_0
chardet=4.0.0=pypi_0
clearml=0.17.4=pypi_0
cloudpickle=1.6.0=py_0
cudatoolkit=11.1.1=h6406543_8
cycler...
First one is the original, second one the clone
Btw: Is it intented that the folder structures in the fileserver directories is not deleted?
I think sometimes there can be dependencies that require a newer pip version or something like that. I am not sure though. Why can we even change the pip version in the clearml.conf?
Tried to install cudatoolkit==11.1 manually in this environemnt and got:
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package xz conflicts for:
python=3....