I'll add creating an issue to my todo list.
I am wondering because, when used in docker mode, the docker container may have a CUDA version that differs from the host's. However, ClearML seems to use the host version instead of the docker container's version, which is sometimes a problem.
My clearml-server crashed for some reason, so I won't be able to verify until tomorrow.
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'
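For reference, this is roughly how I compare the two inside the container (a minimal sketch; it assumes torch is installed in the container image):

from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
import torch

# CUDA version reported by the host driver (e.g. '110' -> 11.0)
driver_cuda = get_driver_cuda_version()
# CUDA version the container's torch build actually targets (e.g. '11.7')
container_cuda = torch.version.cuda

print(f"driver reports CUDA {driver_cuda}, container torch was built for CUDA {container_cuda}")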
What I am trying to do is install this: torch == 1.14.0.dev20221205+cu117 torchvision == 0.15.0.dev20221205+cpu
Is this what you mean by specific build?
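For context, this is roughly how I pin those nightly builds, as a sketch (the index URLs are an assumption on my side, and nightly wheels disappear quickly):

# requirements.txt (sketch)
--extra-index-url https://download.pytorch.org/whl/nightly/cu117
--extra-index-url https://download.pytorch.org/whl/nightly/cpu
torch==1.14.0.dev20221205+cu117
torchvision==0.15.0.dev20221205+cpu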
Let me try it another time. Maybe something else went wrong.
For now I can tell you that with conda_freeze: true it fails, but with conda_freeze: false it works!
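In case it matters, the flag I am toggling lives in my clearml.conf roughly like this (a sketch; the conda_freeze line is the only part I actually change):

agent {
    package_manager {
        type: conda
        # true fails for me, false works
        conda_freeze: false
    }
}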
Hi KindChimpanzee37, I was asking more about the general idea of making these settings task-specific, but thank you for the suggestion anyway; I will definitely apply it.
Ah, thanks a lot. So for example the CleanUp Service ( https://github.com/allegroai/clearml/blob/master/examples/services/cleanup/cleanup_service.py ) should have no trouble deleting the artifacts.
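For my own notes, deleting a task together with its artifacts looks roughly like this (a sketch; it assumes a recent clearml version where Task.delete accepts these flags, and the task ID is a placeholder):

from clearml import Task

# placeholder task ID; delete the task plus its artifacts and models,
# but keep models that other tasks still reference
old_task = Task.get_task(task_id="<old_task_id>")
old_task.delete(delete_artifacts_and_models=True, skip_models_used_by_other_tasks=True)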
Thanks for answering. I don't quite get your explanation. You mean if I have 100 experiments and I start up another one (experiment "101"), then experiment "0" logs will get replaced?
And how do I specify this in the output_uri? The default file server is specified by passing True. How would I specify to use the second one?
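What I mean is roughly this (a sketch; the project/task names and the second fileserver URL are just placeholders):

from clearml import Task

# output_uri=True would use the default files_server from clearml.conf;
# an explicit URL should point this task's artifacts at a second file server
task = Task.init(
    project_name="my_project",                   # placeholder
    task_name="upload_to_second_fileserver",     # placeholder
    output_uri="http://second-fileserver:8081",  # hypothetical second fileserver
)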
I guess the supported storage mediums (e.g. S3, Ceph, etc.) don't have this issue, right?
Can you give me an example of how I can add a second fileserver?
Ah, I see. Any way to make the UI recognize it as a file server?
I will debug this myself a little more.
Thu Mar 11 17:52:45 2021
nvidia-smi output (truncated): NVIDIA-SMI 460.56, Driver Version: 460.56, CUDA Version: 11.2
Let me check again.
I see a python3 fileserver.py process running on a single thread with 100% load.
Yea, and the script ends with clearml.Task - INFO - Waiting to finish uploads
Seems more like a bug or something is not properly configured on my side.
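To narrow it down, I might force the wait explicitly at the end of the script (a sketch; just to see whether it is really the artifact upload that blocks):

from clearml import Task

# explicitly wait for pending uploads before the script exits
task = Task.current_task()
if task is not None:
    task.flush(wait_for_uploads=True)
    task.close()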
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit~=11.1.1
- pytorch~=1.8.0
Works fine
AgitatedDove14 Yea, I also had this problem: https://github.com/allegroai/clearml-server/issues/87 . I have a Samsung 970 Pro 2TB in all machines, but maybe something is misconfigured, like SuccessfulKoala55 suggested. I will take a look. Thank you for now!
AlertBlackbird30 Thanks for asking. Just take everything I say with a grain of salt, because I am also not sure whether I do machine learning the correct way 😄
I think you got the right idea. I actually do reinforcement learning (RL), so I have multiple RL environments and RL agents. However, while the code for the agents differs between agents, the glue code is the same. So what I do is call python run_experiment.py --agent myproject.agents.my_agent --environm...
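The glue-code pattern looks roughly like this (a sketch; module paths, class names and any flags beyond --agent/--environment are placeholders, since my real command is truncated above):

# run_experiment.py (sketch)
import argparse
import importlib

def load_class(dotted_path):
    # e.g. "myproject.agents.my_agent.MyAgent" -> import the module, return the class
    module_path, class_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), class_name)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--agent", required=True)        # dotted path to the agent class
    parser.add_argument("--environment", required=True)  # dotted path to the environment class
    args = parser.parse_args()

    agent = load_class(args.agent)()
    environment = load_class(args.environment)()

    # the same glue code runs every agent/environment combination
    agent.train(environment)

if __name__ == "__main__":
    main()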