I would also like to avoid any copy of these artifacts on S3 (to avoid double costs, since some folders might be big)
I am trying to upload an artifact during the execution
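For context, here is roughly what I am doing (a minimal sketch assuming the trains/clearml SDK; the artifact name and file path are just placeholders):

    from clearml import Task  # `from trains import Task` on older versions

    task = Task.current_task()  # the task currently being executed
    # 'predictions' and the file path are placeholders for my real artifact
    task.upload_artifact(name='predictions', artifact_object='/tmp/predictions.csv')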
Nothing wrong from the ClearML side
did you try with another availability zone?
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn't update the apiserver.auth.cookies.domain
setting - I did that, restarted, and now it works. Thanks!
RobustRat47 It can also simply be that the instance type you declared is not available in the zone you defined
Try to spin up the instance of that type manually in that region to see if it is available
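You can also check it programmatically, e.g. with boto3's describe_instance_type_offerings (a rough sketch; the region, instance type and zone below are just examples):

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')  # example region
    resp = ec2.describe_instance_type_offerings(
        LocationType='availability-zone',
        Filters=[
            {'Name': 'instance-type', 'Values': ['g4dn.xlarge']},  # example instance type
            {'Name': 'location', 'Values': ['us-east-1a']},        # example zone
        ],
    )
    # an empty InstanceTypeOfferings list means the type is not offered in that zone
    print(resp['InstanceTypeOfferings'])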
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent mentioned it used the output from nvcc (2) before checking the nvidia driver version (1)
Amazon Linux
Oh I see, I think we are now touching a very important point:
I thought that torch wheels already included the cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because in the end only the cuda/cudnn libraries shipped with the torch wheels are used. Is this correct? If not, does that mean that one should use conda to install the matching cudatoolkit/cudnn?
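To make the question concrete, this is how I check which versions are actually in play (a sketch; it only assumes a standard pip-installed torch and an nvidia driver on the machine):

    import torch

    print(torch.__version__)               # e.g. 1.3.1
    print(torch.version.cuda)              # CUDA runtime the wheel was built against
    print(torch.backends.cudnn.version())  # cuDNN version torch loads
    print(torch.cuda.is_available())       # False if the system driver is too old for that runtime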
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers on the system
yes, that's also what I thought
Not really: I just need to find the one that is compatible with torch==1.3.1
Nevermind, I just saw report_matplotlib_figure
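For anyone else looking for it, this is roughly how it is used (a sketch with the Logger API; keyword arguments to avoid relying on positional order, and the title/series names are just examples):

    import matplotlib.pyplot as plt
    from clearml import Logger  # `from trains import Logger` on older versions

    fig = plt.figure()
    plt.plot([1, 2, 3], [4, 5, 6])

    # assumes Task.init() was already called earlier in the script
    Logger.current_logger().report_matplotlib_figure(
        title='My figure',   # example title
        series='series A',   # example series name
        iteration=0,
        figure=fig,
    )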
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
To clarify: trains-agent runs a single service Task only
Yes I agree, but I get a strange error when using dataloaders:
RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error
only when I use num_workers > 0
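For the record, the workaround I am trying (a sketch; it assumes the error comes from CUDA being initialized in the parent process before the dataloader workers are forked) is to switch the workers to the spawn start method:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.arange(10).float())
    loader = DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,
        # use 'spawn' so the workers don't inherit a fork of a CUDA-initialized process
        multiprocessing_context='spawn',
    )
    for batch in loader:
        pass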
Probably 6. I think for some reason it did not go back to the main trains-agent. I am not sure though, because a second task could start. It could also be that the second one was aborted for some reason while installing the task requirements (not the system requirements, i.e. during the trains-agent setup within the docker container), and therefore again it couldn't go back to the main trains-agent. But ps -aux
shows that the trains-agent is stuck running the first experiment, not the second...
and the agent says agent.cudnn_version = 0
Oh yes, this could work as well, thanks AgitatedDove14!
Thanks for clarifying! Maybe this could be made explicit in the agent logs of the experiments, with something like the following?
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...
Sorry, I didn't get that
From the answers I saw on the internet, it is most likely related to a mismatch of cuda/cudnn versions
What happens is a different error, but it was so weird that I thought it was related to the installed version
Ok, so it seems that the single quote is the reason: using double quotes works
That's why I suspected trains was installing a different version than the one I expected
I would probably leave it to the ClearML team to answer you; I am not using the UI app, and for me it worked just fine with different regions. Maybe check the permissions of the key/secret?
I did that recently - what are you trying to do exactly?