RobustRat47 It can also simply be that the instance type you declared is not available in the zone you defined
Try to spin up the instance of that type manually in that region to see if it is available
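For example, something like this boto3 check would tell you whether the type is offered in a given zone (just a sketch; the region, zone and instance type below are placeholders, not values from this thread):

    import boto3

    # Placeholders: replace with the region/zone/instance type configured in the autoscaler
    ec2 = boto3.client("ec2", region_name="eu-west-1")
    offerings = ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[
            {"Name": "instance-type", "Values": ["g4dn.xlarge"]},
            {"Name": "location", "Values": ["eu-west-1a"]},
        ],
    )
    # An empty list means that instance type is not offered in that availability zone
    print(offerings["InstanceTypeOfferings"])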
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent mentioned above used the output from nvcc (2) before checking the nvidia driver version (1)
amazon linux
Oh I see, I think we are now touching a very important point:
I thought that torch wheels already included the cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries extracted from the torch wheels are used. Is this correct? If not, does that mean that one should use conda to install the correct cudatoolkit (cuda/cudnn)?
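For reference, this is how I would check what the wheel itself bundles vs. what still comes from the system (untested sketch):

    import torch

    print(torch.__version__)               # e.g. 1.3.1
    print(torch.version.cuda)              # CUDA runtime the wheel ships with / was built against
    print(torch.backends.cudnn.version())  # cuDNN bundled with the wheel
    # a working NVIDIA *driver* on the system is still required:
    print(torch.cuda.is_available())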
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers installed on the system
yes, that's also what I thought
Not really: I just need to find the one that is compatible with torch==1.3.1
Nevermind, I just saw report_matplotlib_figure
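Something along these lines (sketch; depending on the version the package is `trains` or `clearml`, project/task names here are placeholders):

    import matplotlib.pyplot as plt
    from clearml import Task  # `from trains import Task` on older versions

    task = Task.init(project_name="examples", task_name="matplotlib report")

    fig = plt.figure()
    plt.plot([1, 2, 3], [4, 5, 6])

    # report the figure explicitly instead of relying on automatic capture
    task.get_logger().report_matplotlib_figure(
        title="my plot", series="series A", figure=fig, iteration=0
    )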
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
Yes I agree, but I get a strange error when using dataloaders:

    RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error
only when I use num_workers > 0
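The usual workaround I found for this is to use the 'spawn' start method so the workers don't inherit an already-initialized CUDA context (untested sketch with a dummy dataset):

    import torch
    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader, TensorDataset

    if __name__ == "__main__":
        # fork-ed workers inheriting a CUDA context can raise the context_gpu.cu
        # "initialization error"; spawning fresh worker processes avoids it
        mp.set_start_method("spawn", force=True)

        dataset = TensorDataset(torch.randn(100, 3))
        loader = DataLoader(dataset, batch_size=10, num_workers=2)
        for (batch,) in loader:
            if torch.cuda.is_available():
                batch = batch.cuda(non_blocking=True)  # move to GPU in the main process only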
Probably 6. I think that for some reason it did not go back to the main trains-agent. I am not sure though, because a second task could start. It could also be that the second task was aborted for some reason while installing the task requirements (not the system requirements, i.e. while executing the trains-agent setup within the docker container) and therefore again it couldn't go back to the main trains-agent. But ps -aux
shows that the trains-agent is stuck running the first experiment, not the second...
and the agent says agent.cudnn_version = 0
Oh yes, this could work as well, thanks AgitatedDove14 !
thanks for clarifying! Maybe this could be made explicit in the agent logs of the experiments with something like the following?

    agent.cuda_driver_version = ...
    agent.cuda_runtime_version = ...
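To be clear, those two keys are just my suggestion, not existing agent settings; this is roughly how the two numbers could be gathered (sketch):

    import subprocess
    import torch

    # (1) driver version, reported by the NVIDIA kernel driver
    driver = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"], text=True
    ).strip()
    # (2) runtime version, here taken from the torch wheel instead of nvcc
    runtime = torch.version.cuda

    print(f"agent.cuda_driver_version  = {driver}")
    print(f"agent.cuda_runtime_version = {runtime}")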
Sorry, I didn't get that 🙂
From the answers I saw on the internet, it is most likely related to the mismatch of cuda/cudnn version
What happens is a different error, but it was so weird that I thought it was related to the version installed
Ok so it seems that the single quote is the reason, using double quotes works
That's why I suspected trains was installing a different version than the one I expected
I would probably leave it to the ClearML team to answer you; I am not using the UI app and for me it worked just fine with different regions. Maybe check the permissions of the key/secret?
I did that recently - what are you trying to do exactly?
I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
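i.e. something like this in the last step (sketch; `controller_task_id` would be the parameter passed down from the pipeline):

    from clearml import Task

    controller_task_id = "<pipeline-controller-task-id>"  # passed to the step as a parameter
    controller = Task.get_task(task_id=controller_task_id)

    # fetch local copies of everything the controller task uploaded
    for name, artifact in controller.artifacts.items():
        local_path = artifact.get_local_copy()
        print(name, "->", local_path)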
Hi AgitatedDove14, here is the full log.
Both python versions (local and remote) are python 3.6
Locally (macos), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it's not really related to clearml-agent, but rather to pip not finding a proper wheel for ubuntu for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
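One thing I might try (an assumption on my side, not sure it is the intended way): ask the agent to install pytorch3d from the git repo so the wheel gets built on the remote machine, e.g. with Task.add_requirements before Task.init:

    from clearml import Task

    # assumption: add_requirements accepts a pip requirement line, so passing the
    # git URL should make the agent build pytorch3d from source on the remote machine
    Task.add_requirements("git+https://github.com/facebookresearch/pytorch3d.git")
    task = Task.init(project_name="examples", task_name="pytorch3d from source")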
yea I just realized that you would also need to specify different subnets, etc… not sure how easy it is 🙂 But it would be very valuable, on-demand GPU instances are so hard to spin up nowadays in aws
Yea, so I assume that training my models inside docker will be slightly slower, so I'd like to avoid it. For everything else, using docker is convenient
How about the overhead of running the training on docker on a VM?
I am doing:

    try:
        score = get_score_for_task(subtask)
    except Exception:
        score = pd.NA
    finally:
        df_scores = df_scores.append(dict(task=subtask.id, score=score), ignore_index=True)
        task.upload_artifact("metric_summary", df_scores)
ha wait, I removed the http://
in the host and it worked 🙂