So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log:
Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0But torchvision is downloaded from the cuda 11 folder...
I think there is a mismatch between these two versions.
Is the agent running in docker mode? If so, could it be torch is preinstalled on the system, and it should use it?
Yes that is basically it, except our notebook kernel is called "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)". I will retry tomorrow, thanks for your help so far. My other questions was about the clearml agent cloning the repo, when running the optimization experiments. Would it suffice to provide the git credentials in the clearml config file or are there additional steps necessary, for the agent to correctly clone the repo?
This is exactly what I did here, and it is working 😞
Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
BTW: from the instance name it seems like it is a VM with preinstalled pytorch, why don't you add system site packages, so the venv will inherit all the preinstalled packages, it might also save some space 🙂
DeterminedToad86 see here:
Change it on the agent's conf file to:
I mean clone the Task in the UI (right click Clone), then go to the execution Tab, to the "installed packages" section, then click on Edit -> go to the torchvision http link, and replace it with
torchvision == 0.7.0 and save.
Then right enqueue the Task (to the default queue) and see if the Agent can run it,
DeterminedToad86 Make sense ?
. Is there any known issue with amazon sagemaker and ClearML
On the contrary it actually works better on Sagemaker...
Here is what I did on sage maker, created:
created a new sagemaker instance opened jupyter notebook Started a new notebook conda_python3 / conda_py3_pytorchIn then I just did "!pip install clearml" and Task.init
Is there any difference ?
I just verified on a clean sagemaker instance everything should just work, see here: https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution Yes if you have more than one file (either notebook or python script) than you must have a git repo, in order to run the task using the Agent.
We are using pytorchs train_one_epoch and evaluate function, for which we had to explicitly copy the engine.py torch code in the directory of our notebook. So the notebook is referencing this file "from engine import train_one_epoch, eval". Could this be an issue?
Yes I think this is the issue, on SageMaker a specific compiled version of torchvision was installed (probably part of the image)
Edit the Task (before enqueuing) and change the torchvision URL to:
torchvision==0.7.0Let me know if it worked
Hello AgitatedDove14 So we put all our code into .py files and clearml was able to recognize the training files and clone them. Now, we have encountered another issue in the optimization experiment, regarding pytorch and clearml-agent: "Run time error: Object has no attribute nms". This seems to be a torchvision installation issue, where apparantly the compiled .so files cannot be found: https://gitmemory.com/issue/pytorch/vision/2239/637896499 . Can we do something about it? Shouldn't pytorch be correctly installed by the agent automatically?