Is there any known issue with Amazon SageMaker and ClearML?
On the contrary, it actually works better on SageMaker...
Here is what I did on SageMaker:
- created a new SageMaker instance
- opened Jupyter Notebook
- started a new notebook (conda_python3 / conda_py3_pytorch)
Then I just did "!pip install clearml" and Task.init
Is there any difference ?
So, I was also able to make it work on Google Colab but not on SageMaker. Is there any known issue with Amazon SageMaker and ClearML? Otherwise, I will clean up all the installed packages and retry.
DeterminedToad86 were you running a jupyter notebook or a jupyter console ?
Yes I thought so, since it works without issues in sagemaker:
I have not set the --docker flag when running the agent; I just ran it the default way: "clearml-agent daemon --queue default"
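(For completeness, docker mode would just mean adding the --docker flag, optionally followed by a default image, so the agent runs each task inside a container instead of a per-task virtualenv:

clearml-agent daemon --queue default --docker

Without it, as here, the agent resolves and installs packages into a virtual environment.)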
I mean clone the Task in the UI (right click -> Clone), then go to the Execution tab, to the "installed packages" section, then click on Edit -> go to the torchvision http link, replace it with torchvision==0.7.0, and save.
Then enqueue the Task (to the default queue) and see if the Agent can run it.
DeterminedToad86 Make sense ?
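To illustrate, the edit in the "installed packages" section would look roughly like this, swapping the SageMaker-specific wheel for the plain PyPI release:

# before: SageMaker-specific CUDA 11 build
torchvision @ https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
# after: standard release the agent can resolve from PyPI
torchvision==0.7.0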
LOL, okay, I'm not sure we can do something about that one.
You should probably increase the storage on your instance 🙂
BTW: from the instance name it seems like it is a VM with preinstalled PyTorch. Why don't you enable system site packages, so the venv will inherit all the preinstalled packages? It might also save some space 🙂
DeterminedToad86 see here:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L55
Change it in the agent's conf file to: system_site_packages: true
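In clearml.conf this setting lives under the agent.package_manager section, i.e. something like:

agent {
    package_manager {
        # let the venv inherit packages preinstalled on the VM image
        system_site_packages: true,
    }
}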
We are using PyTorch's train_one_epoch and evaluate functions, for which we had to explicitly copy the engine.py torch code into the directory of our notebook. So the notebook is referencing this file: "from engine import train_one_epoch, evaluate". Could this be an issue?
- Could you explain how I can reproduce the missing jupyter notebook issue (i.e. the ipykernel_launcher.py)?
I get the missing notebook problem when I run the following code in an Amazon SageMaker Notebook (Python 3.6):

!pip install clearml

from clearml import Task
task_train = Task.init(project_name='Train Task',
                       task_name='Train Task')
This is exactly what I did here, and it is working 😞
https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution
DeterminedToad86 I suspect that since it was executed on SageMaker it registered a specific package that is unique to SageMaker (not to worry, installed packages can be edited after you clone/reset the Task)
Nicely done DeterminedToad86 🙂
Wasn't this issue resolved by torch?
I followed these steps; unfortunately the task failed due to "no space left on device".
Do you mean manually via the UI, or do I need to put torchvision==0.7.0 in my requirements.txt and rerun the task in SageMaker?
It's clearml-0.17.4. We did a "pip install clearml" in our notebook without providing a specific version.
Okay that means it is running in virtual environment mode.
On the original Task (the one you enqueued) what were the installed packages (specifically the torch/torchvision) ?
DeterminedToad86
Yes, I think this is the issue: on SageMaker a specific compiled version of torchvision was installed (probably part of the image)
Edit the Task (before enqueuing) and change the torchvision URL to: torchvision==0.7.0
Let me know if it worked
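If you prefer doing the same from code instead of the UI, here is a minimal sketch. It assumes a clearml version that exposes Task.set_packages (newer than the 0.17.4 mentioned here); '<original_task_id>' is a placeholder for your own task's ID:

from clearml import Task

# clone the original task, override its requirements, and enqueue the clone
cloned = Task.clone(source_task='<original_task_id>', name='Train Task (torchvision fix)')
cloned.set_packages(['torch==1.6.0', 'torchvision==0.7.0'])
Task.enqueue(cloned, queue_name='default')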
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log: Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
But torchvision is downloaded from the cuda 11 folder...
I think there is a mismatch between these two versions.
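A quick way to check whether the installed pair is consistent is to print the versions inside the environment (torch.version.cuda reports which CUDA toolkit the wheel was built against):

import torch
import torchvision

# a matching cu102 pair would print e.g.: 1.6.0 0.7.0 10.2
print(torch.__version__, torchvision.__version__, torch.version.cuda)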
Is the agent running in docker mode? If so, could it be torch is preinstalled on the system, and it should use it?
Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18
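In clearml.conf these are the git_user / git_pass entries under the agent section; the values below are placeholders:

agent {
    # credentials the agent uses when cloning the repository
    git_user: "my-git-user"
    git_pass: "my-git-token"
}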
Yes that is basically it, except our notebook kernel is called "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)". I will retry tomorrow, thanks for your help so far. My other question was about the clearml agent cloning the repo when running the optimization experiments. Would it suffice to provide the git credentials in the clearml config file, or are there additional steps necessary for the agent to correctly clone the repo?
Hi DeterminedToad86
I just verified on a clean SageMaker instance everything should just work, see here: https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution Yes, if you have more than one file (either notebook or python script) then you must have a git repo in order to run the task using the Agent.
Hello AgitatedDove14 So we put all our code into .py files and clearml was able to recognize the training files and clone them. Now we have encountered another issue in the optimization experiment, regarding pytorch and clearml-agent: "Run time error: Object has no attribute nms". This seems to be a torchvision installation issue, where apparently the compiled .so files cannot be found: https://gitmemory.com/issue/pytorch/vision/2239/637896499 . Can we do something about it? Shouldn't pytorch be correctly installed by the agent automatically?
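For reference, a minimal snippet that exercises the compiled torchvision op; it raises this kind of "no attribute nms" error when the compiled extension (.so) files are missing or built against a mismatched torch:

import torch
from torchvision.ops import nms

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.]])
scores = torch.tensor([0.9, 0.8])
# nms dispatches to the compiled C++/CUDA operator under the hood
print(nms(boxes, scores, iou_threshold=0.5))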
Hmm that is odd, it seems to have missed the fact this is a jupyter notebook.
What's the clearml version you are using ?
So these were the installed packages (related to torch/torchvision) in the training task
AgitatedDove14 Could this be a clearml agent related issue?