I did not set the --docker flag when running the agent; I ran it the default way: "clearml-agent daemon --queue default"
So, I was also able to make it work on Google Colab but not on SageMaker. Is there any known issue with Amazon SageMaker and ClearML? Otherwise, I will clean up all the installed packages and retry.
AgitatedDove14 Could this be a clearml agent related issue?
Yes this worked. Thank you very much
Hello AgitatedDove14 So we put all our code into .py files, and ClearML was able to recognize the training files and clone them. Now we have encountered another issue in the optimization experiment, involving PyTorch and clearml-agent: "Run time error: Object has no attribute nms". This seems to be a torchvision installation issue, where apparently the compiled .so files cannot be found: https://gitmemory.com/issue/pytorch/vision/2239/637896499 . Can we do something about it? Shouldn't pyt...
Do you mean manually over the UI, or do I need to put torchvision==0.7.0 in my requirements.txt and rerun the task in SageMaker?
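Since the question is which torchvision release to pin, a quick sanity check of the torch/torchvision pair may help before rerunning the task. The following is only a minimal sketch: the lookup table is hand-written and deliberately partial, and the helper names (`base_version`, `expected_torchvision`, `is_matching_pair`) are hypothetical, not part of ClearML or PyTorch.

```python
from typing import Optional

# Partial, hand-written table of matching releases (an assumption of this
# sketch; extend it for other versions as needed).
TORCH_TO_TORCHVISION = {
    "1.5.0": "0.6.0",
    "1.6.0": "0.7.0",
    "1.7.0": "0.8.0",
}


def base_version(version: str) -> str:
    """Strip a local build suffix such as '+cu101' from a version string."""
    return version.split("+", 1)[0]


def expected_torchvision(torch_version: str) -> Optional[str]:
    """Return the torchvision release listed for this torch release, if any."""
    return TORCH_TO_TORCHVISION.get(base_version(torch_version))


def is_matching_pair(torch_version: str, torchvision_version: str) -> bool:
    """True only when the table lists this exact torchvision for this torch."""
    expected = expected_torchvision(torch_version)
    return expected is not None and base_version(torchvision_version) == expected
```

In the notebook or in the training script, one could pass `torch.__version__` and `torchvision.__version__` to `is_matching_pair` and log the result; a mismatch would be one plausible reason for the compiled ops (such as nms) not being found.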
So these were the installed packages (related to torch/torchvision) in the training task
Yes, that is basically it, except our notebook kernel is called "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)". I will retry tomorrow; thanks for your help so far. My other question was about the clearml agent cloning the repo when running the optimization experiments. Would it suffice to provide the git credentials in the clearml config file, or are additional steps necessary for the agent to correctly clone the repo?
I get the missing notebook problem when I run the following code in an Amazon SageMaker notebook (Python 3.6):
!pip install clearml
from clearml import Task
task_train = Task.init(project_name='Train Task',
                       task_name='Train Task')
Yes I thought so, since it works without issues in sagemaker:
We are using PyTorch's train_one_epoch and evaluate functions, for which we had to explicitly copy the engine.py torch code into the directory of our notebook. So the notebook references this file: "from engine import train_one_epoch, eval". Could this be an issue?
I followed these steps; unfortunately, the task failed due to "no space left on device".
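To narrow down a "no space left on device" failure, it can help to check how much free space is left on the filesystems involved before the task starts. This is a rough sketch using only the standard library; the 5 GiB threshold and the choice of the home and temp directories are assumptions for illustration, not values taken from ClearML.

```python
import os
import shutil
import tempfile


def free_gib(path: str = ".") -> float:
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str = ".", required_gib: float = 5.0) -> bool:
    """True when at least `required_gib` GiB are free (threshold is arbitrary)."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # Home and temp directories are guesses at where caches and
    # environments might be written; adjust to the actual setup.
    for location in (os.path.expanduser("~"), tempfile.gettempdir()):
        status = "ok" if has_room(location) else "low"
        print(f"{location}: {free_gib(location):.1f} GiB free ({status})")
```

If one of the locations reports little free space, clearing cached packages or pointing caches at a larger volume would be the next thing to try.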
It's clearml-0.17.4. We did a "pip install clearml" in our notebook without providing a specific version.