Answered
Hello, We Are Currently Working On A Hyperparameter Tuning Job For Object Detection Following This Tutorial

Hello, we are currently working on a hyperparameter tuning job for object detection, following this tutorial: https://allegro.ai/docs/examples/frameworks/pytorch/notebooks/image/hyperparameter_search/ . We have set up two Jupyter notebooks in Amazon SageMaker, one containing the training code and one containing the optimization code. We also initialized the ClearML agent, but we get the error "clearml_agent: ERROR: Failed cloning repository" when running the experiment. Does ClearML only work with a git repository, or also with Jupyter notebooks? And if so, how would we connect our git repo, containing the code, with ClearML? Might be a naive question, but this is our first time using ClearML ;D

  
  
Posted 3 years ago

Answers 30


DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl

See in the log:
Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
But torchvision is downloaded from the cuda 11 folder...
I think there is a mismatch between these two versions.
Is the agent running in docker mode? If so, could it be torch is preinstalled on the system, and it should use it?
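
The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of the agent: `cuda_tag` is a hypothetical helper that pulls the CUDA tag out of a wheel URL, assuming the tag is spelled either `cuXYZ` or `cuda-X-Y` in the path, as it is in the two URLs from the log.

```python
import re
from typing import Optional

# The two wheel URLs reported in the agent log.
TORCH_URL = "https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl"
TORCHVISION_URL = (
    "https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/"
    "torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl"
)


def cuda_tag(url: str) -> Optional[str]:
    """Return the CUDA version digits found in a wheel URL, or None.

    Hypothetical helper: it only understands the two spellings seen in the log.
    """
    match = re.search(r"/cu(\d+)/", url)
    if match:
        return match.group(1)
    match = re.search(r"/cuda-(\d+)-(\d+)/", url)
    if match:
        return match.group(1) + match.group(2)
    return None


torch_cuda = cuda_tag(TORCH_URL)
vision_cuda = cuda_tag(TORCHVISION_URL)
if torch_cuda != vision_cuda:
    print(f"CUDA mismatch: torch built for {torch_cuda}, torchvision for {vision_cuda}")
```

If the two tags differ, the compiled torchvision extensions were probably built against a different CUDA runtime than torch, which would be consistent with the missing `nms` operator reported in this thread.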

  
  
Posted 3 years ago

Hello AgitatedDove14. So we put all our code into .py files, and clearml was able to recognize the training files and clone them. Now we have encountered another issue in the optimization experiment, regarding pytorch and clearml-agent: "Run time error: Object has no attribute nms". This seems to be a torchvision installation issue, where apparently the compiled .so files cannot be found: https://gitmemory.com/issue/pytorch/vision/2239/637896499 . Can we do something about it? Shouldn't pytorch be correctly installed by the agent automatically?

  
  
Posted 3 years ago

Hi DeterminedToad86
I just verified on a clean sagemaker instance, everything should just work, see here: https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution
Yes, if you have more than one file (either notebook or python script) then you must have a git repo in order to run the task using the Agent.

  
  
Posted 3 years ago

Yes, that is basically it, except our notebook kernel is called "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)". I will retry tomorrow, thanks for your help so far. My other question was about the clearml agent cloning the repo when running the optimization experiments. Would it suffice to provide the git credentials in the clearml config file, or are there additional steps necessary for the agent to correctly clone the repo?

  
  
Posted 3 years ago

Would it suffice to provide the git credentials ...

That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18

  
  
Posted 3 years ago

BTW: from the instance name it seems like it is a VM with preinstalled pytorch. Why don't you add system site packages, so the venv will inherit all the preinstalled packages? It might also save some space 🙂
DeterminedToad86 see here:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L55
Change it on the agent's conf file to:
system_site_packages: true
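
What this setting does can be illustrated with Python's standard `venv` module. This is only an analogy, under the assumption that the agent's flag behaves like the standard `system_site_packages` option of a virtual environment. It does not show how clearml-agent builds its environments internally.

```python
import tempfile
import venv
from pathlib import Path


def venv_config(system_site_packages: bool) -> str:
    """Create a throwaway virtual environment and return its pyvenv.cfg text."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        # with_pip=False keeps the example fast; no packages are installed.
        venv.create(env_dir, system_site_packages=system_site_packages, with_pip=False)
        return (env_dir / "pyvenv.cfg").read_text()


isolated = venv_config(system_site_packages=False)
inheriting = venv_config(system_site_packages=True)

# The second environment can import packages already installed system-wide
# (for example a preinstalled torch), so they do not have to be downloaded again.
print("include-system-site-packages = true" in inheriting)
```

The trade-off is reproducibility: an environment that inherits system packages depends on whatever happens to be installed on that machine.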

  
  
Posted 3 years ago

Is there any known issue with amazon sagemaker and ClearML?

On the contrary it actually works better on Sagemaker...

Here is what I did on sagemaker:
  1. Created a new sagemaker instance
  2. Opened jupyter notebook
  3. Started a new notebook conda_python3 / conda_py3_pytorch

Then I just did "!pip install clearml" and Task.init
Is there any difference?

  
  
Posted 3 years ago

We are using PyTorch's train_one_epoch and evaluate functions, for which we had to explicitly copy the engine.py torch code into the directory of our notebook. So the notebook is referencing this file: "from engine import train_one_epoch, eval". Could this be an issue?

  
  
Posted 3 years ago

Nicely done DeterminedToad86 🙂
Wasn't this issue resolved by torch?

  
  
Posted 3 years ago

Okay that means it is running in virtual environment mode.
On the original Task (the one you enqueued) what were the installed packages (specifically the torch/torchvision) ?
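
One way to answer this from the notebook side is to print the versions in the environment where the Task was created. The following is a minimal sketch: `installed_version` is a hypothetical helper, and on the Python 3.6 kernel mentioned in this thread it assumes the `importlib_metadata` backport is installed. The Task's own "installed packages" section remains the authoritative record.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # older interpreters, e.g. a Python 3.6 kernel
    import importlib_metadata as metadata  # assumption: the backport is installed


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Collect the two packages the question asks about.
report = {name: installed_version(name) for name in ("torch", "torchvision")}
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```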

  
  
Posted 3 years ago

I have not set the --docker flag when running the agent, ran it just the default way "clearml-agent daemon --queue default"

  
  
Posted 3 years ago

Yes I thought so, since it works without issues in sagemaker:

  
  
Posted 3 years ago

DeterminedToad86 I suspect that since it was executed on sagemaker, it registered a specific package that is unique to Sagemaker (not to worry, installed packages can be edited after you clone/reset the Task).

  
  
Posted 3 years ago

I get the missing notebook problem when I run the following code in an Amazon Sagemaker notebook (Python 3.6):
!pip install clearml
from clearml import Task
task_train = Task.init(project_name='Train Task',
                       task_name='Train Task')

  
  
Posted 3 years ago

DeterminedToad86 were you running a jupyter notebook or a jupyter console ?

  
  
Posted 3 years ago

Do you mean manually over the UI or do I need to put torchvision == 0.7.0 in my requirements.txt and rerun the task in sagemaker ?

  
  
Posted 3 years ago

So these were the installed packages (related to torch/torchvision) in the training task

  
  
Posted 3 years ago

I mean clone the Task in the UI (right click Clone), then go to the execution tab, to the "installed packages" section, then click on Edit -> go to the torchvision http link, replace it with torchvision == 0.7.0 and save.
Then enqueue the Task (to the default queue) and see if the Agent can run it.
DeterminedToad86 Make sense?
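
The text edit itself can be sketched in a few lines. `pin_package` below is a hypothetical helper that mirrors the manual change in the "installed packages" box; it is not a ClearML API. The sample list is an assumption built from the versions and URLs mentioned in this thread.

```python
import re


def pin_package(requirements: str, name: str, version: str) -> str:
    """Replace any requirement line that mentions `name` with `name == version`.

    Hypothetical helper mirroring the manual edit; it is not a ClearML API.
    """
    # Match the package name as a whole word so "torch" does not hit "torchvision".
    pattern = re.compile(rf"(?<![\w-]){re.escape(name)}(?![\w])", re.IGNORECASE)
    pinned = []
    for line in requirements.splitlines():
        pinned.append(f"{name} == {version}" if pattern.search(line) else line)
    return "\n".join(pinned)


TORCH_URL = "https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl"
TORCHVISION_URL = (
    "https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/"
    "torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl"
)

# Assumed contents of the "installed packages" section before the edit.
sample = "\n".join(["clearml == 0.17.4", TORCH_URL, TORCHVISION_URL])

edited = pin_package(sample, "torchvision", "0.7.0")
print(edited)
```

Only the torchvision line changes. The torch URL is left alone, so the agent can still resolve a torch build on its own.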

  
  
Posted 3 years ago

DeterminedToad86
Yes I think this is the issue, on SageMaker a specific compiled version of torchvision was installed (probably part of the image)
Edit the Task (before enqueuing) and change the torchvision URL to:
torchvision==0.7.0
Let me know if it worked

  
  
Posted 3 years ago

LOL, okay, I'm not sure we can do something about that one.
You should probably increase the storage on your instance 🙂

  
  
Posted 3 years ago

I followed these steps, unfortunately the task failed, due to "no space left on device".

  
  
Posted 3 years ago

Yey!
My pleasure 🙂

  
  
Posted 3 years ago

AgitatedDove14 Could this be a clearml agent related issue?

  
  
Posted 3 years ago

Hmm, that is odd, it seems to have missed the fact that this is a jupyter notebook.
What's the clearml version you are using ?

  
  
Posted 3 years ago

It's clearml-0.17.4. We did a "pip install clearml" in our notebook without providing a specific version.

  
  
Posted 3 years ago

  1. Could you explain how I can reproduce the missing jupyter notebook (i.e. the ipykernel_launcher.py)?
  
  
Posted 3 years ago

give me a minute to test

  
  
Posted 3 years ago

Yes this worked. Thank you very much

  
  
Posted 3 years ago

So, I was also able to make it work on google colab but not on sagemaker. Is there any known issue with amazon sagemaker and ClearML? Otherwise, I will clean up all the installed packages and retry again.

  
  
Posted 3 years ago