Is there any known issue with Amazon SageMaker and ClearML?
On the contrary, it actually works better on SageMaker...
Here is what I did on SageMaker:
- created a new SageMaker instance
- opened Jupyter Notebook
- started a new notebook (conda_python3 / conda_py3_pytorch)
Then I just did "!pip install clearml" and Task.init
Is there any difference ?
So, I was also able to make it work on Google Colab but not on SageMaker. Is there any known issue with Amazon SageMaker and ClearML? Otherwise, I will clean up all the installed packages and retry.
DeterminedToad86 were you running a jupyter notebook or a jupyter console ?
Yes I thought so, since it works without issues in sagemaker:
I have not set the --docker flag when running the agent; I just ran it the default way: "clearml-agent daemon --queue default"
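(For completeness, docker mode would just mean adding the --docker flag, optionally followed by a default image, so the agent runs each task inside a container instead of a per-task virtualenv:

clearml-agent daemon --queue default --docker

Without it, as here, the agent resolves and installs packages into a virtual environment.)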
I mean clone the Task in the UI (right click -> Clone), then go to the Execution tab, to the "installed packages" section, then click on Edit -> go to the torchvision http link, replace it with torchvision==0.7.0, and save.
Then enqueue the Task (to the default queue) and see if the Agent can run it.
DeterminedToad86 Make sense ?
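To illustrate, the edit in the "installed packages" section would look roughly like this, swapping the SageMaker-specific wheel for the plain PyPI release:

# before: SageMaker-specific CUDA 11 build
torchvision @ https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
# after: standard release the agent can resolve from PyPI
torchvision==0.7.0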
LOL, okay, I'm not sure we can do something about that one.
You should probably increase the storage on your instance 🙂
BTW: from the instance name it seems like it is a VM with preinstalled PyTorch. Why don't you enable system site packages, so the venv will inherit all the preinstalled packages? It might also save some space 🙂
DeterminedToad86 see here:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L55
Change it in the agent's conf file to: system_site_packages: true
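In clearml.conf this setting lives under the agent.package_manager section, i.e. something like:

agent {
    package_manager {
        # let the venv inherit packages preinstalled on the VM image
        system_site_packages: true,
    }
}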
We are using PyTorch's train_one_epoch and evaluate functions, for which we had to explicitly copy the engine.py torch code into the directory of our notebook. So the notebook is referencing this file: "from engine import train_one_epoch, evaluate". Could this be an issue?
- Could you explain how I can reproduce the missing jupyter notebook issue (i.e. the ipykernel_launcher.py)?
I get the missing notebook problem when I run the following code in an Amazon SageMaker Notebook (Python 3.6):

!pip install clearml

from clearml import Task
task_train = Task.init(project_name='Train Task',
                       task_name='Train Task')
This is exactly what I did here, and it is working 😞
https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution
DeterminedToad86 I suspect that since it was executed on SageMaker it registered a specific package that is unique to SageMaker (not to worry, installed packages can be edited after you clone/reset the Task)
Nicely done DeterminedToad86 🙂
Wasn't this issue resolved by torch?
I followed these steps; unfortunately the task failed due to "no space left on device".
Do you mean manually via the UI, or do I need to put torchvision==0.7.0 in my requirements.txt and rerun the task in SageMaker?
It's clearml-0.17.4. We did a "pip install clearml" in our notebook without providing a specific version.
Okay that means it is running in virtual environment mode.
On the original Task (the one you enqueued) what were the installed packages (specifically the torch/torchvision) ?
DeterminedToad86
Yes, I think this is the issue: on SageMaker a specific compiled version of torchvision was installed (probably part of the image)
Edit the Task (before enqueuing) and change the torchvision URL to: torchvision==0.7.0
Let me know if it worked
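If you prefer doing the same from code instead of the UI, here is a minimal sketch. It assumes a clearml version that exposes Task.set_packages (newer than the 0.17.4 mentioned here); '<original_task_id>' is a placeholder for your own task's ID:

from clearml import Task

# clone the original task, override its requirements, and enqueue the clone
cloned = Task.clone(source_task='<original_task_id>', name='Train Task (torchvision fix)')
cloned.set_packages(['torch==1.6.0', 'torchvision==0.7.0'])
Task.enqueue(cloned, queue_name='default')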
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log: Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
But torchvision is downloaded from the cuda 11 folder...
I think there is a mismatch between these two versions.
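A quick way to check whether the installed pair is consistent is to print the versions inside the environment (torch.version.cuda reports which CUDA toolkit the wheel was built against):

import torch
import torchvision

# a matching cu102 pair would print e.g.: 1.6.0 0.7.0 10.2
print(torch.__version__, torchvision.__version__, torch.version.cuda)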
Is the agent running in docker mode? If so, could it be torch is preinstalled on the system, and it should use it?
Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18
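In clearml.conf these are the git_user / git_pass entries under the agent section; the values below are placeholders:

agent {
    # credentials the agent uses when cloning the repository
    git_user: "my-git-user"
    git_pass: "my-git-token"
}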
Yes that is basically it, except our notebook kernel is called "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)". I will retry tomorrow, thanks for your help so far. My other question was about the clearml agent cloning the repo when running the optimization experiments. Would it suffice to provide the git credentials in the clearml config file, or are there additional steps necessary for the agent to correctly clone the repo?
Hi DeterminedToad86
I just verified on a clean SageMaker instance everything should just work, see here: https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution Yes, if you have more than one file (either notebook or python script) then you must have a git repo in order to run the task using the Agent.
Hello AgitatedDove14 So we put all our code into .py files and clearml was able to recognize the training files and clone them. Now we have encountered another issue in the optimization experiment, regarding pytorch and clearml-agent: "Run time error: Object has no attribute nms". This seems to be a torchvision installation issue, where apparently the compiled .so files cannot be found: https://gitmemory.com/issue/pytorch/vision/2239/637896499 . Can we do something about it? Shouldn't pytorch be correctly installed by the agent automatically?
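For reference, a minimal snippet that exercises the compiled torchvision op; it raises this kind of "no attribute nms" error when the compiled extension (.so) files are missing or built against a mismatched torch:

import torch
from torchvision.ops import nms

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.]])
scores = torch.tensor([0.9, 0.8])
# nms dispatches to the compiled C++/CUDA operator under the hood
print(nms(boxes, scores, iou_threshold=0.5))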
Hmm that is odd, it seems to have missed the fact this is a jupyter notebook.
What's the clearml version you are using ?
So these were the installed packages (related to torch/torchvision) in the training task
AgitatedDove14 Could this be a clearml agent related issue?