Hi All, I Use Autoscalers In My Training Configuration, And I Have An Issue With Them.

Answered

Hi all,
I use autoscalers in my training configuration, and I have an issue with them.

The issue:
Currently, I cannot configure an autoscaler that successfully launches a training agent.

My configuration:
Although the autoscaler configuration asks for a "base docker image", in the past it was possible to leave it empty in order to run the training directly on the EC2 image.
Now, when I try to configure a new autoscaler, it requires a base Docker image.

When I enter a space instead, it launches but fails with: "Unable to find image 'nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04' locally",
which looks like it chose a Docker image by itself.
Can anyone assist me with that?

Thanks

  				
Posted one year ago by QuaintOwl32

Answers 15

I doubt that would be possible, because it looks like the autoscaler versions are global.
As a quick workaround, you can launch the open source autoscaler until the no-Docker capability is available again.

  				
Posted one year ago by CostlyOstrich36

Great! Thanks a lot!

  				
Posted one year ago by QuaintOwl32

This is an urgent issue for me, as it broke my training flow.

  				
Posted one year ago by QuaintOwl32

Is it possible to relaunch my old autoscaler as it was, at least until support for the no-Docker configuration is back? I don't mind if you do it, @HungryCat90.

  				
Posted one year ago by QuaintOwl32

That, or a private Docker registry.
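
On the Docker Hub versus private registry question, it may help to see that the registry is encoded in the image name itself. The following is a minimal sketch, assuming the usual Docker naming convention: the first path component is treated as a registry host if it contains a dot or a colon, or is `localhost`; otherwise the image is looked up on Docker Hub. The function name `split_image_reference` and the `registry.example.com/...` image are hypothetical, used only for illustration.

```python
def split_image_reference(image: str) -> dict:
    """Split a Docker image reference into registry, repository, and tag.

    Illustrative sketch only: digests (the part after '@') are dropped,
    and no validation of the individual parts is performed.
    """
    # Ignore an optional digest such as "@sha256:...".
    name, _, _digest = image.partition("@")

    # The first path component is a registry host only if it looks like one.
    first, slash, rest = name.partition("/")
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        # No explicit registry: Docker falls back to Docker Hub.
        registry, remainder = "docker.io", name

    # The tag follows the last ':' of what is left; default to "latest".
    repository, colon, tag = remainder.rpartition(":")
    if not colon:
        repository, tag = remainder, "latest"

    return {"registry": registry, "repository": repository, "tag": tag}


if __name__ == "__main__":
    for ref in (
        "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04",   # image from the error message
        "nvcr.io/nvidia/pytorch:23.03-py3",              # image mentioned in this thread
        "registry.example.com/team/training-env:1.0",    # hypothetical private registry
    ):
        print(ref, "->", split_image_reference(ref))
```

Under this convention, an image pushed to a private registry is referenced by its full name including the registry host. Whether the autoscaler's instances can authenticate to that registry is a separate matter that this sketch does not cover.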

  				
Posted one year ago by CostlyOstrich36

Will my code run inside this Docker container? If so, it won't work, as my environment is on the host Linux system.

  				
Posted one year ago by QuaintOwl32

Is there a workaround in the meantime?

  				
Posted one year ago by QuaintOwl32

Of course, but in my case it's very complicated to create this image.

  				
Posted one year ago by QuaintOwl32

Hi @QuaintOwl32, support for this option was temporarily removed but will be added back soon. We'll update here.

  				
Posted one year ago by SuccessfulKoala55

My AWS image is configured to support my training. Since Docker is isolated from the host system, my training will not work inside it.

  				
Posted one year ago by QuaintOwl32

You can always add the relevant configuration to the Docker image itself as well. From my understanding, a new version should be released towards the end of the month, and with it the ability to run the autoscaler without requiring a Docker image.

  				
Posted one year ago by CostlyOstrich36

Hi @QuaintOwl32, you can set a default image to use. My default for most jobs is nvcr.io/nvidia/pytorch:23.03-py3

  				
Posted one year ago by CostlyOstrich36

I will try to create a Docker image.
What options do I have for uploading the image so the autoscaler can use it? Do I have to use Docker Hub?

  				
Posted one year ago by QuaintOwl32

Update: a newer version of the autoscaler has been deployed.

  				
Posted one year ago by SuccessfulKoala55

Yes, this will cause the code to run inside the container.

"if so it won't work as my environment is in the host linux"

I'm not sure I understand this part. Can you please elaborate?

  				
Posted one year ago by CostlyOstrich36
