I think for the time being it's not possible to upload automatically to S3. It shouldn't be a problem to add support for that, but I don't think it's supported ATM (will double check)
We can use VPC (which we use, but then the entire bringup of the training would be different)
It's automatically set in the user's .clearml - I think that /opt/ml is persistent (this is where you are supposed to save checkpoints in sagemaker)
So I think it's necessary to code defensively and, once training is done, upload to a remote location (S3 in your case). If the disk is persistent this shouldn't be a problem, as the logs will be saved. Makes sense?
makes sense.. I currently run aws s3 sync every n iterations, and then I saw that there is an option to load a dir rather than a zip
Sure - the problem is that many of our trainings in sagemaker are not exposed to the company's server
thanks for the quick response! and also - your library/product is really cool and impressive
So all training machines will be exposed to the server?
Can you elaborate on the use-case a bit more? Why not report directly to the server?
BTW, just talked to the devs, what happens is that your metrics/logs are saved locally, then once a task is closed, it's zipped. If you are afraid the instance might be taken from you, first we are planning to release a solution for these situations 🙂 and second your code needs to be aware of the risk and to be able to "resume" training from a specific model snapshot/iteration.
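A minimal sketch of the "resume from a snapshot" pattern mentioned above, using only the Python stdlib. All names here (`CKPT_PATH`, `save_checkpoint`, the toy train loop) are illustrative, not ClearML or SageMaker API; on SageMaker you would point the checkpoint path at /opt/ml.

```python
import json
import os
import tempfile

# Stand-in for a persistent checkpoint location (e.g. somewhere under /opt/ml).
CKPT_PATH = os.path.join(tempfile.mkdtemp(), "ckpt.json")

def save_checkpoint(iteration, weights):
    # Write to a temp file and rename, so a mid-write spot termination
    # never leaves a corrupt checkpoint behind.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"iteration": iteration, "weights": weights}, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    # Resume from the last snapshot if one exists, else start fresh.
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH) as f:
            return json.load(f)
    return {"iteration": 0, "weights": [0.0]}

state = load_checkpoint()
resumed_from = state["iteration"]
weights = state["weights"]
for it in range(resumed_from, 10):
    weights = [w + 0.1 for w in weights]  # stand-in for a real training step
    if it % 5 == 0:                       # checkpoint every N iterations
        save_checkpoint(it + 1, weights)
```

If the script is killed and relaunched, `load_checkpoint()` picks up the last saved iteration instead of restarting from zero.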
Cool and impressive are 2 adjectives we like to hear 😄
this should explain how to do it. You get the offline session path once you init the task
If you want you can just upload them manually to s3 as the last "line" of the script, or write a pipeline step that does that. Just remember you'll have to import it somehow later on
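A hedged sketch of that last "line" of the script: archive the offline session folder and push the zip somewhere durable. The `session_dir` here is a throwaway stand-in; in ClearML offline mode you would use the session path the task reports after init. The boto3 upload step is shown only as a comment (assumed bucket name and credentials).

```python
import os
import shutil
import tempfile

# Stand-in for the offline session folder -- replace with the real path
# printed/returned once the task is initialized in offline mode.
session_dir = tempfile.mkdtemp()
with open(os.path.join(session_dir, "metrics.json"), "w") as f:
    f.write("{}")  # pretend the task wrote some logs here

# Zip the whole session folder next to it: <session_dir>.zip
archive = shutil.make_archive(session_dir, "zip", session_dir)

# Assumed upload step (requires boto3 and AWS credentials; bucket name is made up):
# import boto3
# boto3.client("s3").upload_file(archive, "my-bucket", os.path.basename(archive))
```

Later you'd download that zip and import it into the server, which is the "import it somehow later on" part.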
maybe I missed it in the documentation - but I could also use something like set_offline_dir() (to make sure it's pointing at /opt/ml or something) and then get_offline_file() and upload it myself
If spot is taken from you then yes. It will be. (unless there's some drive persistence)
but what happens if the script is terminated? maybe a spot termination, ctrl+c - does this mean I lose track of the training?