Hi, Is It Possible To Sync Experiments Using S3 Or GS? I Would Love To Have A Look At Some Documentation. We Want To Sync The Trainings While They Are Running [Not Just When They Are Finished] Thanks,

Answered

Hi,

Is it possible to sync experiments using S3 or GS?
I would love to have a look at some documentation.

We want to sync the trainings while they are running [not just when they are finished].

Thanks,

StaleButterfly40 FYI

  				
Posted 3 years ago by DisturbedElk70


Answers 17

DisturbedElk70 Hi 🙂

Can you elaborate?

  				
Posted 3 years ago by CostlyOstrich36

We are trying to sync training jobs from GCP to AWS.
We don't have a direct connection from the training instance, hence we need to sync it back to AWS using a third party.

  				
Posted 3 years ago by DisturbedElk70

Hi DisturbedElk70, I'm not sure I understand what you mean by sync. Do you mean storing all models/checkpoints to S3?

  				
Posted 3 years ago by SuccessfulKoala55

Nope.

I want to use ClearML to manage my experiments, but I have no access to the server from the instance I'm using.

  				
Posted 3 years ago by DisturbedElk70

I have no access to the server from the instance I'm using

The server being the ClearML free server? Or an open-source ClearML server you've installed yourself?

  				
Posted 3 years ago by SuccessfulKoala55

You can use the offline mode and later sync the run with the server

  				
Posted 3 years ago by SuccessfulKoala55

What do you mean by "later"? Do you mean the training needs to end in order to sync it?

  				
Posted 3 years ago by DisturbedElk70

Yes, in offline mode the task writes everything to a local cache, which you can later (when the task finishes) upload to the server - see here: https://clear.ml/docs/latest/docs/guides/set_offline#setting-task-to-offline-mode
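
A minimal sketch of that flow, assuming the `clearml` package is installed on both machines. `Task.set_offline` and `Task.import_offline_session` are the calls described in the linked guide; the project name, the task name, and the archive path argument are placeholders chosen for illustration.

```python
def run_offline_training():
    """Run on the isolated training machine: nothing is sent to the server."""
    # Imported lazily so this file can be loaded on machines without clearml.
    from clearml import Task

    # Offline mode has to be switched on before the task is initialized.
    Task.set_offline(offline_mode=True)
    task = Task.init(project_name="examples", task_name="offline run")  # placeholder names

    # ... training code goes here; reports are written to the local offline folder ...

    # Close the task once training ends.
    task.close()


def import_finished_session(session_zip_path):
    """Run on a machine that can reach the server, after copying the session archive over."""
    from clearml import Task

    # session_zip_path is a placeholder for wherever the offline archive was copied to.
    return Task.import_offline_session(session_folder_zip=session_zip_path)
```

Note that this only covers the "sync after the task finishes" case described above.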

  				
Posted 3 years ago by SuccessfulKoala55

The problem is that my trainings are long [a few days] and I want to monitor them while they are running.
Is there a solution for that?

  				
Posted 3 years ago by DisturbedElk70

And the machine running the training can't reach the server?

  				
Posted 3 years ago by SuccessfulKoala55

Is there a solution for that?

Hi DisturbedElk70
Well, assuming you mount/sync the "temp" folder of the offline experiment to a storage solution, and then have another process (on the other side) syncing these folders, it will work and you will get "real-time" updates 🙂
Offline Folder:
get_cache_dir() / 'offline' / task_id
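
As a rough illustration of the "sync the folder" half, here is a standard-library sketch that mirrors new or changed files from the offline folder into a staging folder. It is not ClearML code: `mirror_folder` is a hypothetical helper, and both paths are supplied by the caller (the source would be the offline folder mentioned above, the destination a mounted bucket or a folder that is pushed to S3/GS separately).

```python
import shutil
from pathlib import Path


def mirror_folder(src, dst):
    """Copy files from src to dst when they are missing or look outdated.

    A file is considered outdated when its size differs or the source has a
    newer modification time. Returns the relative paths that were copied.
    """
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        src_stat = path.stat()
        if target.exists():
            dst_stat = target.stat()
            if dst_stat.st_size == src_stat.st_size and dst_stat.st_mtime >= src_stat.st_mtime:
                continue  # already up to date
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves the modification time
        copied.append(rel.as_posix())
    return copied
```

Calling `mirror_folder` every minute or so from a background process, with the destination then pushed by a tool such as `aws s3 sync` or `gsutil rsync`, would give the process on the other side something to pull from. Files that are being written at the moment of the copy may arrive incomplete and get refreshed on the next pass.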

  				
Posted 3 years ago by AgitatedDove14

Sounds perfect!

  				
Posted 3 years ago by DisturbedElk70

Hi AgitatedDove14,
I played around with offline mode for a bit and I see 2 issues:
1. We would like to sync periodically so that we can see the progress of the training, but if I sync more than once I get a duplication of each line in the log (e.g. if I call import_offline_session 3 times with the same session_folder, I get each line in the log 3 times).
2. Sometimes we resume training. Using import_offline_session this is not possible (although it is possible using TaskHandler.report_offline_session(task, session_folder) and Metrics.report_offline_session(task, session_folder)).

  				
Posted 3 years ago by StaleButterfly40

Hi StaleButterfly40

but if I sync more than once I get a duplication of each line in log

Hmm... let me check if we can "force" overwriting (it might require you to have more stateful code for the sync process)
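
To make "more stateful" concrete, here is one possible shape for it, sketched with the standard library only. This is an assumption about how a sync process could avoid re-sending log lines, not something ClearML provides: `read_new_lines` and the JSON state file are hypothetical, and the idea is simply to remember how many bytes of a log file were already forwarded.

```python
import json
from pathlib import Path


def read_new_lines(log_path, state_path):
    """Return the lines appended to log_path since the previous call.

    The byte offset already consumed is persisted in state_path (a JSON file),
    so repeated syncs never return the same line twice.
    """
    log_path, state_path = Path(log_path), Path(state_path)
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    offset = state.get(str(log_path), 0)

    with log_path.open("rb") as handle:
        handle.seek(offset)
        chunk = handle.read()

    # Consume complete lines only; a half-written last line waits for the next sync.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()

    state[str(log_path)] = offset + end
    state_path.write_text(json.dumps(state))
    return lines
```

Each sync pass would then forward only what `read_new_lines` returns, instead of re-importing the whole session folder.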

sometime we resume training

How would that work in offline mode? The offline process cannot sync with the backend... Are you saying you would like to get a new capability, "continue-offline-session"?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14
I was thinking of something like a reuse_task_name option:
if set to True, the import function will not create a new task but rather use the task with the name of the offline task (if available).
And in metric + log reporting it would check when the last "event" was and filter out everything before it.
How does that sound to you?
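
The filtering half of this proposal could look roughly like the sketch below. It is plain Python and not ClearML code: reuse_task_name is only the option proposed in this post, and the event dictionaries with a `timestamp` key are an assumed shape used for illustration.

```python
def latest_timestamp(events, default=None):
    """Timestamp of the newest event, or default when there are no events."""
    return max((event["timestamp"] for event in events), default=default)


def filter_new_events(events, last_reported_ts):
    """Keep only the events that are newer than the last one already reported."""
    if last_reported_ts is None:
        return list(events)  # nothing reported yet, so everything is new
    return [event for event in events if event["timestamp"] > last_reported_ts]


# Example: the first two events were already reported by an earlier import.
already_reported = [{"timestamp": 10, "msg": "epoch 1"}, {"timestamp": 20, "msg": "epoch 2"}]
session_events = already_reported + [{"timestamp": 30, "msg": "epoch 3"}]
new_events = filter_new_events(session_events, latest_timestamp(already_reported))
```

One caveat of a strict "newer than" comparison is that an event sharing the exact timestamp of the last reported one would be skipped.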

  				
Posted 3 years ago by StaleButterfly40

StaleButterfly40 just making sure I understand: are we trying to solve the "import offline zip file/folder" issue, where we create multiple Tasks (i.e. a Task per import)? Or are you suggesting that the actual task (the one running in offline mode) needs support for continuing a previous execution?

  				
Posted 3 years ago by AgitatedDove14

Just the import part should support it. In the offline cache dir it can be 2 separate tasks (or they can even come from 2 different training machines).
E.g. we trained on 1 machine in offline mode and the machine crashed in the middle, but a checkpoint was saved; we then start a new training job from that checkpoint (also in offline mode).
Then I would like to create 1 real task that combines both of these runs.
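
A small sketch of what "combining both runs" could mean for the reported events, under the assumption that every event carries an `iteration` number. The function and the event layout are hypothetical and only illustrate the merge rule: anything the crashed run reported at or after the resume point is replaced by the resumed run.

```python
def merge_resumed_runs(first_run, second_run):
    """Combine events of a crashed run with those of the run resumed from its checkpoint.

    Events of the first run at or after the iteration where the second run
    starts are dropped, so overlapping iterations come only from the resumed run.
    """
    if not second_run:
        return sorted(first_run, key=lambda event: event["iteration"])
    resume_at = min(event["iteration"] for event in second_run)
    kept = [event for event in first_run if event["iteration"] < resume_at]
    return sorted(kept + list(second_run), key=lambda event: event["iteration"])


# Example: the first run crashed after iteration 4, the checkpoint was taken at iteration 3.
crashed = [{"iteration": i, "run": "a"} for i in range(5)]
resumed = [{"iteration": i, "run": "b"} for i in range(3, 7)]
combined = merge_resumed_runs(crashed, resumed)
```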

  				
Posted 3 years ago by StaleButterfly40
