Answered
Hi, Is It Possible To Sync Experiments Using S3 Or GS? I Would Love To Have A Look At Some Documentation. We Want To Sync The Training Runs While They Are Running [Not Just When They Are Finished]. Thanks,

Hi,

Is it possible to sync experiments using S3 or GS?
I would love to have a look at some documentation.

We want to sync the training runs while they are running (not just when they are finished).

Thanks,

StaleButterfly40 FYI

  
  
Posted 2 years ago

Answers 17


DisturbedElk70 Hi 🙂

Can you elaborate?

  
  
Posted 2 years ago

We are trying to sync training jobs from GCP to AWS.
We don't have a direct connection from the training instance, so we need to sync them back to AWS through a third party.

  
  
Posted 2 years ago

Hi DisturbedElk70 , I'm not sure I understand what you mean by sync - do you mean store all models/checkpoints to S3?

  
  
Posted 2 years ago

Nope.

I want to use ClearML to manage my experiments, but I have no access to the server from the instance I'm using.

  
  
Posted 2 years ago

I have no access to the server from the instance I'm using

The server being the ClearML free server? Or an open-source ClearML server you've installed yourself?

  
  
Posted 2 years ago

You can use the offline mode and later sync the run with the server

  
  
Posted 2 years ago

What do you mean by "later"? Do you mean the training needs to end in order to sync it?

  
  
Posted 2 years ago

Yes, in offline mode the task writes everything to a local cache, which you can later (when the task finishes) upload to the server - see here: https://clear.ml/docs/latest/docs/guides/set_offline#setting-task-to-offline-mode
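
For reference, the offline workflow described here might look roughly like the sketch below. It relies on `Task.set_offline` and `Task.import_offline_session` from the ClearML SDK; the project name, task name, and the "loss"/"train" metric names are placeholders, the function names `run_offline` and `import_offline` are made up for illustration, and the loop is only a stand-in for real training. Please check the linked documentation for the exact behavior in your SDK version.

```python
def run_offline(project_name="examples", task_name="offline run"):
    """Run on the machine that cannot reach the server (placeholder names)."""
    # Imported lazily so this sketch can be loaded without clearml installed.
    from clearml import Task

    # Offline mode has to be enabled before Task.init is called.
    Task.set_offline(offline_mode=True)
    task = Task.init(project_name=project_name, task_name=task_name)
    logger = task.get_logger()

    for iteration in range(3):  # stand-in for the real training loop
        logger.report_scalar(
            title="loss", series="train",
            value=1.0 / (iteration + 1), iteration=iteration,
        )

    # Closing the task finalizes the local offline session.
    task.close()


def import_offline(session_zip):
    """Run on a machine that CAN reach the server, after the session was copied over."""
    from clearml import Task

    # session_zip is the path to the offline session archive (placeholder).
    return Task.import_offline_session(session_folder_zip=session_zip)
```

The two functions are meant to run on different machines: `run_offline` on the isolated training instance, `import_offline` wherever the server is reachable.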

  
  
Posted 2 years ago

The problem is that my training runs are long (a few days) and I want to monitor them while they are running.
Is there a solution for that?

  
  
Posted 2 years ago

And the machine running the training can't reach the server?

  
  
Posted 2 years ago

Is there a solution for that?

Hi DisturbedElk70
Well, assuming you mount/sync the "temp" folder of the offline experiment to a storage solution, and then have another process (on the other side) syncing these folders, it will work and you will get "real-time" updates 🙂
Offline Folder:
get_cache_dir() / 'offline' / task_id
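
One way to picture the "mount/sync the folder" idea is a small mirroring pass that is run periodically. The sketch below is only an illustration using the standard library: `mirror_folder` is a hypothetical helper, and it assumes the destination is a locally mounted path (for example a bucket mounted on the filesystem); it does not talk to S3 or GS directly and is not part of ClearML.

```python
import shutil
from pathlib import Path


def mirror_folder(src, dst):
    """Copy files that are new or changed (by size or mtime) from src to dst.

    Returns the relative paths copied during this pass.
    """
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        src_stat = path.stat()
        if target.exists():
            dst_stat = target.stat()
            unchanged = (
                dst_stat.st_size == src_stat.st_size
                and dst_stat.st_mtime >= src_stat.st_mtime
            )
            if unchanged:
                continue  # already mirrored, skip
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves the modification time
        copied.append(rel.as_posix())
    return copied
```

In practice `src` would be the offline folder mentioned above and the call would sit inside a loop with a sleep between passes. Files that are still being written may be copied in a partial state, so the reading side should tolerate incomplete trailing data.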

  
  
Posted 2 years ago

Sounds perfect!

  
  
Posted 2 years ago

Hi AgitatedDove14 ,
I played around with offline mode for a bit and I see 2 issues:
1. We would like to sync periodically so that we can see the progress of the training, but if I sync more than once I get a duplication of each line in the log (e.g. if I call import_offline_session 3 times with the same session_folder, I get each line in the log 3 times).
2. Sometimes we resume training. Using import_offline_session this is not possible (although it is possible using TaskHandler.report_offline_session(task, session_folder) and Metrics.report_offline_session(task, session_folder)).

  
  
Posted 2 years ago

Hi StaleButterfly40

but if I sync more than once I get a duplication of each line in the log

Hmm... let me check if we can "force" overwriting (it might require you to have more stateful code for the sync process)

sometimes we resume training

How would that work in offline mode? The offline process cannot sync with the backend... Are you saying you would like to get a new capability, "continue-offline-session" ?

  
  
Posted 2 years ago

AgitatedDove14
I was thinking of something like a reuse_task_name option:
if set to True, the import function will not create a new task but will instead use the task with the name of the offline task (if available).
And in metric and log reporting, it would check when the last "event" was and filter out everything before it.
How does that sound to you?
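
The "filter out everything before the last event" idea can be sketched in a few lines. This is only an illustration of the proposal, not existing ClearML behavior: `filter_new_events` is a hypothetical helper, and the event shape (a dict with a numeric "timestamp" key) is an assumption, not the actual offline session format.

```python
def filter_new_events(events, last_synced_ts):
    """Return (fresh_events, new_mark).

    fresh_events: events strictly newer than last_synced_ts.
    new_mark: the timestamp to remember for the next sync pass.
    """
    fresh = [e for e in events if e["timestamp"] > last_synced_ts]
    new_mark = max((e["timestamp"] for e in fresh), default=last_synced_ts)
    return fresh, new_mark


# Usage: the same events are seen twice, but only reported once.
events = [
    {"timestamp": 10, "msg": "epoch 1"},
    {"timestamp": 20, "msg": "epoch 2"},
]
fresh, mark = filter_new_events(events, last_synced_ts=0)
again, mark = filter_new_events(events, last_synced_ts=mark)
```

The sync process would need to persist the mark between passes. Note that with a strict comparison, an event that arrives later with exactly the same timestamp as the mark would be dropped.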

  
  
Posted 2 years ago

StaleButterfly40 just making sure I understand: are we trying to solve the "import offline zip file/folder" issue, where we create multiple Tasks (i.e. a Task per import)? Or are you suggesting that the actual task (the one running in offline mode) needs support for continuing a previous execution?

  
  
Posted 2 years ago

Just the import part should support it. In the offline cache dir there can be 2 separate tasks (or even tasks from 2 different training machines).
E.g. we train on 1 machine in offline mode, the machine crashes in the middle but a checkpoint was saved, and we start a new training job from that checkpoint (also in offline mode).
Then I would like to create 1 real task that combines both of these runs.
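
To make the "combine both runs" request concrete, here is a minimal sketch of merging the scalar series of a crashed run and a run resumed from its checkpoint. `merge_runs` is a hypothetical helper, and representing each run as a list of (iteration, value) pairs is an assumption for illustration; the import function does not currently do this.

```python
def merge_runs(first_run, resumed_run):
    """Merge two lists of (iteration, value) pairs into one series.

    Points of the first run at or after the iteration where the resumed
    run starts are dropped, so the resumed run wins on the overlap.
    """
    if not resumed_run:
        return sorted(first_run)
    resume_at = min(iteration for iteration, _ in resumed_run)
    kept = [(it, value) for it, value in first_run if it < resume_at]
    return sorted(kept + list(resumed_run))


# First run crashed at iteration 120; the last checkpoint was at 100.
first = [(0, 1.0), (50, 0.8), (100, 0.6), (120, 0.55)]
resumed = [(100, 0.6), (150, 0.5)]
combined = merge_runs(first, resumed)
```

Here the point at iteration 120 from the crashed run is discarded because it was produced after the checkpoint the second run resumed from.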

  
  
Posted 2 years ago