Hi AgitatedDove14 ,
I played around with offline mode for a bit and I see 2 issues:
1. We would like to sync periodically so that we can see the progress of the training, but if I sync more than once I get a duplication of each line in the log (e.g. if I call import_offline_session 3 times with the same session_folder, I get each line in the log 3 times).
2. Sometimes we resume training. Using import_offline_session this is not possible (although it is possible using TaskHandler.report_offline_session(task, session_folder) and Metrics.report_offline_session(task, session_folder)).
Hi StaleButterfly40
but if I sync more than once I get a duplication of each line in log
Hmm.. let me check if we can "force" overwriting (it might require you to have a more stateful code for the sync process)
sometime we resume training
How would that work in offline mode? The offline process cannot sync with the backend... Are you saying you would like to get a new capability, "continue-offline-session" ?
Yes, in offline mode the task writes everything to a local cache, which you can later (when the task finishes) upload to the server - see here: https://clear.ml/docs/latest/docs/guides/set_offline#setting-task-to-offline-mode
Hi DisturbedElk70 , I'm not sure I understand what you mean by sync - do you mean store all models/checkpoints to S3?
I have no access to the server from the instance I'm using
The server being the ClearML free server? Or an open-source ClearML server you've installed yourself?
And the machine running the training can't reach the server?
What do you mean by "later"? Do you mean the training needs to end in order to sync it?
AgitatedDove14
I was thinking of something like reuse_task_name: if set to True, the import function will not create a new task but will instead use the task with the name of the offline task (if available).
And in metric + log reporting it would check when the last "event" was and filter out everything before it.
How does that sound to you?
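A rough sketch of the filtering idea above, in case it helps the discussion. It assumes each offline event is a dict with a numeric `timestamp` field; that shape and the `filter_new_events` helper are hypothetical and are not ClearML's actual on-disk format or API.

```python
from typing import Any, Dict, Iterable, List


def filter_new_events(
    events: Iterable[Dict[str, Any]], last_reported_ts: float
) -> List[Dict[str, Any]]:
    """Keep only events strictly newer than the last one already reported."""
    return [e for e in events if e["timestamp"] > last_reported_ts]


# Hypothetical log events read from an offline session folder.
events = [
    {"timestamp": 1.0, "msg": "epoch 1"},
    {"timestamp": 2.0, "msg": "epoch 2"},
    {"timestamp": 3.0, "msg": "epoch 3"},
]

# Pretend the first two events were already uploaded by an earlier import.
new_events = filter_new_events(events, last_reported_ts=2.0)
print([e["msg"] for e in new_events])
```

With a filter like this, importing the same session folder repeatedly would only report what arrived since the previous import. Two events sharing the exact timestamp of the last reported one would be dropped, so a real implementation would probably need a finer key.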
StaleButterfly40 just making sure I understand, are we trying to solve the "import offline zip file/folder" issue, where we create multiple Tasks (i.e. a Task per import)? Or are you suggesting the actual task (the one running in offline mode) needs support for continuing a previous execution?
Is there a solution for that?
Hi DisturbedElk70
Well assuming you mount/sync the "temp" folder of the offline experiment to a storage solution, then have another process (on the other side), syncing these folders, it will work and you will get "real-time" updates 🙂
Offline folder: get_cache_dir() / 'offline' / task_id
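The "mount/sync the folder" suggestion could look roughly like the sketch below, which mirrors new or changed files from the offline folder to another location using only the standard library. The `sync_offline_folder` helper and the demo paths are hypothetical; in practice the destination might be a mounted bucket or shared drive, and a separate process on the other side would still have to import the synced folder.

```python
import shutil
import tempfile
from pathlib import Path


def sync_offline_folder(src: Path, dst: Path) -> int:
    """Copy new or changed files from src to dst; return how many were copied."""
    copied = 0
    for path in src.rglob("*"):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        if target.exists():
            s, t = path.stat(), target.stat()
            # Skip files whose size is unchanged and that are not newer.
            if s.st_size == t.st_size and s.st_mtime <= t.st_mtime:
                continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves the modification time
        copied += 1
    return copied


if __name__ == "__main__":
    # Tiny demo with temporary folders standing in for the real locations.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        (Path(a) / "log.jsonl").write_text("line 1\n")
        print(sync_offline_folder(Path(a), Path(b)))
```

Calling this on a timer (or from cron) gives the periodic mirroring; it does not by itself solve the duplicate-lines problem on import.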
Just the import part should support it - in offline cache dir it can be 2 separate tasks (or even from 2 different training machines)
E.g. we train on 1 machine in offline mode and the machine crashes in the middle, but a checkpoint was saved. We then start a new training job from that checkpoint (also in offline mode).
Then I would like to create 1 real task that combines both of these runs
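To make the "combine both runs" request concrete, here is a minimal sketch of one possible merge rule: scalar events from the two runs are keyed by metric and iteration, and where the resumed run overlaps the crashed one, the resumed run wins. The event shape and the `merge_runs` helper are hypothetical, and this is not something import_offline_session is known to do.

```python
from typing import Any, Dict, Iterable, List


def merge_runs(
    first_run: Iterable[Dict[str, Any]], second_run: Iterable[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Merge scalar events from two runs; on overlap the second run wins."""
    merged: Dict[tuple, Dict[str, Any]] = {}
    for event in list(first_run) + list(second_run):
        # Later entries overwrite earlier ones with the same key.
        merged[(event["metric"], event["iteration"])] = event
    return sorted(merged.values(), key=lambda e: (e["metric"], e["iteration"]))


# Run A crashed after iteration 3; the last checkpoint was at iteration 2.
run_a = [
    {"metric": "loss", "iteration": i, "value": v}
    for i, v in [(0, 0.9), (1, 0.8), (2, 0.7), (3, 0.65)]
]
# Run B resumed from the iteration-2 checkpoint.
run_b = [
    {"metric": "loss", "iteration": i, "value": v}
    for i, v in [(2, 0.71), (3, 0.6), (4, 0.5)]
]

combined = merge_runs(run_a, run_b)
print([(e["iteration"], e["value"]) for e in combined])
```

Letting the resumed run win on overlap is only one choice; keeping both series side by side would be another.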
Nope.
I want to use ClearML to manage my experiments, but I have no access to the server from the instance I'm using.
The problem is that my training runs are long (a few days) and I want to monitor them while they are running.
Is there a solution for that?
We try to sync training jobs from GCP to AWS.
We don't have a direct connection from the training instance, hence we need to sync it back to AWS using a third party.
You can use the offline mode and later sync the run with the server