Just the import part should support it - in the offline cache dir it can be 2 separate tasks (or even from 2 different training machines)
e.g. trained on 1 machine in offline mode - the machine crashed in the middle, but a checkpoint was saved. Start a new training job from that checkpoint (also in offline mode).
Then I would like to create 1 real task that combines both of these runs.
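For reference, a minimal sketch of the offline flow being discussed, using the documented Task.set_offline / Task.import_offline_session calls; the task name and session paths below are made up:

from clearml import Task

# On the training machine (no connectivity): everything is recorded locally
Task.set_offline(offline_mode=True)
task = Task.init(project_name="playground-sandbox", task_name="offline-train")
# ... training runs, crashes, and is restarted from the checkpoint as a second offline session ...

# Later, on a machine that can reach the server - each offline session becomes its own task
Task.import_offline_session("~/.clearml/cache/offline/offline-run-1.zip")  # made-up paths
Task.import_offline_session("~/.clearml/cache/offline/offline-run-2.zip")
# today this yields 2 separate tasks; the ask above is a way to merge them into 1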
Strange, I guess my toy example is not exactly what was happening originally..
If I manage to create a good toy example I will add it..
I want to set use_credentials_chain to true, but I do not want to change the config file, because I am running in the cloud and do not want to have to download it each time I run.
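One possible workaround (just a sketch, not an official recommendation): generate a minimal clearml.conf at runtime and point the documented CLEARML_CONFIG_FILE environment variable at it before importing clearml. This assumes the API credentials are already supplied through the CLEARML_API_* environment variables, and the project/task names are made up:

import os, tempfile

# Write a throwaway config that only overrides the one setting we care about
conf = tempfile.NamedTemporaryFile(mode="w", suffix=".conf", delete=False)
conf.write("sdk.aws.s3.use_credentials_chain: true\n")
conf.close()
os.environ["CLEARML_CONFIG_FILE"] = conf.name

from clearml import Task  # import after the env var is set so the config is picked up
task = Task.init(project_name="playground-sandbox", task_name="cloud-run")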
Hi AgitatedDove14 ,
I played around with offline mode for a bit and I see 2 issues:
1. We would like to sync periodically so that we can see the progress of the training, but if I sync more than once I get a duplication of each line in the log (e.g. if I call import_offline_session 3 times with the same session_folder, I will get each line in the log 3 times).
2. Sometimes we resume training - using import_offline_session this is not possible (although it is possible using TaskHandl...
I can read them programmatically using tensorboard and then log them using the clearml logger, but I was hoping to avoid that
i.e. hoping there is already a tool that does this
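For the record, a rough sketch of that manual approach, using tensorboard's EventAccumulator and clearml's Logger.report_scalar; the log directory and task name are made up:

from clearml import Task
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

logdir = "/path/to/tb/logs"  # made-up path to the tensorboard event files

task = Task.init(project_name="playground-sandbox", task_name="tb-replay")
logger = task.get_logger()

ea = EventAccumulator(logdir)
ea.Reload()  # scan the event files under logdir

for tag in ea.Tags().get("scalars", []):
    for event in ea.Scalars(tag):
        # Re-report each tensorboard scalar at its original step
        logger.report_scalar(title=tag, series=tag, value=event.value, iteration=event.step)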
If it interests you, this seems to work:
last_task = Task.get_task(project_name="playground-sandbox", task_name='foo2')
task = Task.init(project_name="playground-sandbox", task_name='foo2', continue_last_task=last_task.id if last_task else None)
CostlyOstrich36 Thanks!
But it seems like this only works if I am running both times from the same machine, because clearml is not checking if the task exists on the server - it is checking if it is in cache_dir
Thanks, I'll try it out
AgitatedDove14
I was thinking of something like reuse_task_name:
if set to True, the import function will not create a new task but rather use the task with the name of the offline task (if available).
And in metric+log reporting it would check when the last "event" was and filter out everything before it
How does that sound to you?
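To make the proposal concrete, a purely hypothetical usage - reuse_task_name does not exist in the current API, and the session path is made up:

from clearml import Task

# First import creates the task; repeated imports of later snapshots of the same offline
# session would reuse the existing task (matched by name) and only append events newer
# than the last one already reported, instead of duplicating the whole log.
Task.import_offline_session("~/.clearml/cache/offline/offline-abc123.zip", reuse_task_name=True)  # hypothetical flag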
The autoscaling should ignore the outliers.
e.g. when starting training, the loss is high (~10) but quickly drops (<1); if I plot the scalar I will not be able to see the changes in loss too well because the graph is on a large range (0-10)
If I ignore the outliers I will get a scale of 0-1
I want everything to appear in the same experiment (e.g. scalar metrics)
no plots, only a couple of scalar metrics
there are a large number of artifacts
it gets stuck when comparing 2 experiments, even if one of them does not have the artifacts
I deleted the artifacts and it seems to work now
I deleted all the artifacts - so I currently don't have an example..
I think the previews should be loaded lazily so something like this does not happen
This only happens when I "continue" a task
the counter gets reset..
My current workaround is overriding UploadEvent._get_metric_count by adding an offset to _counter
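Roughly, the monkeypatch looks like this (a sketch only - UploadEvent._get_metric_count is a clearml internal, so the import path and signature may differ between versions; the offset value is made up):

from clearml.backend_interface.metrics.events import UploadEvent

ITERATION_OFFSET = 1000  # made-up value: the last count reached before the task was continued

# Keep a reference to the original classmethod implementation
_original_get_metric_count = UploadEvent._get_metric_count.__func__

def _get_metric_count_with_offset(cls, *args, **kwargs):
    # Delegate to the original counter and shift the result so a continued task
    # does not restart its counter from 0
    return _original_get_metric_count(cls, *args, **kwargs) + ITERATION_OFFSET

UploadEvent._get_metric_count = classmethod(_get_metric_count_with_offset)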