The autoscaling ignores the outliers.
e.g. when starting training the loss is high (10) but quickly drops (<1); if I plot the scalar I will not be able to see the changes in loss too well because the graph is on a large range (0-10)
If I ignore the outliers I will get a scale of 0-1
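A minimal repro of what I mean (just a sketch - the project/task names are placeholders and the loss values are made up):
```
from clearml import Task

task = Task.init(project_name="playground-sandbox", task_name="loss-scale-repro")
logger = task.get_logger()

# the first couple of iterations are outliers (~10), then the loss quickly drops below 1
losses = [10.0, 7.5, 0.9, 0.5, 0.3, 0.2, 0.15, 0.12, 0.11, 0.1]
for i, loss in enumerate(losses):
    logger.report_scalar(title="train", series="loss", value=loss, iteration=i)

# in the web UI the y-axis autoscales to ~0-10, so the 0.1-0.9 region is flattened
```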
it gets stuck when comparing 2 experiments even if one of them does not have the artifacts
I deleted the artifacts and it seems to work now
Hi AgitatedDove14,
I played around with offline mode for a bit and I see 2 issues:
1. We would like to sync periodically so that we can see the progress of the training, but if I sync more than once I get a duplication of each line in the log (e.g. if I call `import_offline_session` 3 times with the same `session_folder`, I will get each line in the log 3 times) - rough repro below.
2. Sometimes we resume training - using `import_offline_session` this is not possible (although it is possible using `TaskHandl...`
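For issue 1, roughly this (the session folder path is a placeholder for whatever the offline task printed):
```
from clearml import Task

session_folder = "/path/to/clearml/offline/session"  # placeholder

# syncing "periodically" by re-importing the same session:
# every call re-imports the full log, so each line shows up once per call
for _ in range(3):
    Task.import_offline_session(session_folder)
```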
I deleted all the artifacts - so I currently don't have an example..
I think the previews should be loaded lazily so something like this does not happen
Strange, I guess my toy example is not exactly what was happening originally..
If I manage to create a good toy example I will add it..
I want everything to appear in the same experiment (e.g. scalar metrics)
CostlyOstrich36 Thanks!
But it seems like this only works if I am running both times from the same machine, because ClearML is not checking if the task exists on the server - it is checking if it is in `cache_dir`
I can read them programmatically using tensorboard and then log them using the ClearML logger, but I was hoping to avoid that
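For reference, the workaround I was hoping to avoid would look roughly like this (standard tensorboard `EventAccumulator` API; the logdir path and task names are placeholders):
```
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from clearml import Task

task = Task.init(project_name="playground-sandbox", task_name="tb-replay")
logger = task.get_logger()

acc = EventAccumulator("/path/to/tb/logdir")  # placeholder
acc.Reload()

# replay every scalar series from the event files into the clearml logger
for tag in acc.Tags()["scalars"]:
    for event in acc.Scalars(tag):
        logger.report_scalar(title=tag, series=tag, value=event.value, iteration=event.step)
```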
AgitatedDove14
I was thinking of something like `reuse_task_name`:
if set to True, the import function will not create a new task but rather use the task with the name of the offline task (if available).
And in metric+log reporting it would check when the last "event" was and filter out everything before it
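In code, the usage I'm imagining would be something like this (`reuse_task_name` is the flag I'm proposing - it does not exist today):
```
from clearml import Task

session_folder = "/path/to/clearml/offline/session"  # placeholder

# hypothetical: reuse_task_name is the proposed flag, not an existing parameter
Task.import_offline_session(session_folder, reuse_task_name=True)
# -> reuses the task whose name matches the offline task (if one exists),
#    checks its last "event" timestamp, and imports only newer events
```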
How does that sound to you?
no plots, only a couple of scalar metrics
there are a large number of artifacts
Just the import part should support it - in the offline cache dir they can be 2 separate tasks (or even from 2 different training machines)
e.g. trained on 1 machine in offline mode - the machine crashed in the middle but a checkpoint was saved. Start a new training job from that checkpoint (also in offline mode).
Then I would like to create 1 real task that combines both of these runs
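So the flow would look something like this (`import_offline_session` is the real call; merging both sessions into one server task is the behavior I'm asking for, and the paths are placeholders):
```
from clearml import Task

# both runs used Task.set_offline(True); machine 1 crashed mid-training,
# machine 2 resumed from its last checkpoint
Task.import_offline_session("/sessions/machine1-run")     # creates the task on the server
Task.import_offline_session("/sessions/machine2-resume")  # would ideally continue that same task
```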
Thanks, I'll try it out
If it interests you, this seems to work:
```
last_task = Task.get_task(project_name="playground-sandbox", task_name='foo2')
task = Task.init(project_name="playground-sandbox", task_name='foo2',
                 continue_last_task=last_task.id if last_task else None)
```
I want to set `use_credentials_chain` to true, but I do not want to change the config file, because I am running in the cloud and do not want to have to download it each time I run
i.e. hoping there already is a tool that does this
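What I'm currently considering is an env-var override, assuming newer SDK versions really do map `CLEARML__<section>__<key>` variables onto `clearml.conf` entries (that mapping is an assumption on my part):
```
import os

# assumed mapping: "__" stands for "." in the config path,
# so this should translate to sdk.aws.s3.use_credentials_chain;
# set it before clearml loads its configuration
os.environ["CLEARML__sdk__aws__s3__use_credentials_chain"] = "true"

from clearml import Task
task = Task.init(project_name="playground-sandbox", task_name="s3-creds-chain")
```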
This only happens when I "continue" a task - the counter gets reset..
My current workaround is overriding `UploadEvent._get_metric_count` by adding an offset to the counter
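Roughly like this, in case it helps (the `UploadEvent` import path and the classmethod signature are assumptions from reading the SDK source, so it may break across versions):
```
from clearml.backend_interface.metrics.events import UploadEvent

OFFSET = 500  # placeholder: the last counter value of the run being continued

# assumption: _get_metric_count is a classmethod; wrap it and shift its result
_original = UploadEvent._get_metric_count.__func__

def _get_metric_count(cls, *args, **kwargs):
    return OFFSET + _original(cls, *args, **kwargs)

UploadEvent._get_metric_count = classmethod(_get_metric_count)
```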