Eureka! This would be great. I could just then pass it as a hyperparameter
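roughly what I have in mind, just a sketch - the project/parameter names here are made up:
from clearml import Task

task = Task.init(project_name="my-project", task_name="resume-demo")  # hypothetical names

# hypothetical hyperparameter that holds the iteration offset, editable from the UI
params = {"initial_iteration": 0}
task.connect(params)

task.set_initial_iteration(params["initial_iteration"])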
still no luck, I tried everything =( any updates?
thank you, I'll let you know if setting it to zero worked
perhaps I need to do task.set_initial_iteration(0)?
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to values that are explicitly reported, no?
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
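for reference, the training side looks roughly like this (simplified sketch, the numbers and log dir are made up):
import math
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/resume-demo")  # hypothetical log dir
start_epoch, num_epochs = 10, 12  # pretend we resume from epoch 10
for epoch in range(start_epoch, num_epochs):
    train_loss = math.exp(-epoch)  # stand-in for the real training loss
    # the epoch number is passed explicitly as the global step
    writer.add_scalar("train/loss", train_loss, global_step=epoch)
writer.close()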
well okay, it's probably not that weird considering the worker just runs the container
m5.xlarge EC2 instance (4 vCPUs, 16 GB RAM), 100GB disk
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems. The Slack bot works, though!
python3 slack_alerts.py --channel trains-alerts --slack_api "OUR_KEY" --include_completed_experiments --include_manual_experiments
new icons are slick, it would be even better if you could upload custom icons for the different projects
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
perhaps it's happening because it's an old project that was moved to the new root project?
maybe I should use explicit reporting instead of TensorBoard
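i.e. something along these lines (just a sketch, the title/series names are made up):
from clearml import Task

task = Task.init(project_name="my-project", task_name="explicit-reporting-demo")  # hypothetical names
logger = task.get_logger()

for epoch in range(3):
    loss = 1.0 / (epoch + 1)  # stand-in for the real training loss
    # report the scalar directly with the epoch as the iteration, bypassing TensorBoard
    logger.report_scalar(title="train", series="loss", value=loss, iteration=epoch)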
btw, I use Docker for training, which means that the log_dir contents are removed for the continued experiment
does this mean that setting initial iteration to 0 should help?
there is no method for setting the last iteration, which is what's used for reporting when continuing the same task. maybe I could somehow change this value for the task?
okay, so if there's no workaround atm, should I create a GitHub issue?
self-hosted ClearML server 1.2.0
SDK version 1.1.6
I'm so happy to see that this problem has been finally solved!
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID
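for context, this is roughly what I do now (shell sketch; the --stop flag below is my assumption about the intended way, please correct me if that's wrong):
# what I do today: find the agent daemon's PID and kill it by hand
kill "$(pgrep -f 'clearml-agent daemon')"
# what I assume is the cleaner way:
# clearml-agent daemon --stop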