this is how it looks if I zoom in on the epochs that ran before the crash
task = Task.get_task(task_id=args.task_id)
task.mark_started()
# point the continued run at the saved checkpoint
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
# reset the initial iteration so reported values are not offset
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
thank you, I'll let you know if setting it to zero worked
this would be great, I could then just pass it as a hyperparameter
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
yeah, server (1.0.0) and client (1.0.1)
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
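roughly like this in the training loop (a simplified sketch, not my actual code; validate(), start_epoch and num_epochs are just placeholders):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/continued_experiment")
for epoch in range(start_epoch, num_epochs):
    val_loss = validate(model)  # placeholder for the real validation step
    # the epoch number is passed explicitly as the global step
    writer.add_scalar("val/loss", val_loss, global_step=epoch)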
perhaps I need to do task.set_initial_iteration(0)?
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
as a sidenote, I am not able to pull the newest release, looks like it's not pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I've done it many times, using different devices. sometimes it works, sometimes it doesn't
I assume, temporary fix is to switch to trains-server?
thanks! this bug and cloning problem seem to be fixed
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to values that are explicitly reported, no?
okay, so if there's no workaround atm, should I create a GitHub issue?
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
maybe I should use explicit reporting instead of Tensorboard
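something along these lines, using the ClearML logger directly (a rough sketch with made-up metric names; assumes the task is already initialized in the training code):

from clearml import Task

logger = Task.current_task().get_logger()
for epoch in range(start_epoch, num_epochs):
    val_loss = validate(model)  # placeholder for the real validation step
    # report the value with the epoch as the iteration, bypassing TensorBoard
    logger.report_scalar(title="val", series="loss", value=val_loss, iteration=epoch)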
still no luck, I tried everything =( any updates?
don't know if it's relevant, but I also added a new user to apiserver.conf today
I've already pulled the new trains-server images, let's see if the initial issue occurs again. thanks for the fast response, guys!
I don't think so, because the max value of each metric is calculated independently of the other metrics
so the max values I get can be reached at different epochs
just DMed you a screenshot where you can see a part of the token
in order to use private repositories for our experiments I add agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from a non-existing repo, ClearML will print
fatal: repository 'https://username:token@github.com/our_organization/non_existing_repo.git/' not found
exposing the real token
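for context, the relevant part of my clearml.conf looks roughly like this (values are placeholders, not the real credentials):

agent {
    # credentials the agent uses to clone private repositories
    git_user: "username"
    git_pass: "personal-access-token"
}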
does this mean that setting the initial iteration to 0 should help?