example of the failed experiment
okay, so if there's no workaround atm, should I create a GitHub issue?
python3 slack_alerts.py --channel trains-alerts --slack_api "OUR_KEY" --include_completed_experiments --include_manual_experiments
maybe I should use explicit reporting instead of TensorBoard
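(for reference, a minimal sketch of what explicit reporting would look like; the project/task names and the metric are placeholders, not from the thread:)
from clearml import Task

task = Task.init(project_name="examples", task_name="explicit-reporting")  # placeholder names
logger = task.get_logger()
for step in range(10):
    # report scalars directly instead of relying on TensorBoard auto-logging
    logger.report_scalar(title="loss", series="train", value=1.0 / (step + 1), iteration=step)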
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
do you have any idea why the cleanup task keeps failing then? (it used to work before the update)
in order to use private repositories for our experiments I add agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from a non-existent repo, ClearML will print
fatal: repository ' https://username:token@github.com/our_organization/non_existing_repo.git/ ' not found
exposing the real token
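(for context, the relevant clearml.conf section looks roughly like this; the values below are placeholders:)
agent {
    # credentials the agent uses to clone private repositories
    git_user: "username"
    git_pass: "personal_access_token"
}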
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
new icons are slick, it would be even better if you could upload custom icons for the different projects
# fetch the existing task by id
task = Task.get_task(task_id=args.task_id)
task.mark_started()
# update the parameters used for the continued run
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
# re-enqueue the task on its original queue
Task.enqueue(task=task, queue_name=task.data.execution.queue)
nope, the old cleanup task fails with trains_agent: ERROR: Could not find task id=e7725856e9a04271aab846d77d6f7d66 (for host: )
Exception: 'Tasks' object has no attribute 'id'
weirdly enough, curl http://apiserver:8008 from inside the container works
I decided to restart the containers one more time, this is what I got.
I had to restart Docker service to remove the containers
nice, thanks! I'll check if it solves the issue first thing tomorrow morning
it also happens sometimes during the run, when TensorBoard is trying to write something to disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
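(one possible workaround sketch, assuming Task.add_requirements fits this case; it has to run before Task.init, and the project/task names are placeholders:)
from clearml import Task

# pin the exact nightly build so it ends up in the installed packages section
Task.add_requirements("torch", "==1.6.0.dev20200430+cu101")
task = Task.init(project_name="examples", task_name="nightly-torch")  # placeholder names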
ValueError: Task has no hyperparams section defined
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
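(roughly what the two pieces look like side by side; the task id, log_dir, and loss value are illustrative:)
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# resume script: continue counting from the last reported iteration
task = Task.get_task(task_id="...")  # placeholder id
task.set_initial_iteration(task.get_last_iteration())

# training code: the epoch is passed to the writer explicitly
writer = SummaryWriter(log_dir="runs/resume")  # placeholder log_dir
for epoch in range(10):
    loss = 1.0 / (epoch + 1)  # dummy value
    writer.add_scalar("loss", loss, global_step=epoch)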
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news
thank you, I'll let you know if setting it to zero worked