this is how it looks if I zoom in on the epochs that ran before the crash
this would be great. I could just then pass it as a hyperparameter
still no luck, I tried everything =( any updates?
thank you, I'll let you know if setting it to zero worked
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to the values that are explicitly reported, no?
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
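the offset behaviour I mean can be modelled in plain Python (this is a toy simplification of what I think the SDK does on resume, not ClearML's actual code; the class and method names are made up for illustration):

```python
# Hypothetical model of the iteration-offset behaviour:
# after resuming with set_initial_iteration(get_last_iteration()),
# explicitly reported steps appear shifted by that initial offset.

class ResumedTaskModel:
    """Toy stand-in for a resumed task; NOT the real ClearML API."""

    def __init__(self, last_iteration):
        # analogous to task.set_initial_iteration(task.get_last_iteration())
        self.initial_iteration = last_iteration
        self.reported_steps = []

    def report_scalar(self, global_step):
        # the x-axis value ends up offset by the initial iteration,
        # even though global_step was passed explicitly
        self.reported_steps.append(self.initial_iteration + global_step)

task = ResumedTaskModel(last_iteration=100)
for epoch in range(3):  # training resumes at epoch 0 locally
    task.report_scalar(global_step=epoch)

print(task.reported_steps)  # → [100, 101, 102]
```

so even though I report epochs 0, 1, 2 explicitly, they show up as 100, 101, 102 on the plot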
thanks! this bug and cloning problem seem to be fixed
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
1 - yes, of course =) but it would be awesome if you could customize the content - to include key metrics and hyperparameters, for example
3 - hooooooraaaay
it will probably screw up my resource monitoring plots, but well, who cares 😃
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass 😃
maybe I should use explicit reporting instead of Tensorboard
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
sorry, my bad, after some manipulations I made it work. I have to manually change HTTP to HTTPS in the config file for the Web and Files (not API) server after initialization, but besides that it works
it’s a pretty standard pytorch train/eval loop, using pytorch dataloader and https://docs.monai.io/en/stable/_modules/monai/data/dataset.html
we’re using the latest ClearML server and client version (1.2.0)
perhaps it’s happening because it’s an old project that was moved to the new root project?
yeah, it works for the new projects and for the old projects that already had a description
it prints an empty dict
I’m doing Task.init() in the script, maybe it somehow resets connected parameters… but it used to work before, weird
I change the arguments in Web UI, but it looks like they are not parsed by trains
I've already pulled new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!
hmmm allegroai/trains:latest whatever it is
I've done it many times, using different devices. sometimes it works, sometimes it doesn't