so the max values that I get can be reached at different epochs
yeah, that sounds right! thanks, will try
I don't connect anything explicitly, I'm using argparse, it used to work before the update
1 - yes, of course =) but it would be awesome if you could customize the content - to include key metrics and hyperparameters, for example
3 - hooooooraaaay
I'll get back to you with the logs when the problem occurs again
if you click on the experiment name here, you get a 404 because the link looks like this:
https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID
when it should look like this:
https://DOMAIN/projects/PROJECT_ID/experiments/EXPERIMENT_ID
the code that is used for training the model is also inside the image
this would definitely be a nice addition. the number of hyperparameters in our models often goes up to 100
not necessarily, the command usually stays the same irrespective of the machine
yeah, I am aware of trains-agent, we are planning to start using it soon, but still, copying the original training command would be useful
I don't think so, because the max value of each metric is calculated independently of the other metrics
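roughly like this (toy example, the metric names and numbers are made up):
per_epoch = {
    "accuracy": [0.71, 0.80, 0.78],  # best at epoch 2
    "f1":       [0.65, 0.66, 0.70],  # best at epoch 3
}
# each metric's max is taken on its own, so the maxima can come from different epochs
best = {name: max(values) for name, values in per_epoch.items()}
print(best)  # {'accuracy': 0.8, 'f1': 0.7}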
that was tough, but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
no, I even added the argument to specify tensorboard log_dir to make sure this is not happening
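just to show what I mean by specifying log_dir (a rough sketch, the tb_logs path is something I made up for the example):
import os, time
from torch.utils.tensorboard import SummaryWriter

# give every run its own directory instead of the default ./runs,
# so concurrent experiments don't collide while creating it
log_dir = os.path.join("tb_logs", time.strftime("%Y%m%d-%H%M%S"))
os.makedirs(log_dir, exist_ok=True)  # create the directory up front
writer = SummaryWriter(log_dir=log_dir)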
copy-pasting the entire training command into the command line
it might be that there is not enough space on our SSD, experiments cache a lot of preprocessed data during the first epoch...
nope, same problem even after creating a new experiment from scratch
python3 slack_alerts.py --channel trains-alerts --slack_api "OUR_KEY" --include_completed_experiments --include_manual_experiments
dunno if it's relevant, but I also added a new user to apiserver.conf today
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
we're using the latest versions of clearml, clearml-agent, and clearml-server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess
I'm so happy to see that this problem has been finally solved!
perhaps I need to do task.set_initial_iteration(0)?
does this mean that setting initial iteration to 0 should help?
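something like this, I guess (just a sketch, the project and task names are placeholders):
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_experiment")
# reset the reported iteration offset so scalars start from 0
# instead of continuing from the previous run's last iteration
task.set_initial_iteration(0)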