we have a bare-metal server with ClearML agents, and sometimes there are hanging containers or containers that consume too much RAM. unless I explicitly add a container name in the container arguments, it gets a random name, which is not very convenient. it would be great if we could set a default container name for each experiment (e.g., the experiment id)
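for now I do it manually per experiment, roughly like this (a sketch; assuming the agent forwards docker_arguments straight to docker run, and the image is just an example):

# sketch: name the spawned container after the task id
# (assumes docker_arguments is passed through to `docker run`)
from clearml import Task

task = Task.init(project_name="examples", task_name="container-name-demo")
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",  # example image
    docker_arguments="--name clearml_{}".format(task.id),
)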
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
yeah, that sounds right! thanks, will try
I don't connect anything explicitly, I'm using argparse, it used to work before the update
1 - yes, of course =) but it would be awesome if you could customize the content - to include key metrics and hyperparameters, for example
3 - hooooooraaaay
I'll get back to you with the logs when the problem occurs again
if you click on the experiment name here, you get a 404 because the link looks like this:
https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID
when it should look like this:
https://DOMAIN/projects/PROJECT_ID/experiments/EXPERIMENT_ID
it will probably screw up my resource monitoring plots, but well, who cares =)
the code that is used for training the model is also inside the image
tags are somewhat fine for this, I guess, but there will be too many of them eventually, and they do not reflect the sequential nature of the experiments
this would definitely be a nice addition. the number of hyperparameters in our models often goes up to 100
not necessarily, the command usually stays the same irrespective of the machine
yeah, I am aware of trains-agent, we are planning to start using it soon, but still, copying the original training command would be useful
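in the meantime I can probably record it myself, something like this (a sketch; the parameter name is made up):

# sketch: store the exact command line on the task so it can be copied later
import sys
from trains import Task  # `from clearml import Task` on newer versions

task = Task.init(project_name="examples", task_name="my-training")
task.set_parameter("Args/original_command", " ".join(sys.argv))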
still no luck, I tried everything =( any updates?
I don't think so because max value of each metric is calculated independently of other metrics
yeah, it works for the new projects and for the old projects that already had a description
Error 12 : Validation error (value '['13b46b9325954517ab99381d5f45237d', 'bc76c3a7f0f6431b8e064212e9bdd2c0', '5d2a57cd39b94250b8c8f52303ccef92', 'e4731ee5b33e41d992d6d3fdb2913045', '698d9231155e41fbb61f8f3faa605727', '2171b190507f40d1be35e222045c58ea', '55c81a5db0ad40bebf72fdcc1b3be2a4', '94fbdbe26ef242d793e18d955cb3de58', '7d8a6c8f2ae246478b39ae5e87def2ad', '141594c146fe495886d477d9a27c465f', '640f87b02dc94a4098a0aba4d855b8f5']' length is bigger than allowed maximum '10'.)
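I guess I can batch the ids client-side in the meantime, something like this (a sketch; print stands in for the actual API call):

# sketch: the server rejects lists longer than 10, so send the ids in chunks
task_ids = [
    "13b46b9325954517ab99381d5f45237d",
    "bc76c3a7f0f6431b8e064212e9bdd2c0",
    # ... the rest of the ids
]

def chunks(items, size=10):
    for i in range(0, len(items), size):
        yield items[i:i + size]

for batch in chunks(task_ids):
    print(batch)  # replace with the call that was hitting the limit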
that was tough but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
no, I even added the argument to specify the tensorboard log_dir to make sure this is not happening
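i.e. something like this (a sketch; the directory is just an example):

import os
from torch.utils.tensorboard import SummaryWriter

log_dir = os.path.join("runs", "my_experiment")  # explicit dir instead of the default
os.makedirs(log_dir, exist_ok=True)  # make sure it exists before the writer opens files
writer = SummaryWriter(log_dir=log_dir)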
copy-pasting the entire training command into the command line =)