
yeah, I was thinking mainly about AWS. we use force to make sure we are using the correct latest checkpoint, but this increases costs when we are running a lot of experiments
Error 12 : Validation error (value '['13b46b9325954517ab99381d5f45237d', 'bc76c3a7f0f6431b8e064212e9bdd2c0', '5d2a57cd39b94250b8c8f52303ccef92', 'e4731ee5b33e41d992d6d3fdb2913045', '698d9231155e41fbb61f8f3faa605727', '2171b190507f40d1be35e222045c58ea', '55c81a5db0ad40bebf72fdcc1b3be2a4', '94fbdbe26ef242d793e18d955cb3de58', '7d8a6c8f2ae246478b39ae5e87def2ad', '141594c146fe495886d477d9a27c465f', '640f87b02dc94a4098a0aba4d855b8f5']' length is bigger than allowed maximum '10'.)
we often do ablation studies with more than 50 experiments, and it was very convenient to compare their dynamics at different epochs
we already have a cleanup service set up and running, so we should be good from now on
well okay, it's probably not that weird considering that worker just runs the container
I don't think so because max value of each metric is calculated independently of other metrics
so max values that I get can be reached at the different epochs
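a quick illustration of what I mean (made-up metric names and numbers): each metric's max is taken on its own, so the maxima can land on different epochs

```python
# Made-up example: per-metric maxima computed independently can come
# from different epochs.
history = {
    "accuracy": [0.71, 0.85, 0.83],  # best at epoch 1
    "recall":   [0.60, 0.62, 0.70],  # best at epoch 2
}

# For each metric, find (epoch, value) of its maximum value.
best = {
    name: max(enumerate(values), key=lambda e: e[1])
    for name, values in history.items()
}

print(best)  # {'accuracy': (1, 0.85), 'recall': (2, 0.70)}
```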
fantastic, everything is working perfectly
thanks guys
nope, that's the point, quite often we run experiments separately, but they are related to each other. currently there's no way to see that one experiment is using a checkpoint from a previous experiment, since we have to manually insert the S3 link as a hyperparameter. it would be useful to see these connections. maybe instead of grouping we could see which experiments are using artifacts of this experiment
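to make the request concrete, here's a hypothetical sketch of the lineage view I have in mind (this is not ClearML API, just an illustration — the class and experiment names are made up): a registry that maps each experiment to the experiments whose artifacts it consumes, queryable in both directions

```python
# Hypothetical lineage registry: records "experiment X uses an artifact
# produced by experiment Y" and answers parent/child queries.
from collections import defaultdict

class LineageRegistry:
    def __init__(self):
        self._parents = defaultdict(set)   # experiment -> experiments it reads from
        self._children = defaultdict(set)  # experiment -> experiments that read it

    def record_usage(self, experiment, artifact_source):
        """Record that `experiment` uses an artifact produced by `artifact_source`."""
        self._parents[experiment].add(artifact_source)
        self._children[artifact_source].add(experiment)

    def parents(self, experiment):
        return sorted(self._parents[experiment])

    def children(self, experiment):
        return sorted(self._children[experiment])

reg = LineageRegistry()
reg.record_usage("exp-2", "exp-1")  # exp-2 loads exp-1's checkpoint
reg.record_usage("exp-3", "exp-1")
print(reg.children("exp-1"))  # ['exp-2', 'exp-3']
print(reg.parents("exp-2"))   # ['exp-1']
```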
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
parents and children. maybe tags, maybe separate tab or section, idk. I wonder if anyone else is interested in this functionality, for us this is a very common case
on a side note, is there any way to automatically give more meaningful names to the running docker containers?
this definitely would be a nice addition. number of hyperparameters in our models often goes up to 100
copy-pasting the entire training command into the command line
we're using the latest ClearML server and client version (1.2.0)
it works, but it's not very helpful since everybody can see the secret in the logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'DB_PASSWORD=password']
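one thing that may help here (assuming a recent clearml-agent — worth checking your version's docs): the agent config has a `hide_docker_command_env_vars` section that masks environment-variable values in the printed docker command. a sketch of the relevant `clearml.conf` fragment, with `DB_PASSWORD` as the example key from the log above:

```
# clearml.conf (agent side) -- mask env var values in the logged docker command
agent {
    hide_docker_command_env_vars {
        enabled: true
        # extra variable names to mask, in addition to the agent's default list
        extra_keys: ["DB_PASSWORD"]
    }
}
```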