we often run ablation studies with more than 50 experiments, and it was very convenient to compare their dynamics at different epochs
Error 12 : Validation error (value ‘['13b46b9325954517ab99381d5f45237d’, ‘bc76c3a7f0f6431b8e064212e9bdd2c0’, ‘5d2a57cd39b94250b8c8f52303ccef92’, ‘e4731ee5b33e41d992d6d3fdb2913045’, ‘698d9231155e41fbb61f8f3faa605727’, ‘2171b190507f40d1be35e222045c58ea’, ‘55c81a5db0ad40bebf72fdcc1b3be2a4’, ‘94fbdbe26ef242d793e18d955cb3de58’, ‘7d8a6c8f2ae246478b39ae5e87def2ad’, ‘141594c146fe495886d477d9a27c465f’, ‘640f87b02dc94a4098a0aba4d855b8f5’]' length is bigger than allowed maximum ‘10’.)
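The error above rejects a list of 11 IDs against a maximum of 10. One workaround, sketched here under the assumption that the same request can simply be issued several times, is to split the IDs into groups of at most 10. The helper name `batched` and the placeholder IDs are hypothetical; which call enforces the limit is not stated in the thread.

```python
from typing import Iterator, List, Sequence

MAX_IDS_PER_REQUEST = 10  # the limit quoted in the validation error


def batched(ids: Sequence[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# 11 placeholder IDs, the same count that triggered the error
task_ids = [f"task-{i:02d}" for i in range(11)]
batches = list(batched(task_ids))
```

Each element of `batches` can then be passed to the size-limited call separately.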
yes. we upload artifacts to Yandex.Cloud S3 using ClearML. we set "s3://storage.yandexcloud.net/clearml-models" as the output_uri parameter and add this section to the config: {host: "http://storage.yandexcloud.net", key: "KEY", secret: "SECRET_KEY", secure: true}
this works like a charm, but the download button in the UI is not working
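To see how the output URI above lines up with the config section, here is a minimal sketch that splits such a URI into an endpoint host, a bucket, and an optional key prefix. It assumes that for non-AWS endpoints the first component after `s3://` is the endpoint host and the next one is the bucket; the function name `split_s3_uri` is hypothetical and not part of ClearML.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str, str]:
    """Return (endpoint_host, bucket, key_prefix) for an s3://host/bucket/... URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://host/bucket[/prefix], got {uri!r}")
    # The path looks like "/bucket/optional/prefix"; drop the leading slash.
    bucket, _, prefix = parsed.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError(f"no bucket found in {uri!r}")
    return parsed.netloc, bucket, prefix


host, bucket, prefix = split_s3_uri("s3://storage.yandexcloud.net/clearml-models")
```

Under that assumption, the host part is what the credentials entry in the config is matched against.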
fantastic, everything is working perfectly
thanks guys
is it in the documentation somewhere?
we're using the latest version of clearml, clearml agent and clearml server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess 😃
that's right
for example, there are tasks A, B, C
we run multiple experiments for A, finetune some of them in separate tasks, then choose one or more best checkpoints, run some experiments for task B, choose the best experiment, and finally run task C
so we get a chain of tasks: A - A-ft - B - C
a ClearML pipeline doesn't quite work here because we would like to analyze the results of each step before starting the next task
but it would be great to see predecessors of each experiment in the chain
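The chain described above can be modeled as a child-to-parent mapping and walked back to its root. This is only a sketch of the idea: the `parent_of` dictionary is hypothetical, and how the link is actually stored (a tag, a parameter, a parent field on the task) is left open.

```python
from typing import Dict, List, Optional


def lineage(task_id: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of predecessors ending at `task_id`, oldest first."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = task_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parent_of.get(current)  # None ends the walk at the root
    chain.reverse()
    return chain


# The example chain from the message: A - A-ft - B - C
parent_of = {"A": None, "A-ft": "A", "B": "A-ft", "C": "B"}
chain = lineage("C", parent_of)
```

Calling `lineage` on any experiment gives the full list of predecessors that the message asks to see.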
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
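One way to get the same effect without touching the stored value is to shift every reported iteration by a fixed offset. The sketch below shows that idea in plain Python; `OffsetReporter` and the `report` callable are hypothetical stand-ins for whatever logging call the training script uses. ClearML's `Task.set_initial_iteration(offset)` appears to target this case, but treat that as a pointer to verify rather than a confirmed answer.

```python
from typing import Callable, List, Tuple

ReportFn = Callable[[str, float, int], None]


class OffsetReporter:
    """Wrap a report function so that every iteration is shifted by `offset`."""

    def __init__(self, report: ReportFn, offset: int) -> None:
        self._report = report
        self._offset = offset

    def __call__(self, name: str, value: float, iteration: int) -> None:
        self._report(name, value, iteration + self._offset)


# Stand-in for a real logger: collect the reported points in a list.
logged: List[Tuple[str, float, int]] = []
reporter = OffsetReporter(lambda n, v, i: logged.append((n, v, i)), offset=1000)
reporter("loss", 0.5, 0)
reporter("loss", 0.4, 1)
```

Here the continued run starts reporting at iteration 1000 instead of overwriting iterations 0 and 1.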
agent.hide_docker_command_env_vars.extra_keys: ["DB_PASSWORD=password"]
like this? or ["DB_PASSWORD", "password"]
it works, but it's not very helpful since everybody can see a secret in logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'DB_PASSWORD=password']
we're using os.getenv in the script to get a value for these secrets
any suggestions on how to fix it?
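For illustration, this is the kind of masking one would want applied to the logged command: replace the value of selected `-e NAME=value` arguments before printing. It is a standalone sketch, not the agent's implementation; the function name `mask_docker_env` and the mask string are assumptions, and it takes variable names (such as "DB_PASSWORD") rather than `NAME=value` pairs.

```python
from typing import Iterable, List


def mask_docker_env(cmd: List[str], secret_keys: Iterable[str], mask: str = "********") -> List[str]:
    """Return a copy of `cmd` with the values of selected -e/--env variables hidden."""
    keys = set(secret_keys)
    masked: List[str] = []
    previous = None
    for arg in cmd:
        # Only arguments that directly follow -e/--env are NAME=value pairs.
        if previous in ("-e", "--env") and "=" in arg:
            name, _, _value = arg.partition("=")
            if name in keys:
                arg = f"{name}={mask}"
        masked.append(arg)
        previous = arg
    return masked


# The command from the log line above
cmd = ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'DB_PASSWORD=password']
safe = mask_docker_env(cmd, ["DB_PASSWORD"])
```

The script inside the container still reads the real value with `os.getenv("DB_PASSWORD")`; only the printed command changes.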
I guess this could overcomplicate the UI; I don't see a good solution yet.
as a quick hack, we can just use a separate name (e.g., "best_val_roc_auc") for all metric values for the current best checkpoint. then we can just add columns with the last value of this metric
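The quick hack above could look roughly like this: whenever the monitored metric improves, re-report every metric of that checkpoint under a "best_" prefix, so the last value of each "best_*" series belongs to the best checkpoint. The class name `BestCheckpointMetrics`, the `report` callable, and the sample numbers are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

ReportFn = Callable[[str, float, int], None]


class BestCheckpointMetrics:
    """Re-report all metrics under a 'best_' prefix when the monitored one improves."""

    def __init__(self, report: ReportFn, monitor: str, higher_is_better: bool = True) -> None:
        self._report = report
        self._monitor = monitor
        self._higher = higher_is_better
        self.best: Optional[float] = None

    def update(self, metrics: Dict[str, float], iteration: int) -> bool:
        value = metrics[self._monitor]
        if self.best is None:
            improved = True
        elif self._higher:
            improved = value > self.best
        else:
            improved = value < self.best
        if improved:
            self.best = value
            for name, metric_value in metrics.items():
                self._report(f"best_{name}", metric_value, iteration)
        return improved


# Stand-in for a real logger: collect the reported points in a list.
logged: List[Tuple[str, float, int]] = []
tracker = BestCheckpointMetrics(lambda n, v, i: logged.append((n, v, i)), monitor="val_roc_auc")
tracker.update({"val_roc_auc": 0.80, "val_loss": 0.50}, iteration=1)
tracker.update({"val_roc_auc": 0.78, "val_loss": 0.45}, iteration=2)  # no improvement
tracker.update({"val_roc_auc": 0.85, "val_loss": 0.48}, iteration=3)
```

A column showing the last value of "best_val_roc_auc" would then show the best checkpoint's score rather than the final epoch's.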
this is the artifactory, this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...
right now we can pass github secrets to the clearml agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
I've already pulled new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!