agent.hide_docker_command_env_vars.extra_keys: ["DB_PASSWORD=password"]
like this? or ["DB_PASSWORD", "password"]
it works, but it's not very helpful since everyone can still see the secret in the logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'DB_PASSWORD=password']
any suggestions on how to fix it?
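one guess: maybe extra_keys is supposed to take just the variable names (not KEY=value pairs), and the agent then masks the value in that printout? something like

agent.hide_docker_command_env_vars.extra_keys: ["DB_PASSWORD"]

not sure though, just going by how I read the setting name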
I guess this could overcomplicate the UI; I don't see a good solution yet.
as a quick hack, we can just use a separate name (e.g. "best_val_roc_auc") for all metric values of the current best checkpoint. then we can just add columns with the last value of this metric
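roughly what I have in mind, a quick sketch assuming the clearml Logger API (the metric/series names here are just placeholders):

from clearml import Logger

def report_val_metric(val_roc_auc, best_so_far, iteration):
    logger = Logger.current_logger()
    # regular per-iteration metric
    logger.report_scalar(title="val", series="roc_auc", value=val_roc_auc, iteration=iteration)
    # duplicate the value under a separate name only when a new best checkpoint is saved,
    # so the "last value" of this series in the experiment table always shows the current best
    if val_roc_auc > best_so_far:
        best_so_far = val_roc_auc
        logger.report_scalar(title="best", series="val_roc_auc", value=best_so_far, iteration=iteration)
    return best_so_far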
this is the artifactory; this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...
right now we can pass GitHub secrets to the clearml agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations
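the workaround I'm looking at is passing it as a docker env var through the agent config and then masking it with the hide_docker_command_env_vars setting above; just a sketch, assuming extra_docker_arguments simply gets appended to the docker run command:

agent {
    # extra arguments added to every docker run this agent launches (the value below is a placeholder)
    extra_docker_arguments: ["-e", "DB_PASSWORD=my-secret"]
}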
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
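if it helps, this is roughly how I plan to check what's there before deleting anything, using plain Elasticsearch REST calls (the port is an assumption about the default trains-server docker-compose, and the index name below is a placeholder):

import requests

ES = "http://localhost:9200"  # assuming the ES port exposed by the default docker-compose

# list indices with their sizes to find the heavy ones
for row in requests.get(f"{ES}/_cat/indices?format=json&h=index,store.size").json():
    print(row["index"], row["store.size"])

# then delete a specific old one by name; this permanently removes whatever scalars/logs it holds
# requests.delete(f"{ES}/events-training_stats_scalar-SOME_OLD_SUFFIX")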
I've already pulled new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!
I've done it many times, using different devices. sometimes it works, sometimes it doesn't
don't know if it's relevant, but I also added a new user to apiserver.conf today
I decided to restart the containers one more time, this is what I got.
I had to restart Docker service to remove the containers
I'll get back to you with the logs when the problem occurs again
as a side note, I'm not able to pull the newest release; looks like it hasn't been pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I assume the temporary fix is to switch to trains-server?
hmmm allegroai/trains:latest whatever it is
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
I guess I could manually explore different containers and their content. as far as I remember, I had to update Elastic records when we moved to the new cloud provider in order to update model URLs
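for reference, the update I ran back then looked roughly like this (the index pattern, the "url" field and the bucket URLs are placeholders, I don't remember the exact document layout):

import requests

ES = "http://localhost:9200"

# rewrite stored URLs in place after the storage move, via a standard update-by-query
body = {
    "script": {
        "source": "ctx._source.url = ctx._source.url.replace(params.old, params.new)",
        "params": {"old": "s3://old-bucket/", "new": "s3://new-bucket/"},
    },
    "query": {"prefix": {"url": "s3://old-bucket/"}},
}
print(requests.post(f"{ES}/events-*/_update_by_query", json=body).json())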
it also happens sometimes during the run when tensorboard is trying to write something to disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers