we're using os.getenv in the script to get the values of these secrets
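roughly like this, just a minimal sketch of how the script reads it (the DB_PASSWORD name is only an example):

import os

# the secret is injected into the container as an environment variable;
# fail loudly if it's missing instead of connecting with an empty password
db_password = os.getenv("DB_PASSWORD")
if db_password is None:
    raise RuntimeError("DB_PASSWORD is not set in the container environment")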
agent.hide_docker_command_env_vars.extra_keys: ["DB_PASSWORD=password"]
like this? or ["DB_PASSWORD", "password"]
more like collapse/expand, I guess. Or pipelines that you can compose after running experiments, to see that the experiments are connected to each other
right now we can pass GitHub secrets to the ClearML agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations
isn't this parameter related to communication with the ClearML Server? I'm trying to make sure that the checkpoint will be downloaded from AWS S3 even if there are temporary connection problems
there's a TransferConfig parameter in boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig), but I'm not sure if there's an easy way to pass it to StorageManager
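just to illustrate, a rough sketch of how I'd set it up with plain boto3 (the bucket and key names are made up), I just don't see a clean way to hand this to StorageManager:

import boto3
from boto3.s3.transfer import TransferConfig

# allow extra download attempts so a flaky connection doesn't fail the whole run
transfer_config = TransferConfig(num_download_attempts=10)

s3 = boto3.client("s3")
s3.download_file("my-bucket", "checkpoints/model.pt", "model.pt", Config=transfer_config)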
not necessarily, there are rare cases when the container keeps running after the experiment is stopped or aborted
will do!
so the max values that I get can be reached at different epochs
we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems 😃 the Slack bot works though! 🎉
the new icons are slick, it would be even better if you could upload custom icons for different projects
parents and children. maybe tags, maybe a separate tab or section, idk. I wonder if anyone else is interested in this functionality; for us this is a very common case
the weird part is that the old job continues running when I recreate the worker and enqueue the new job
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
our GPUs have 48 GB each, so it's quite wasteful to run only one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out-of-memory error, but still
I don't think so, because the max value of each metric is calculated independently of the other metrics
tags are somewhat fine for this, I guess, but there will be too many of them eventually, and they do not reflect the sequential nature of the experiments
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID 😃
is it in the documentation somewhere?
it works, but it's not very helpful since everybody can see the secret in the logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'DB_PASSWORD=password']