I updated S3 credentials, I'll check if they work later
it doesn't explain inability to delete logged images and texts though
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory error, but still
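A rough way to reason about sharing a 48GB card is to estimate how many jobs fit given each job's peak memory. The sketch below is only arithmetic: the per-job figure and the headroom value are hypothetical and would have to be measured for real workloads.

```python
def jobs_per_gpu(total_gb: float, per_job_gb: float, headroom_gb: float = 2.0) -> int:
    """Return how many jobs fit on one GPU while keeping some memory free.

    headroom_gb is an assumed safety margin, not a measured value.
    """
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    usable_gb = total_gb - headroom_gb
    # Floor division: a job that only partially fits would hit out-of-memory.
    return max(int(usable_gb // per_job_gb), 0)


# Hypothetical example: jobs peaking at about 11 GB on a 48 GB GPU.
print(jobs_per_gpu(48, 11))
```

The estimate is only as good as the peak-memory number, so it lowers the chance of out-of-memory failures rather than ruling them out.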
Error 12 : Validation error (value '['13b46b9325954517ab99381d5f45237d', 'bc76c3a7f0f6431b8e064212e9bdd2c0', '5d2a57cd39b94250b8c8f52303ccef92', 'e4731ee5b33e41d992d6d3fdb2913045', '698d9231155e41fbb61f8f3faa605727', '2171b190507f40d1be35e222045c58ea', '55c81a5db0ad40bebf72fdcc1b3be2a4', '94fbdbe26ef242d793e18d955cb3de58', '7d8a6c8f2ae246478b39ae5e87def2ad', '141594c146fe495886d477d9a27c465f', '640f87b02dc94a4098a0aba4d855b8f5']' length is bigger than allowed maximum '10'.)
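The error above rejects a list of 11 IDs because the allowed maximum is 10. When a call with such a cap can be issued in batches (an assumption, it does not hold for every operation), a generic workaround is to split the ID list into chunks. A minimal sketch, with the `chunked` helper and the sample IDs being hypothetical:

```python
from typing import Iterator, List, Sequence

MAX_IDS_PER_CALL = 10  # limit quoted in the validation error


def chunked(ids: Sequence[str], size: int = MAX_IDS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Hypothetical IDs standing in for the 11 task IDs in the error.
task_ids = [f"task_{i}" for i in range(11)]
for batch in chunked(task_ids):
    print(len(batch), batch[0])
```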
thanks! this bug and cloning problem seem to be fixed
in order to use private repositories for our experiments I add agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from non-existing repo, ClearML will print
fatal: repository 'https://username:token@github.com/our_organization/non_existing_repo.git/' not found
exposing the real token
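To illustrate the kind of masking that would avoid this, here is a minimal sketch that strips `user:password@` credentials from any URL in a log line before it is printed. It is not how ClearML handles this; the function name and the regex are assumptions for illustration only.

```python
import re

# Matches "scheme://user:secret@" so the user and secret can be masked.
_CREDENTIALS_IN_URL = re.compile(
    r"(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@"
)


def redact_url_credentials(text: str) -> str:
    """Replace user:secret pairs embedded in URLs with a fixed mask."""
    return _CREDENTIALS_IN_URL.sub(r"\g<scheme>***:***@", text)


line = (
    "fatal: repository "
    "'https://username:token@github.com/our_organization/non_existing_repo.git/'"
    " not found"
)
print(redact_url_credentials(line))
```

URLs without embedded credentials pass through unchanged, so the filter can be applied to every line of git output.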
is it in documentation somewhere?
[2020-06-09 16:03:19,851] [8] [ERROR] [trains.mongo.initialize] Failed creating fixed user John Doe: 'key'
Requirement already satisfied (use --upgrade to upgrade): celsusutils==0.0.1
we often do ablation studies with more than 50 experiments, and it was very convenient to compare their dynamics at the different epochs
maybe db somehow got corrupted or smth like this? I'm clueless
on a side note, is there any way to automatically give more meaningful names to the running docker containers?
problem is solved. I had to replace /opt/trains/data/fileserver with /opt/clearml/data/fileserver in Agent configuration, and replace trains with clearml in Requirements
WARNING: You are using pip version 20.1.1; however, version 20.3.3 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
trains_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the TRAINS API server http://apiserver:8008 ?
http://OUR_IP:8081 http://OUR_IP:8080
http://apiserver:8008
yes, this is the use case, I think we can use smth like Redis for this communication
yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
wow, thanks, just updated our server!
can't seem to find these metrics snapshot plots =) how do I plot one?
{
    username: "username"
    password: "password"
    name: "John Doe"
},
right now we can pass github secrets to the clearml agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations
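One common pattern for this is to have the training code read the database credentials from environment variables, however they end up being injected into the container. A minimal sketch of the reading side; the variable names `ANNOTATIONS_DB_USER` and `ANNOTATIONS_DB_PASS` are hypothetical, and nothing here is a ClearML API:

```python
import os
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class DbCredentials:
    user: str
    # repr=False keeps the password out of accidental prints and logs.
    password: str = field(repr=False)


def load_db_credentials(env: Mapping[str, str] = os.environ) -> DbCredentials:
    """Read credentials from the environment, failing early if any is missing."""
    names = ("ANNOTATIONS_DB_USER", "ANNOTATIONS_DB_PASS")  # hypothetical names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return DbCredentials(user=env[names[0]], password=env[names[1]])
```

Passing `env` explicitly makes the function easy to exercise without touching the real environment, for example `load_db_credentials({"ANNOTATIONS_DB_USER": "u", "ANNOTATIONS_DB_PASS": "p"})`.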
not necessarily, there are rare cases when container keeps running after experiment is stopped or aborted
will do!
yes. we upload artifacts to Yandex.Cloud S3 using ClearML. we set "s3://storage.yandexcloud.net/clearml-models" as output uri parameter and add this section to the config:
{
    host: "http://storage.yandexcloud.net"
    key: "KEY"
    secret: "SECRET_KEY",
    secure: true
}
this works like a charm. but download button in UI is not working