I change the arguments in Web UI, but it looks like they are not parsed by trains
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
same here, changing arguments in the Args section of Hyperparameters doesn’t work, training script starts with the default values.
trains 0.16.0
trains-agent 0.16.0
trains-server 0.16.0
I updated the version in the Installed packages section before starting the experiment
this definitely would be a nice addition. number of hyperparameters in our models often goes up to 100
I guess this could overcomplicate the UI, I don't see a good solution yet.
as a quick hack, we can just use a separate name (e.g. "best_val_roc_auc") for all metric values for the current best checkpoint. then we can just add columns with the last value of this metric
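a rough sketch of that hack, in case it helps. `report` here is a hypothetical callback standing in for whatever logger call you already use (it is not a trains API), and I'm assuming a higher value of the monitored metric is better:

```python
from typing import Callable, Dict, Iterable, List, Tuple


def report_with_best(
    epoch_metrics: Iterable[Dict[str, float]],
    report: Callable[[str, float, int], None],
    monitor: str = "val_roc_auc",
) -> float:
    """Report every metric, plus a "best_"-prefixed copy at each new best checkpoint.

    `report(name, value, iteration)` is a hypothetical hook; plug in your own logger.
    Assumes a higher value of `monitor` means a better checkpoint.
    """
    best = float("-inf")
    for iteration, metrics in enumerate(epoch_metrics):
        # Regular per-epoch reporting.
        for name, value in metrics.items():
            report(name, value, iteration)
        # On improvement, re-report all metrics under the separate "best_" names,
        # so their last value always belongs to the current best checkpoint.
        if metrics[monitor] > best:
            best = metrics[monitor]
            for name, value in metrics.items():
                report(f"best_{name}", value, iteration)
    return best


# Made-up numbers, only to show the flow.
history = [
    {"val_roc_auc": 0.71, "val_loss": 0.60},
    {"val_roc_auc": 0.78, "val_loss": 0.52},
    {"val_roc_auc": 0.75, "val_loss": 0.55},
]
logged: List[Tuple[str, float, int]] = []
best = report_with_best(history, lambda n, v, i: logged.append((n, v, i)))
```

the last reported value of each "best_..." series is then the value at the best checkpoint, which is what the extra columns would show.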
running docker network prune before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
I don’t connect anything explicitly, I’m using argparse, it used to work before the update
btw, there are "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" in the apiserver logs again
I updated S3 credentials, I'll check if they work later
it doesn't explain the inability to delete logged images and texts though
we do log a lot of different metrics, maybe this can be part of the problem
if you click on the experiment name here, you get 404 because the link looks like this:
https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID
when it should look like this:
https://DOMAIN/projects/PROJECT_ID/experiments/EXPERIMENT_ID
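to make the difference concrete, here is a small sketch that turns the broken form into the expected one. `fix_experiment_link` is a hypothetical helper written only for illustration, it is not part of trains, and it assumes the broken link always has exactly the three path segments shown above:

```python
from urllib.parse import urlsplit, urlunsplit


def fix_experiment_link(url: str) -> str:
    """Insert the missing "experiments" segment into a /projects/<id>/<id> link.

    Hypothetical helper for illustration; links that already have a different
    shape are returned unchanged.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Broken form: /projects/PROJECT_ID/EXPERIMENT_ID
    if len(segments) == 3 and segments[0] == "projects":
        segments.insert(2, "experiments")
    return urlunsplit(parts._replace(path="/" + "/".join(segments)))


fixed = fix_experiment_link("https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID")
```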
not necessarily, there are rare cases when container keeps running after experiment is stopped or aborted
will do!
just DMed you a screenshot where you can see a part of the token
parents and children. maybe tags, maybe separate tab or section, idk. I wonder if anyone else is interested in this functionality, for us this is a very common case
I'll get back to you with the logs when the problem occurs again
I decided to restart the containers one more time, this is what I got.
I had to restart Docker service to remove the containers
copy-pasting entire training command into command line 😃
yeah, server (1.0.0) and client (1.0.1)
problem is solved. I had to replace /opt/trains/data/fileserver with /opt/clearml/data/fileserver in the Agent configuration, and replace trains with clearml in Requirements
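for anyone hitting the same thing, a minimal sketch of those two replacements as plain text edits. the function names are hypothetical, the inputs are just strings (I'm not assuming any particular config file location or format), and version pins are left untouched, so they may need updating separately:

```python
import re
from typing import List

OLD_FILESERVER = "/opt/trains/data/fileserver"
NEW_FILESERVER = "/opt/clearml/data/fileserver"


def migrate_agent_config(text: str) -> str:
    """Point the fileserver path at the clearml data directory."""
    return text.replace(OLD_FILESERVER, NEW_FILESERVER)


def migrate_requirements(lines: List[str]) -> List[str]:
    """Rename the `trains` requirement to `clearml`.

    Only a requirement whose name is exactly `trains` is renamed; other
    packages and any version specifier are kept as they are.
    """
    pattern = re.compile(r"^trains(?=\s*(?:[=<>!~]|$))")
    return [pattern.sub("clearml", line) for line in lines]


# Illustrative inputs only.
config_after = migrate_agent_config("path: /opt/trains/data/fileserver")
requirements_after = migrate_requirements(["trains", "numpy==1.19.0", "trains >= 0.16"])
```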
any suggestions on how to fix it?