we're using the latest ClearML server and client version (1.2.0)
maybe the db somehow got corrupted or smth like this? I'm clueless
some of the "tasks.get_all_ex" POST requests fail, as far as I can see
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
we do log a lot of different metrics, maybe this could be part of the problem
btw, are there any examples of exporting metrics using Python client? I could only find last_metrics attribute of the task
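what I'm after, roughly - a sketch, assuming Task.get_reported_scalars() is the right call for this (I haven't verified):

```python
from clearml import Task

# "TASK_ID" is a placeholder
task = Task.get_task(task_id="TASK_ID")

# the only thing I found so far: the last reported value per metric
print(task.data.last_metrics)

# what I'd like: the full scalar history, assuming get_reported_scalars() does that
scalars = task.get_reported_scalars()
for graph_title, series in scalars.items():
    print(graph_title, list(series.keys()))
```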
we already have cleanup service set up and running, so we should be good from now on
nope, the only changes to config that we made are adding web-auth and non-responsive tasks watchdog
just in case, this warning disappeared after I followed https://stackoverflow.com/questions/49638699/docker-compose-restart-connection-pool-full
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
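for reference, this is roughly how the addition looks in my trains.conf; I'm not 100% sure extra_index_url is the exact key for this, so treat it as a sketch:

```
agent {
    package_manager {
        # extra package source so the agent can resolve the nightly torch builds
        extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
    }
}
```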
I now have another problem: my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably rewrites the folder when cloning the repo. is there any workaround?
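one thing I'm considering (just my own idea, not something trains documents): putting the file outside the repo folder so the clone can't touch it, e.g.

```
# hypothetical location that the repo clone won't overwrite
ADD file.pkl /opt/extra_data/file.pkl
```

and then reading /opt/extra_data/file.pkl directly in the code instead of a path relative to the project root.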
I'll get back to you with the logs when the problem occurs again
oh wow, I didn't see delete_artifacts_and_models option
I guess we'll have to manually find old artifacts that are related to already deleted tasks
hmmm allegroai/trains:latest whatever it is
btw, there are "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" in the apiserver logs again
I've already pulled new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
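to be concrete, this is a simplified sketch of the two pieces I mean (continue_last_task is my assumption about how the resume is wired up, names are placeholders):

```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# resuming script: shift the initial iteration to where the task stopped
task = Task.init(project_name="project", task_name="training",
                 continue_last_task=True)
task.set_initial_iteration(task.get_last_iteration())

# training code: the step is passed explicitly per epoch
writer = SummaryWriter()
for epoch in range(10):      # stand-in for the real epoch loop
    train_loss = 0.0         # stand-in for the real metric
    writer.add_scalar("loss/train", train_loss, global_step=epoch)
```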
do you have any idea why the cleanup task keeps failing then (it used to work before the update)?
yes. we upload artifacts to Yandex.Cloud S3 using ClearML. we set "s3://storage.yandexcloud.net/clearml-models" as the output uri parameter and add this section to the config:
{
    host: "http://storage.yandexcloud.net"
    key: "KEY"
    secret: "SECRET_KEY"
    secure: true
}
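in case it matters, I assume this block sits under sdk.aws.s3.credentials in the config file, i.e. roughly:

```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    host: "http://storage.yandexcloud.net"
                    key: "KEY"
                    secret: "SECRET_KEY"
                    secure: true
                }
            ]
        }
    }
}
```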
this works like a charm. but the download button in the UI is not working
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to values that are explicitly reported, no?
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
maybe I should use explicit reporting instead of TensorBoard
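something like this is what I have in mind for explicit reporting (a sketch; title/series names are just examples):

```python
from clearml import Task

# assumes this runs inside the training script where Task.init was already called
task = Task.current_task()
logger = task.get_logger()

for epoch in range(10):      # stand-in for the real epoch loop
    train_loss = 0.123       # stand-in for the real metric
    logger.report_scalar(title="loss", series="train",
                         value=train_loss, iteration=epoch)
```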
thank you, I'll let you know if setting it to zero worked
it's a pretty standard pytorch train/eval loop, using a pytorch dataloader and monai's Dataset ( https://docs.monai.io/en/stable/_modules/monai/data/dataset.html )
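roughly like this - a minimal sketch with dummy data/model just to show the shape of the loop:

```python
import torch
from torch.utils.data import DataLoader
from monai.data import Dataset

# dummy data, model and loss - placeholders for the real ones
data = [{"img": torch.randn(1, 32, 32), "label": torch.tensor(0.0)} for _ in range(8)]
train_ds = Dataset(data=data, transform=None)
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(2):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch["img"]).squeeze(-1), batch["label"])
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        for batch in train_loader:   # stand-in for a separate eval loader
            _ = model(batch["img"])
```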