I'm not sure it's related to the domain switch, since we upgraded to the newest ClearML server version at the same time
for me, increasing shm-size usually helps. what does this RC fix?
btw, there are "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" in the apiserver logs again
Error
Failed to get Scalar Charts
btw, are there any examples of exporting metrics using Python client? I could only find last_metrics attribute of the task
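A minimal sketch of one way to export scalars with the Python client, assuming the task's `get_reported_scalars()` method returns a nested dict of the form `{title: {series: {"x": [...], "y": [...]}}}`. The helper names `flatten_scalars`, `scalars_to_csv` and `export_task_scalars` are made up for this example, and on older `trains` releases the import would presumably be `from trains import Task` instead.

```python
import csv
import io


def flatten_scalars(scalars):
    """Flatten {title: {series: {"x": [...], "y": [...]}}} into a list of row dicts."""
    rows = []
    for title, series_map in sorted(scalars.items()):
        for series, points in sorted(series_map.items()):
            # Pair each iteration (x) with its reported value (y)
            for x, y in zip(points.get("x", []), points.get("y", [])):
                rows.append(
                    {"title": title, "series": series, "iteration": x, "value": y}
                )
    return rows


def scalars_to_csv(scalars):
    """Render the flattened scalars as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["title", "series", "iteration", "value"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(flatten_scalars(scalars))
    return buf.getvalue()


def export_task_scalars(task_id):
    """Fetch a task's reported scalars from the server and return them as CSV text."""
    # Imported lazily so the helpers above can be used without clearml installed.
    from clearml import Task

    task = Task.get_task(task_id=task_id)
    # Assumption: the return value has the nested shape described in the lead-in.
    return scalars_to_csv(task.get_reported_scalars())
```

Calling `export_task_scalars("<task id>")` needs a configured server connection; the two pure helpers can be tried on a hand-written dict first.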
any suggestions on how to fix it?
running "docker network prune" before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
[2020-06-09 16:03:19,851] [8] [ERROR] [trains.mongo.initialize] Failed creating fixed user John Doe: 'key'
maybe db somehow got corrupted or smth like this? I'm clueless
JIC - trains still works after that, it's just that the new user is not added and hence is not able to login
I've done it many times, using different devices. sometimes it works, sometimes it doesn't
as a sidenote, I am not able to pull the newest release, looks like it's not pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I decided to restart the containers one more time, this is what I got.
I had to restart Docker service to remove the containers
dnk if it's relevant, but I also added a new user to apiserver.conf today
I'll get back to you with the logs when the problem occurs again
hmmm allegroai/trains:latest whatever it is
I've already pulled new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response guys!
I assume, temporary fix is to switch to trains-server?
no, I even added the argument to specify tensorboard log_dir to make sure this is not happening
yeah, that sounds right! thanks, will try
yes. we upload artifacts to Yandex.Cloud S3 using ClearML. we set "s3://storage.yandexcloud.net/clearml-models" as the output uri parameter and add this section to the config:
{
    host: "http://storage.yandexcloud.net"
    key: "KEY"
    secret: "SECRET_KEY",
    secure: true
}
this works like a charm. but download button in UI is not working
yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
yes, this is the use case, I think we can use smth like Redis for this communication