[2020-06-09 16:03:19,851] [8] [ERROR] [trains.mongo.initialize] Failed creating fixed user John Doe: 'key'
{
  username: "username"
  password: "password"
  name: "John Doe"
},
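In case it helps, a full fixed-users section in the server's config would look roughly like this — a sketch, with the surrounding auth.fixed_users keys assumed from the standard clearml.conf layout; a field missing from a user entry would explain the 'key' (KeyError) failure above:

```
auth {
  fixed_users {
    enabled: true
    users: [
      {
        username: "username"
        password: "password"
        name: "John Doe"
      }
    ]
  }
}
```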
maybe the db somehow got corrupted or something like that? I'm clueless
nice! exactly what I need, thank you!
yeah, I am aware of trains-agent, we are planning to start using it soon, but still, copying the original training command would be useful
this would definitely be a nice addition. the number of hyperparameters in our models often goes up to 100
not necessarily, command usually stays the same irrespective of the machine
copy-pasting entire training command into command line 😃
problem is solved. I had to replace /opt/trains/data/fileserver with /opt/clearml/data/fileserver in the Agent configuration, and replace trains with clearml in Requirements
we're using the latest version of clearml, clearml agent and clearml server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess 😃
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning them up? What is the recommended way of providing S3 credentials to the cleanup task?
what if the cleanup service is launched using the ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help
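For reference, this is the shape of the S3 credentials section in clearml.conf that the cleanup task would pick up — a sketch assuming the standard sdk.aws.s3 layout; the key/secret values are placeholders:

```
sdk {
  aws {
    s3 {
      # default credentials, used when no bucket-specific entry matches
      key: "AWS_ACCESS_KEY"     # placeholder
      secret: "AWS_SECRET_KEY"  # placeholder
      region: ""
    }
  }
}
```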
oh wow, I didn't see delete_artifacts_and_models option
I guess we'll have to manually find old artifacts that are related to already deleted tasks
we already have cleanup service set up and running, so we should be good from now on
btw, are there any examples of exporting metrics using the Python client? I could only find the last_metrics attribute of the task
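For what it's worth, a minimal sketch of exporting scalar metrics with the Python client, assuming the installed clearml version provides Task.get_reported_scalars(), which returns a nested {title: {series: {"x": [...], "y": [...]}}} dict; the task id in the usage comment is a placeholder:

```python
def scalars_to_rows(scalars):
    """Flatten the nested {title: {series: {"x": [...], "y": [...]}}} dict
    into a flat list of (title, series, iteration, value) rows."""
    rows = []
    for title, series_map in scalars.items():
        for series, points in series_map.items():
            rows.extend(
                (title, series, x, y)
                for x, y in zip(points["x"], points["y"])
            )
    return rows

# Usage (requires a running ClearML setup):
# from clearml import Task
# task = Task.get_task(task_id="your-task-id")  # placeholder id
# rows = scalars_to_rows(task.get_reported_scalars())
```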
some of the "tasks.get_all_ex" POST requests fail, as far as I can see
running docker network prune before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
Error: Failed to get Scalar Charts
nope, the only changes to the config that we made are adding web-auth and the non-responsive tasks watchdog
just in case, this warning disappeared after I followed https://stackoverflow.com/questions/49638699/docker-compose-restart-connection-pool-full
btw, "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" warnings are showing up in the apiserver logs again
m5.xlarge EC2 instance (4 vCPUs, 16 GB RAM), 100GB disk