nope, the only changes to the config that we made are adding web-auth and the non-responsive tasks watchdog
just in case, this warning disappeared after I followed https://stackoverflow.com/questions/49638699/docker-compose-restart-connection-pool-full
I'll get back to you with the logs when the problem occurs again
perhaps it’s happening because it’s an old project that was moved to the new root project?
hard to say, maybe just “related experiments” in experiment info would be enough. I’ll think about it
as a side note, I am not able to pull the newest release; it looks like it hasn't been pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
tags are somewhat fine for this, I guess, but there will be too many of them eventually, and they do not reflect the sequential nature of the experiments
yes. we upload artifacts to Yandex.Cloud S3 using ClearML. we set "s3://storage.yandexcloud.net/clearml-models" as the output_uri parameter and add this section to the config: {host: "http://storage.yandexcloud.net", key: "KEY", secret: "SECRET_KEY", secure: true}
this works like a charm. but the download button in the UI is not working
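for context, this is roughly how we point the task at that bucket from code (just a sketch; the project/task names are placeholders, and I'm assuming the standard output_uri parameter of Task.init):
```python
from clearml import Task  # `from trains import Task` on older SDK versions

# send all model/artifact uploads to the Yandex.Cloud S3 bucket;
# credentials for storage.yandexcloud.net come from the config section above
task = Task.init(
    project_name="my-project",   # placeholder
    task_name="train",           # placeholder
    output_uri="s3://storage.yandexcloud.net/clearml-models",
)
```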
running `docker network prune` before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
I added the link just in case anyway 😃
also, is there any way to install a repo that we clone as a package? we often use absolute imports and run "pip install -e ." to make that work
sorry for so many questions, we just really want to migrate to trains-agent 🙂
btw, are there any examples of exporting metrics using the Python client? I could only find the last_metrics attribute of the task
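this is the kind of thing I was looking for (just a sketch; the task id is a placeholder, and I'm assuming get_reported_scalars() / get_last_scalar_metrics() exist in the SDK version in use, they might not be in older trains releases):
```python
from clearml import Task

# fetch an existing experiment by id (placeholder id)
task = Task.get_task(task_id="<task-id>")

# full scalar history as {title: {series: {"x": [...], "y": [...]}}}
scalars = task.get_reported_scalars()

# last reported value per metric, similar to the raw last_metrics attribute
last = task.get_last_scalar_metrics()
print(list(scalars.keys()), last)
```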
btw, I use Docker for training, which means that the log_dir contents are removed for the continued experiment
LOL
wow 😃
I was trying to find how to create a queue using CLI 😃
yeah, that sounds right! thanks, will try
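in case it helps anyone else, this is roughly what I'm going to try: creating the queue from Python instead of the CLI (just a sketch, assuming the APIClient wrapper exposes the queues.create endpoint; the queue name is a placeholder):
```python
from clearml.backend_api.session.client import APIClient  # trains.backend_api... on older versions

client = APIClient()
# assumes APIClient exposes the queues.create endpoint of the REST API
client.queues.create(name="my-new-queue")  # placeholder queue name
```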
some of the "tasks.get_all_ex" POST requests fail, as far as I can see
nope, that's the point: quite often we run experiments separately, but they are related to each other. currently there's no way to see that one experiment is using a checkpoint from a previous experiment, since we have to manually insert the S3 link as a hyperparameter. it would be useful to see these connections. maybe instead of grouping we could see which experiments are using this experiment's artifacts
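to illustrate the workaround I mean (just a sketch; the parameter name, project/task names and the S3 path are made up):
```python
from clearml import Task

task = Task.init(project_name="my-project", task_name="finetune")  # placeholders

# current workaround: record the parent experiment's checkpoint as a hyperparameter
# so the S3 link at least shows up in the UI; there is no real lineage between the tasks
params = {"init_checkpoint": "s3://storage.yandexcloud.net/clearml-models/prev-exp/model.ckpt"}  # made-up path
task.connect(params)
```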
yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems 😃 the slack bot works though! 🎉
this would definitely be a nice addition. the number of hyperparameters in our models often goes up to 100
I assume the temporary fix is to switch to trains-server?
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?