perhaps it’s happening because it’s an old project that was moved to the new root project?
hard to say, maybe just “related experiments” in experiment info would be enough. I’ll think about it
tags are somewhat fine for this, I guess, but there will be too many of them eventually, and they do not reflect the sequential nature of the experiments
running "docker network prune" before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
I added the link just in case anyway 😃
also, is there any way to install a repo that we clone as a package? we often use absolute imports and do "pip install -e ." to make use of it
sorry for asking so many questions, we just really want to migrate to trains-agent 🙂
btw, are there any examples of exporting metrics using the Python client? I could only find the last_metrics attribute of the task
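something along these lines is what I had in mind, just a rough sketch (I'm assuming the Task object exposes something like get_reported_scalars(), not sure about the exact name):
```python
from trains import Task  # `from clearml import Task` on newer versions

# grab a finished experiment by id (placeholder id, replace with a real one)
task = Task.get_task(task_id="<your_task_id>")

# assumption: reported scalars come back as {title: {series: {"x": [...], "y": [...]}}}
scalars = task.get_reported_scalars()
for title, series_dict in scalars.items():
    for series, points in series_dict.items():
        print(title, series, "last value:", points["y"][-1])
```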
btw, I use Docker for training, which means that the log_dir contents are removed when the experiment is continued
LOL
wow 😃
I was trying to find out how to create a queue using the CLI 😃
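in the meantime I thought about just doing it from Python through the API client, something like this (rough sketch, assuming the auto-generated APIClient exposes a queues.create call):
```python
from trains.backend_api.session.client import APIClient

# rough sketch: create a queue programmatically instead of via the CLI
client = APIClient()
client.queues.create(name="my_new_queue")  # "my_new_queue" is just a placeholder name
```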
yeah, that sounds right! thanks, will try
some of the "tasks.get_all_ex" POST requests fail, as far as I can see
nope, that's the point, quite often we run experiments separately, but they are related to each other. currently there's no way to see that one experiment is using a checkpoint from a previous experiment, since we have to manually insert the S3 link as a hyperparameter. it would be useful to see these connections. maybe instead of grouping we could see which experiments are using this experiment's artifacts
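for context, this is roughly what we do today vs. what we'd like to do (just a sketch, task ids and parameter/artifact names are made up):
```python
from trains import Task

# today: the S3 checkpoint link of the previous run is pasted in by hand
task = Task.init(project_name="my_project", task_name="finetune_run")
params = {"pretrained_checkpoint": "s3://bucket/path/to/checkpoint.pt"}  # inserted manually
task.connect(params)

# what we'd like: reference the previous experiment directly, e.g. by task id,
# so the connection between the two experiments is visible in the UI
parent = Task.get_task(task_id="<previous_task_id>")  # placeholder id
checkpoint_url = parent.artifacts["checkpoint"].url  # assumes the checkpoint was uploaded as an artifact
```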
yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems 😃 slack bot works though! 🎉
this definitely would be a nice addition. the number of hyperparameters in our models often goes up to 100
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
I change the arguments in the Web UI, but it looks like they are not parsed by trains
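here's a simplified sketch of our entry point; I assumed the argparse arguments are auto-captured (and overridden from the UI) as long as Task.init is called, so maybe I'm missing something:
```python
import argparse
from trains import Task

# simplified sketch: Task.init before parse_args so trains can pick up the arguments
task = Task.init(project_name="my_project", task_name="training")

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--batch_size", type=int, default=32)
args = parser.parse_args()

print(args.lr, args.batch_size)
```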
btw, "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" warnings are showing up in the apiserver logs again
we’re using the latest ClearML server and client versions (1.2.0)
Requirement already satisfied (use --upgrade to upgrade): celsusutils==0.0.1
I'm not sure it's related to the domain switch, since we upgraded to the newest ClearML server version at the same time
okay, so if there’s no workaround atm, should I create a GitHub issue?