SuccessfulKoala55 So, we have two problems:
- Probably a minor one, but strange. We run a number of workers using the compose file that is attached in the .zip. We run:
docker compose -f docker-compose-worker.yaml build
docker compose -f docker-compose-worker.yaml up
and in theory there should be 10 agents running, but frequently fewer than 10 are shown in the UI (for example, on the last run we got 3 of them). When we run htop, we can see 10 agents in our system. Even stranger, the agents that are not visible in the UI are still able to take tasks in the optimization.
We are attaching the files so that the problem can be replicated. Please note that we are using the newest version of Docker (it has Compose built in, so in the commands above we use docker compose instead of docker-compose).
- Bigger problem. We run 800 agents (some of them are visible in the UI, some are not).
What we observe is this: the optimization runs, but after around 30 minutes we get the following error:
  File "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1767, in _report_daemon
    self._report_completed_tasks_best_results(set(completed_jobs.keys()), task_logger, title, counter)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1930, in _report_completed_tasks_best_results
    latest_completed, obj_values = self._get_latest_completed_task_value(completed_jobs, series_name)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1992, in _get_latest_completed_task_value
    completed_time = datetime.strptime(response.response_data["task"]["completed"].partition("+")[0],
  File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '2022-11-21T19:43:44' does not match format '%Y-%m-%dT%H:%M:%S.%f'
After that, tasks are still taken from the queue, but the optimization process doesn't see them as completed. The error above looks quite strange and random.
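The ValueError itself can be reproduced outside ClearML: the timestamp '2022-11-21T19:43:44' has no fractional seconds, while the format '%Y-%m-%dT%H:%M:%S.%f' requires them. Below is a minimal sketch of a tolerant parser, assuming the "completed" field sometimes arrives without the fractional part. One possible cause, which is only a guess and not confirmed from the server code, is a serializer that behaves like Python's datetime.isoformat(), which omits the microseconds when they are exactly zero; that would also explain why the error looks random. The name parse_completed_time is a hypothetical helper for illustration, not part of clearml.

```python
from datetime import datetime

# Format taken from the traceback; it requires fractional seconds.
STRICT_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"
# Assumed fallback for timestamps that carry no fractional part.
NO_FRACTION_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_completed_time(raw: str) -> datetime:
    """Parse a timestamp with or without fractional seconds.

    Hypothetical helper, not part of clearml.
    """
    # Same pre-processing as in the traceback: drop anything after a "+".
    text = raw.partition("+")[0]
    try:
        return datetime.strptime(text, STRICT_FORMAT)
    except ValueError:
        # The strict format failed, so retry without the ".%f" part.
        return datetime.strptime(text, NO_FRACTION_FORMAT)


if __name__ == "__main__":
    # The exact value from the error message parses with the fallback.
    print(parse_completed_time("2022-11-21T19:43:44"))
    # A value with microseconds and an offset parses with the strict format.
    print(parse_completed_time("2022-11-21T19:43:44.123456+00:00"))
    # isoformat() drops the fractional part when microseconds are zero.
    print(datetime(2022, 11, 21, 19, 43, 44).isoformat())
```

If this assumption holds, the failure would only hit tasks whose completion time has a zero fractional part, which would fit the error appearing only after a large number of tasks have completed.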