SuccessfulKoala55 should I make an issue on Github?
RoundMosquito25 How is that possible? Could it be they are connected to a different server?
RoundMosquito25 what is the server version?
Hi SuccessfulKoala55
I commented about a temporary solution for #828
https://github.com/allegroai/clearml/issues/828
I'll leave it up to you to decide whether it should be closed
Hi RoundMosquito25 , sorry for the hold up, we're looking at that
No. However, I see some of the running agents, but not all
We are using docker compose with the image allegroai/clearml:latest
(unchanged, the default one), and we restarted the server yesterday. I'll write up more about this problem (how to replicate it) soon
Hi RoundMosquito25
However, they are not visible in either:
But can you see them in the UI?
Hi SuccessfulKoala55. Do you have any updates, especially on #829, which is more critical for us?
SuccessfulKoala55 So, we have two problems:
- Probably a minor one, but strange. We run a number of workers using the compose file attached in the .zip. We can do:
docker compose -f docker-compose-worker.yaml build
docker compose -f docker-compose-worker.yaml up
and in theory there should be 10 agents running, but frequently not all 10 are shown in the UI (for example, on the last run only 3 of them appeared). When we run htop, we can see 10 agents on our system. What is even stranger, the agents that are not visible in the UI are still able to take tasks during optimization (see the worker-listing sketch after this list).
We have attached the files needed to replicate the problem. Please note that we are using the newest version of Docker (it has built-in compose, so in the commands above we run docker compose instead of docker-compose).
- The bigger problem. We run 800 agents (some of them visible in the UI, some not).
What we observe is: the optimization runs, but after around 30 minutes we get this error:

  File "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1767, in _report_daemon
    self._report_completed_tasks_best_results(set(completed_jobs.keys()), task_logger, title, counter)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1930, in _report_completed_tasks_best_results
    latest_completed, obj_values = self._get_latest_completed_task_value(completed_jobs, series_name)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1992, in _get_latest_completed_task_value
    completed_time = datetime.strptime(response.response_data["task"]["completed"].partition("+")[0],
  File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '2022-11-21T19:43:44' does not match format '%Y-%m-%dT%H:%M:%S.%f'

After that, tasks are still taken from the queue, but the optimization process doesn't see them as completed. The above error looks quite strange and random (see the parsing sketch after this list).
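As a side note, here is a minimal sketch (our own illustration, not part of the attached files) of how the workers registered on the server could be listed and compared against the 10 agent processes visible in htop. It assumes the clearml package is installed and that the local clearml.conf points at the same server the agents report to; client.workers.get_all() is the generated client call for the workers.get_all REST endpoint, so treat the exact call and fields as an assumption about the installed version.

from clearml.backend_api.session.client import APIClient

# Uses the credentials/server from the local clearml.conf
# (assumption: the same server the agents report to).
client = APIClient()

# workers.get_all should return every worker the server currently considers
# registered; the id field is the name shown in the Workers & Queues page.
workers = client.workers.get_all()
print(f"Server reports {len(workers)} registered workers:")
for worker in workers:
    print(" -", worker.id)

If this count matches what htop shows but the UI still displays fewer workers, that would point at a reporting/UI issue rather than the agents themselves.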
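And a minimal sketch reproducing the ValueError from the traceback above, just to illustrate why it looks random: for some tasks the completed timestamp apparently comes back without a fractional-seconds part, and the fixed '%Y-%m-%dT%H:%M:%S.%f' format only accepts timestamps that have one. The fromisoformat parse at the end is only an illustration of a more tolerant alternative, not a patch for the clearml code.

from datetime import datetime

raw = "2022-11-21T19:43:44"  # value copied from the traceback above

# This is effectively what optimization.py line 1992 does, and it fails
# whenever the timestamp has no microseconds part.
try:
    datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f")
except ValueError as err:
    print("strptime fails:", err)

# fromisoformat accepts the timestamp with or without microseconds.
print("parsed:", datetime.fromisoformat(raw))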