Hi, We Have Quite An Unusual Issue. We Run Some Agents, We Attach Them To Queue. They Are Doing The Job (They Are Doing Hyperparameter Optimization), However They Are Not Visible Either In:

Answered

Hi, we have quite an unusual issue. We run some agents and attach them to a queue. They are doing their job (hyperparameter optimization), however, they are not visible in either of these:
1. The UI
2. The API, using `client = APIClient()` and `workers_list = client.workers.get_all()`

I mean that, using those two methods, not all agents show up: fewer agents are listed than are actually running.

We start them from a script, so it looks like agents created by a script are not registered in the system when they are created at (almost) the same time. Have you ever encountered something like this?
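To make the gap between running and registered agents concrete, here is a minimal sketch that compares the worker IDs you expect with the ones the server reports. `APIClient` and `client.workers.get_all()` come from the snippet above; the helper names `missing_workers` and `fetch_registered_worker_ids` are hypothetical, and reading each worker's ID from an `.id` attribute is an assumption about the returned objects.

```python
from typing import Iterable, List


def missing_workers(expected_ids: Iterable[str], registered_ids: Iterable[str]) -> List[str]:
    """Return the expected worker IDs that the server did not report, sorted."""
    return sorted(set(expected_ids) - set(registered_ids))


def fetch_registered_worker_ids() -> List[str]:
    """Ask the ClearML server which workers it currently knows about."""
    # Imported lazily so the comparison helper can be used without clearml installed.
    from clearml.backend_api.session.client import APIClient

    client = APIClient()
    # Assumption: each returned worker entry exposes its ID as `.id`.
    return [worker.id for worker in client.workers.get_all()]
```

With a configured ClearML client, calling `missing_workers(expected, fetch_registered_worker_ids())` with your own list of expected IDs would show which agents the server does not list.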

  				
Posted 2 years ago by RoundMosquito25


Answers 14

No, we have one server

  				
Posted 2 years ago by RoundMosquito25

RoundMosquito25 what is the server version?

  				
Posted 2 years ago by SuccessfulKoala55

SuccessfulKoala55 So, we have two problems:
1. Probably a minor one, but strange. We run a number of workers using the compose file attached in the .zip. We run `docker compose -f docker-compose-worker.yaml build` followed by `docker compose -f docker-compose-worker.yaml up`, and in theory there should be 10 agents running, but frequently fewer than 10 are shown in the UI (for example, on the last run we got 3 of them). When we run `htop`, we can see 10 agents on our system. What is even stranger, the agents that are not visible in the UI are still able to take tasks in the optimization.
We are providing the files to make it possible to replicate the problem. Please note that we are using the newest version of Docker (it has Compose built in, so in the commands above we use `docker compose` instead of `docker-compose`).

2. The bigger problem. We run 800 agents (some of them are visible in the UI, some are not).
What we observe is that the optimization runs, but after around 30 minutes we get this error:
  File "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1767, in _report_daemon
    self._report_completed_tasks_best_results(set(completed_jobs.keys()), task_logger, title, counter)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1930, in _report_completed_tasks_best_results
    latest_completed, obj_values = self._get_latest_completed_task_value(completed_jobs, series_name)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1992, in _get_latest_completed_task_value
    completed_time = datetime.strptime(response.response_data["task"]["completed"].partition("+")[0],
  File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '2022-11-21T19:43:44' does not match format '%Y-%m-%dT%H:%M:%S.%f'
After that, tasks are still taken from the queue, but the optimization process doesn't see them as completed. The error above looks quite strange and random.
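The error message itself shows what went wrong: the timestamp `'2022-11-21T19:43:44'` has no fractional seconds, while the format string requires `.%f`. One possible explanation, which is an assumption and not confirmed in this thread, is that the fraction is omitted when it happens to be exactly zero, which would also explain why the failure looks random. A minimal sketch of a tolerant parser follows; `parse_completed_time` is a hypothetical helper, not the library's code or its actual fix.

```python
from datetime import datetime

# Formats to try in order: with fractional seconds first, then without.
_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S")


def parse_completed_time(raw: str) -> datetime:
    """Parse a completion timestamp with or without fractional seconds."""
    # Drop a trailing "+HH:MM" offset, mirroring the partition("+") call in the traceback.
    stamp = raw.partition("+")[0]
    for fmt in _FORMATS:
        try:
            return datetime.strptime(stamp, fmt)
        except ValueError:
            continue
    raise ValueError(f"time data {stamp!r} does not match any of {_FORMATS!r}")
```

Trying the stricter format first keeps microsecond precision when it is present and only falls back to whole seconds when the fraction is missing.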

  				
Posted 2 years ago by RoundMosquito25

Hi RoundMosquito25

"however they are not visible either in:"

But can you see them in the UI?

  				
Posted 2 years ago by AgitatedDove14

Hi RoundMosquito25, sorry for the holdup, we're looking into that.

  				
Posted 2 years ago by SuccessfulKoala55

Hi SuccessfulKoala55. Do you have any updates, especially on #829, which is more critical for us?

  				
Posted 2 years ago by RoundMosquito25

Hi SuccessfulKoala55

I commented with a temporary solution for #828:
https://github.com/allegroai/clearml/issues/828

I'll leave it up to you to decide whether it should be closed.

  				
Posted 2 years ago by RoundMosquito25

🙏

  				
Posted 2 years ago by SuccessfulKoala55

Done:
https://github.com/allegroai/clearml/issues/828
https://github.com/allegroai/clearml/issues/829

  				
Posted 2 years ago by RoundMosquito25

No. However, I see some of the running agents, but not all of them.

  				
Posted 2 years ago by RoundMosquito25

We are using Docker Compose with the image allegroai/clearml:latest (the default one, unchanged), and we restarted the server yesterday. I'll write more about this problem (how to replicate it) soon.

  				
Posted 2 years ago by RoundMosquito25

SuccessfulKoala55 should I open an issue on GitHub?

  				
Posted 2 years ago by RoundMosquito25

Please do 🙂

  				
Posted 2 years ago by SuccessfulKoala55

RoundMosquito25 how is that possible? Could it be that they are connected to a different server?

  				
Posted 2 years ago by AgitatedDove14
