Answered
Hi, We Have Quite An Unusual Issue. We Run Some Agents, We Attach Them To Queue. They Are Doing The Job (They Are Doing Hyperparameter Optimization), However They Are Not Visible Either In:

Hi, we have quite an unusual issue. We run some agents and attach them to a queue. They are doing their job (hyperparameter optimization), but they are not visible in either:
  1. the UI
  2. the API, using `client = APIClient()` and `workers_list = client.workers.get_all()`

I mean that, using those two methods, those agents do not show up: fewer agents are shown than are actually running.

We start them from a script, so it looks like when agents are created by a script at (almost) the same time, some of them are not added to the system. Have you ever encountered something like this?
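For anyone trying to reproduce this, a minimal sketch of how the mismatch could be measured: list the workers the server reports and compare them with the worker IDs the launch script was expected to create. The helper names `fetch_reported_worker_ids` and `missing_workers` are hypothetical, and the sketch assumes each object returned by `client.workers.get_all()` exposes an `id` attribute.

```python
from typing import Iterable, List


def fetch_reported_worker_ids() -> List[str]:
    """Ask the ClearML server which workers it currently knows about."""
    # Imported lazily so the comparison helper below works without clearml installed.
    from clearml.backend_api.session.client import APIClient

    client = APIClient()
    # Assumption: each returned worker object has an `id` attribute.
    return [worker.id for worker in client.workers.get_all()]


def missing_workers(expected: Iterable[str], reported: Iterable[str]) -> List[str]:
    """Return expected worker IDs that the server did not report, sorted."""
    return sorted(set(expected) - set(reported))
```

Calling `missing_workers(expected_ids, fetch_reported_worker_ids())` right after starting the agents, with `expected_ids` filled in from the launch script, would show which agents never registered.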

  
  
Posted 2 years ago

Answers 14


Hi RoundMosquito25

however they are not visible either in:

But can you see them in the UI?

  
  
Posted 2 years ago

No. However, I see some of the running agents, but not all.

  
  
Posted 2 years ago

RoundMosquito25 how is that possible? Could it be they are connected to a different server?

  
  
Posted 2 years ago

No, we have one server

  
  
Posted 2 years ago

RoundMosquito25 what is the server version?

  
  
Posted 2 years ago

We are using docker compose and image: allegroai/clearml:latest (not changed, default one), we restarted the server yesterday. I'll write something more about this problem (how to replicate) soon

  
  
Posted 2 years ago

SuccessfulKoala55 So, we have two problems:
  1. Probably a minor one, but strange. We run a number of workers using the compose file attached in the .zip. We run `docker compose -f docker-compose-worker.yaml build` and then `docker compose -f docker-compose-worker.yaml up`, and in theory there should be 10 agents running, but frequently fewer than 10 are shown in the UI (for example, on the last run we got 3 of them). When we run htop, we can see 10 agents on our system. What is even stranger, the agents that are not visible in the UI are still able to take tasks in the optimization.
    We are providing the files so the problem can be replicated. Please note that we are using the newest version of Docker (it has built-in compose, so in the commands above we use docker compose instead of docker-compose).

  2. Bigger problem. We run 800 agents (some of them are visible in the UI, some are not).
    What we observe is that the optimization runs, but after around 30 minutes we get this error:

    "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 973, in _bootstrap_inner
      self.run()
    File "/opt/miniconda3/envs/clearml/lib/python3.9/threading.py", line 910, in run
      self._target(*self._args, **self._kwargs)
    File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1767, in _report_daemon
      self._report_completed_tasks_best_results(set(completed_jobs.keys()), task_logger, title, counter)
    File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1930, in _report_completed_tasks_best_results
      latest_completed, obj_values = self._get_latest_completed_task_value(completed_jobs, series_name)
    File "/opt/miniconda3/envs/clearml/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1992, in _get_latest_completed_task_value
      completed_time = datetime.strptime(response.response_data["task"]["completed"].partition("+")[0],
    File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 568, in _strptime_datetime
      tt, fraction, gmtoff_fraction = _strptime(data_string, format)
    File "/opt/miniconda3/envs/clearml/lib/python3.9/_strptime.py", line 349, in _strptime
      raise ValueError("time data %r does not match format %r" %
    ValueError: time data '2022-11-21T19:43:44' does not match format '%Y-%m-%dT%H:%M:%S.%f'

    After that, tasks are still taken from the queue, but the optimization process doesn't see them as completed. The error looks quite strange and random.
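The ValueError in the traceback above is raised because the format string requires a fractional-seconds part (`.%f`), while the timestamp `'2022-11-21T19:43:44'` has none. One plausible explanation, stated here as an assumption, is that the fractional part is omitted when it is zero, which would also explain why the failure looks random. A minimal sketch of a tolerant parser follows; `parse_completed` is a hypothetical helper name and this is not necessarily the fix adopted in ClearML.

```python
from datetime import datetime

# Formats to try in order: with fractional seconds first, then without.
_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S")


def parse_completed(raw: str) -> datetime:
    """Parse a 'completed' timestamp with or without fractional seconds."""
    # Mirror the call in the traceback: drop anything after a "+" offset.
    stamp = raw.partition("+")[0]
    for fmt in _FORMATS:
        try:
            return datetime.strptime(stamp, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {raw!r}")
```

Trying the stricter format first keeps the behavior identical for timestamps that already parse, and only adds a fallback for the whole-second case.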
  
  
Posted 2 years ago

SuccessfulKoala55 should I make an issue on Github?

  
  
Posted 2 years ago

Please do 🙂

  
  
Posted 2 years ago

🙏

  
  
Posted 2 years ago

Hi SuccessfulKoala55 . Do you have any updates, especially on #829 which is more critical for us?

  
  
Posted one year ago

Hi RoundMosquito25 , sorry for the hold up, we're looking at that

  
  
Posted one year ago

Hi SuccessfulKoala55

I commented with a temporary solution for #828:
https://github.com/allegroai/clearml/issues/828

I'll leave it up to you to decide whether it should be closed.

  
  
Posted one year ago