Answered
Hi Folks, Occasionally When I Clone A Job And Enqueue It, Instead Of Being Processed By The Expected Queue, A New Queue (With Some Id That Looks Like An Hash) Is Created Instead, And The Experiment Hangs In A "Pending" State. When This Happens, If I Abor

Hi folks, occasionally when I clone a job and enqueue it, instead of being processed by the expected queue, a new queue (with some id that looks like a hash) is created instead, and the experiment hangs in a "Pending" state.

When this happens, if I Abort the task, reset it and re-enqueue it, often things work. I couldn't properly understand when this happens, but I was wondering if any of you had the same experience?

I am using a self-hosted version of ClearML and the agents are spawned with the K8s Agent Glue helm chart.

  
  
Posted 2 years ago

Answers 31


cpu, and gpu are the names

  
  
Posted 2 years ago

the experiment is supposed to run in this queue, but then it hangs in a pending scheduler

  
  
Posted 2 years ago

Is that correct?

  
  
Posted 2 years ago

Also, since you are using the k8s glue agent, can you send the logs for the k8s glue agent at the time when you enqueue the task?

  
  
Posted 2 years ago

At this point, I see a new queue in the UI:

  
  
Posted 2 years ago

If I now abort the experiment (which is in a pending state and not running) and re-enqueue it to the CPU queue, with no parameter modifications this time, I see that it is sent to the right queue, and after a few seconds the job enters a running state and completes correctly

  
  
Posted 2 years ago

What I meant was that I think the new queue that is created is actually using the gpu (or cpu?) queue ID as its name...

  
  
Posted 2 years ago

ping me when you're back 🙂

  
  
Posted 2 years ago

Yeah, I think this gives us some investigation directions... Let me know when you're available and I'll try to think about how to debug this 🙂

  
  
Posted 2 years ago

Thanks, in DM I sent you the conf we use to deploy the agents.

  
  
Posted 2 years ago

I have tried this several times and the behaviour is always the same. It looks like when I modify some hyperparameter and then enqueue the experiment to a queue, things don't work unless I have previously set the value of k8s-queue to the name of the queue that I want to use. If I don't modify the configuration (e.g. I abort or reset the job and enqueue it again, or clone and enqueue it without modifying the hyperparameters), then everything works as expected.

  
  
Posted 2 years ago

Also, if I clone an experiment on which I had to set the k8s-queue user property manually to run it on a queue, say cpu, and enqueue it to a different queue, say gpu, the property is not updated, and the experiment is enqueued in a queue with a random hash-like name. I either have to delete the attribute or set it to the right queue name before enqueuing, to have it run in the right queue.
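The mismatch described above could be caught before enqueuing with a small check. This is only a sketch under assumptions: it takes the task's user properties as a plain dict (however you obtain them), and the function name `queue_property_conflict` is made up for illustration; it is not part of ClearML.

```python
from typing import Mapping, Optional

# User property name mentioned in this thread.
K8S_QUEUE_PROPERTY = "k8s-queue"


def queue_property_conflict(
    user_properties: Mapping[str, str], target_queue: str
) -> Optional[str]:
    """Return a warning if a leftover k8s-queue value disagrees with the target queue.

    Returns None when the property is absent or already matches, which are
    the two cases reported to work in this thread.
    """
    current = user_properties.get(K8S_QUEUE_PROPERTY)
    if current is None or current == target_queue:
        return None
    return (
        f"'{K8S_QUEUE_PROPERTY}' is set to '{current}' but the task is being "
        f"enqueued to '{target_queue}'; delete the property or set it to "
        f"'{target_queue}' first"
    )


# Hypothetical cloned task that still carries the old property value.
print(queue_property_conflict({"k8s-queue": "cpu"}, "gpu"))
print(queue_property_conflict({"k8s-queue": "cpu"}, "cpu"))
print(queue_property_conflict({}, "gpu"))
```

The first call reports the stale value; the other two print `None`.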

  
  
Posted 2 years ago

And yes these appear in the dropdown menu when I want to enqueue an experiment

  
  
Posted 2 years ago

The workaround that works for me is:
1. Clone the experiment that I ran on my laptop.
2. In the newly cloned experiment, modify the hyperparameters and configurations to my needs.
3. In user properties, set "k8s-queue" to "cpu" (or the name of the queue I want to use).
4. Enqueue the experiment to the same queue I just set...
When I do it like that, in the K8sGlue pod for the cpu queue I can see that it has been correctly picked up:
` No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
Pulling task de29bfd458d44a9491efbda06954d8ae launching on kubernetes cluster
Pushing task de29bfd458d44a9491efbda06954d8ae into temporary pending queue
Kubernetes scheduling task id=de29bfd458d44a9491efbda06954d8ae
kubectl output:

pod/clearml-id-de29bfd458d44a9491efbda06954d8ae created

No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds `
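To check a longer glue log for the same "picked up" pattern, something like the following might help. It is a rough sketch that assumes the log lines look exactly like the ones quoted above; the regular expressions and the helper name `picked_up_tasks` are illustrative, not part of clearml-agent.

```python
import re

# Patterns taken from the log lines quoted in this post.
PULL_RE = re.compile(
    r"Pulling task (?P<task>[0-9a-f]{32}) launching on kubernetes cluster"
)
POD_RE = re.compile(r"pod/clearml-id-(?P<task>[0-9a-f]{32}) created")


def picked_up_tasks(log_text: str) -> dict:
    """Map each pulled task ID to True if a pod creation line was also logged for it."""
    pulled = [m.group("task") for m in PULL_RE.finditer(log_text)]
    created = {m.group("task") for m in POD_RE.finditer(log_text)}
    return {task: task in created for task in pulled}


sample = """\
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
Pulling task de29bfd458d44a9491efbda06954d8ae launching on kubernetes cluster
Pushing task de29bfd458d44a9491efbda06954d8ae into temporary pending queue
Kubernetes scheduling task id=de29bfd458d44a9491efbda06954d8ae
kubectl output:

pod/clearml-id-de29bfd458d44a9491efbda06954d8ae created
"""

print(picked_up_tasks(sample))
```

A task that was pulled but maps to `False` would be one to look at more closely.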

  
  
Posted 2 years ago

The question is who is creating this queue 🙂 - just making sure again, the queue with the strange name is created after you enqueued a task to some existing queue using the UI?

  
  
Posted 2 years ago

Ah sorry, I thought you were asking what the names of the queues I created were like (in case I used some weird character or stuff like that)

  
  
Posted 2 years ago

image

  
  
Posted 2 years ago

the queues already exist, I created them through the UI.

  
  
Posted 2 years ago

Now I go to the experiments and clone an experiment that I previously executed on my laptop. In the newly created experiment, I modify some parameter, and enqueue the experiment in the CPU queue.

  
  
Posted 2 years ago

Exactly that :) if I go in the queue tab, I see a new queue name (that I didn't create),
with a name like "4gh637aqetc"

  
  
Posted 2 years ago

The first thing would be to monitor the apiserver service log and see the requests the server processes - we should identify the call to create the queue and surrounding calls might offer insight as to who requested it 🙂
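One way to look at the "surrounding calls" is to print a few lines of context around each queue-creation request. The sketch below assumes the apiserver log can be read as plain text lines and that the endpoint name (`queues.create` here) appears in each request line; the actual log format is not shown in this thread, so the sample lines are invented for illustration.

```python
from typing import List


def context_around(
    lines: List[str], needle: str, before: int = 2, after: int = 2
) -> List[List[str]]:
    """Return a window of log lines around every line that contains `needle`."""
    windows = []
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            windows.append(lines[start : i + after + 1])
    return windows


# Invented sample lines; a real apiserver log will look different.
sample_log = [
    "call tasks.clone",
    "call tasks.edit",
    "call tasks.enqueue",
    "call queues.create",
    "call queues.add_task",
]

for window in context_around(sample_log, "queues.create", before=1, after=1):
    print("\n".join(window))
```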

  
  
Posted 2 years ago

OK, so what did you mean by:

instead of being processed by the expected queue, a new queue (with some id that looks like an hash) is created instead

Because when I asked:

What queue names are created in this scenario?

You said:

cpu, and gpu are the names

🙂

  
  
Posted 2 years ago

Several ideas:
1. Does the name of the queue correspond to any task ID, or to the ID of the queue you enqueued the task to?
2. Can I see the initial log dump of the k8s glue? It's important to understand how the k8s glue is configured - you can send it to me in a DM 🙂

  
  
Posted 2 years ago

Hi SuccessfulKoala55, I can confirm that the name of the "id-like" queue created by ClearML actually corresponds to the id of the "k8s_scheduler" queue. So it looks like, instead of the experiment being submitted to the scheduler to be enqueued to the right queue, a new queue whose name is the id of k8s_scheduler is created.

Hope this helps 🙂
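The comparison I did by hand can be sketched in a few lines. This assumes the queue list is available as a list of dicts with `id` and `name` keys (however it is fetched); the IDs below are made-up placeholders, not the real ones from my server.

```python
from typing import Dict, List, Tuple


def queues_named_after_ids(queues: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Find queues whose *name* equals the *id* of another queue.

    Returns (stray queue name, name of the queue that owns that id) pairs.
    """
    id_to_name = {q["id"]: q["name"] for q in queues}
    return [
        (q["name"], id_to_name[q["name"]])
        for q in queues
        if q["name"] in id_to_name
    ]


# Placeholder data for illustration only.
queues = [
    {"id": "id-cpu", "name": "cpu"},
    {"id": "id-gpu", "name": "gpu"},
    {"id": "id-sched", "name": "k8s_scheduler"},
    {"id": "id-stray", "name": "id-sched"},  # name is another queue's id
]

print(queues_named_after_ids(queues))
```

With the placeholder data this reports that the stray queue is named after the id of `k8s_scheduler`, which is the situation described above.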

  
  
Posted 2 years ago

Yes, the queue is created when I enqueue the experiment. I took some screenshots and got the logs (there is indeed an error).
Let me share them with you...

  
  
Posted 2 years ago

If I now reset the experiment and enqueue it to the gpu queue (but in the experiment, the user-properties configuration for k8s-glue is still set to cpu), the experiment is left in a Pending state... and in the K8sGlue Agent for the gpu queue, I can see an error similar to the one in the cpu agent....

` No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.6/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1085, in _send_output
self.send(chunk)
File "/usr/lib/python3.6/http/client.py", line 1006, in send
self.sock.sendall(data)
File "/usr/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/usr/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/usr/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.6/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1085, in _send_output
self.send(chunk)
File "/usr/lib/python3.6/http/client.py", line 1006, in send
self.sock.sendall(data)
File "/usr/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/usr/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/usr/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/worker.py", line 1449, in daemon
gpu_queues=dynamic_gpus,
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/glue/k8s.py", line 774, in run_tasks_loop
level="INFO",
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 99, in send_log_events
return self.send_events(list_events=log_events, session=session)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 58, in send_events
sent_events += send_packet(lines)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 32, in send_packet
'add_batch', data=jsonlines, headers={'Content-type': 'application/json-lines'}, session=session
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/base.py", line 127, in post
return session.post(service=self.service, action=endpoint, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/session.py", line 300, in post
json=json or kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/session.py", line 308, in _manual_request
json=json or kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/backend_api/session/session.py", line 363, in send_request
json=json,
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/backend_api/session/session.py", line 291, in _send_request
method, url, headers=headers, auth=auth, data=data, json=json, timeout=timeout)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7 `

  
  
Posted 2 years ago

And these are used where? When you enqueue? Are they in the list when you enqueue? I would assume not since this would mean they have previously existed

  
  
Posted 2 years ago

Before enqueueing any experiment, these are the queues I have available

  
  
Posted 2 years ago

and in the logs of the K8s Glue I see an exception occurred:

` No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.6/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1085, in _send_output
self.send(chunk)
File "/usr/lib/python3.6/http/client.py", line 1006, in send
self.sock.sendall(data)
File "/usr/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/usr/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/usr/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.6/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1085, in _send_output
self.send(chunk)
File "/usr/lib/python3.6/http/client.py", line 1006, in send
self.sock.sendall(data)
File "/usr/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/usr/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/usr/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/worker.py", line 1449, in daemon
gpu_queues=dynamic_gpus,
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/glue/k8s.py", line 774, in run_tasks_loop
level="INFO",
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 99, in send_log_events
return self.send_events(list_events=log_events, session=session)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 58, in send_events
sent_events += send_packet(lines)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 32, in send_packet
'add_batch', data=jsonlines, headers={'Content-type': 'application/json-lines'}, session=session
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/base.py", line 127, in post
return session.post(service=self.service, action=endpoint, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/session.py", line 300, in post
json=json or kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/session.py", line 308, in _manual_request
json=json or kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/backend_api/session/session.py", line 363, in send_request
json=json,
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/backend_api/session/session.py", line 291, in _send_request
method, url, headers=headers, auth=auth, data=data, json=json, timeout=timeout)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) `

  
  
Posted 2 years ago

1. No, there's no task with a name of cpu or gpu... Where can I find the id of the queue to check?
2. What do you mean by initial log dumps, the very early rows when it's being deployed?

Anyway, sure I can send it to you, but I just turned off my laptop :) and won't be able for a few days.

  
  
Posted 2 years ago