Hi All, I've Successfully Run A Task Locally, And Now I'm Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Answered

Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Environment setup completed successfully
Starting Task Execution:

Is there any way of figuring out why the remote Task hangs and how would I go about debugging it?

WebApp: 1.15.1-478 • Server: 1.15.1-478 • API: 2.29

  				
Posted one year ago by DepravedBee82


Answers 46


Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login

Oh, that makes sense. I'm assuming the certificate is installed on your local machine but not on the remote machines / containers.
Add the following to your clearml.conf:

api.verify_certificate: false
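
Before turning verification off, it can help to confirm from the worker itself that the handshake really fails on the certificate. The following is only a sketch using the Python standard library: `check_tls` is a hypothetical helper, not part of ClearML, and the host you pass in is whatever your own API server address is.

```python
import socket
import ssl


def check_tls(host, port=443, timeout=10.0, cafile=None):
    """Try a verified TLS handshake and describe the outcome as a string.

    cafile is an optional path to a CA bundle; None uses the system defaults.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # Same failure class as the CERTIFICATE_VERIFY_FAILED line quoted above.
        return f"certificate verification failed: {exc.verify_message}"
    except OSError as exc:
        # Refused connections, timeouts, DNS failures and other TLS errors.
        return f"connection failed: {exc}"
```

Running this on both the laptop and the worker (or inside the container the agent starts) should show whether the two environments trust the same issuers. Passing the company CA file as `cafile` shows whether adding it would be enough.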


  				
Posted one year ago by AgitatedDove14

Thanks Martin - will try that and see what I can find. Really appreciate your patience with this! 🙂

  				
Posted one year ago by DepravedBee82

Looking at the logs in the Kube pods now for anything that looks unusual...

  				
Posted one year ago by DepravedBee82

Hi @DepravedBee82, can you perhaps add a simple print at the start of your code before any import?
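
A plain `print` may sit in a buffer if the process hangs right afterwards, so a flushed marker is a bit more reliable for this kind of bisecting. A minimal sketch, where `checkpoint` is a hypothetical helper name chosen for illustration:

```python
import sys
import time


def checkpoint(message, stream=None):
    """Write a timestamped marker and flush immediately.

    Flushing matters here: if the process hangs right after the call,
    buffered output might never reach the agent's log.
    """
    stream = stream if stream is not None else sys.stderr
    stamp = time.strftime("%H:%M:%S")
    stream.write(f"[{stamp}] {message}\n")
    stream.flush()


# Place this at the very top of the script, before any other import.
checkpoint("before imports")
```

Repeating the call after the imports and around `Task.init` narrows down where execution stops.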

  				
Posted one year ago by SuccessfulKoala55

If there was an SSL issue it should log to console right?

ClearML is hosted on an on-prem kube cluster and to get it to log locally I needed to append my company cert to the file located at certifi.where() . Do you think the same needs to be done for the Python installation for the worker?
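
For reference, appending a company certificate to a CA bundle could be scripted roughly as below. This is a sketch under assumptions: `append_ca_cert` is a hypothetical helper, the certificate path is a placeholder, and it has to run with the same Python environment the worker uses for the Task.

```python
from pathlib import Path


def append_ca_cert(bundle_path, cert_path):
    """Append the PEM certificate at cert_path to the bundle at bundle_path.

    Returns True if the bundle was modified, False if the certificate
    text was already present.
    """
    bundle = Path(bundle_path)
    cert_pem = Path(cert_path).read_text().strip()
    existing = bundle.read_text() if bundle.exists() else ""
    if cert_pem in existing:
        return False
    with bundle.open("a") as fh:
        # Keep PEM blocks on separate lines even if the bundle lacks a final newline.
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write(cert_pem + "\n")
    return True
```

With certifi installed, a call might look like `append_ca_cert(certifi.where(), "/path/to/company-ca.pem")`. Reinstalling or upgrading certifi replaces the bundle file, so the step would need repeating in every fresh venv or container the agent creates.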

  				
Posted one year ago by DepravedBee82

Okay I have an idea, it could be a lock that another agent/user is holding on the cache folder or similar
Let me check something

  				
Posted one year ago by AgitatedDove14

Sorry, I meant on the remote machine (i.e. enqueue it and let the agent run it); the print will then show up in the log too 🙂

  				
Posted one year ago by AgitatedDove14

Hmm, I'm at a loss; I see no reason why it would get stuck.
Try removing all the auto loggers. This can be done with

Task.init(..., auto_connect_frameworks=False)

which would disconnect all the automatic loggers (Hydra etc.). Of course, this is for debugging purposes only.

  				
Posted one year ago by AgitatedDove14

confirmed that the change had been added by

Make sure you see them in the Task log in the UI (the agent prints them when it starts)

Any insight on how we can reproduce the issue?

Can this be reproducible using a simple script that we can also run?

  				
Posted one year ago by AgitatedDove14

If there was an SSL issue it should log to console right?

Correct. Also, the agent is able to report, so I'm assuming the configuration is correct.
@DepravedBee82 could you try to put the clearml import + Task.init at the top of your code?

  				
Posted one year ago by AgitatedDove14

Agent on laptop, Server on Kube - Fail

So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube

Yeah that feels like a network config issue...

Is there a verbose setting in the agent that could help us diagnose,

Yes, by running with debug turned on.
Since you managed to reproduce on your laptop, you can try to run the agent with --debug to test, specifically:

clearml-agent --debug daemon ....

If you are running it in venv mode (which I think is your setup), you can also just specify the Task ID and test that (no daemon, just execution):

clearml-agent --debug execute --id <task_id_here>
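
Since the problem is a hang rather than a crash, it may be convenient to wrap the one-off command in a timeout and keep its output in a file. A rough sketch: the command line is the one quoted above, while `build_debug_command`, `run_debug`, the log path and the timeout value are hypothetical names and defaults chosen for illustration.

```python
import subprocess


def build_debug_command(task_id):
    """Return the argv for a one-off debug run of a single Task (no daemon)."""
    return ["clearml-agent", "--debug", "execute", "--id", task_id]


def run_debug(task_id, log_path, timeout_s=600):
    """Run the command, send stdout and stderr to log_path, stop after timeout_s.

    Returns the exit code, or None if the run was killed by the timeout.
    """
    with open(log_path, "w") as log:
        try:
            done = subprocess.run(
                build_debug_command(task_id),
                stdout=log,
                stderr=subprocess.STDOUT,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so nothing lingers.
            return None
    return done.returncode
```

A `None` result together with a log that ends at "Starting Task Execution:" would match the behaviour described in this thread.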

  				
Posted one year ago by AgitatedDove14

My understanding is that on remote execution Task.init is supposed to be a no-op right?

Not really a no-op: it would sync Argparser and the like, start background reporting services, etc.

This is so odd! Literally nothing printed.
Can you tell me something about the node "mrl-plswh100:0"?
Is this like a SageMaker node? We have seen similar things where Python threads / subprocesses are not supported, and instead of Python crashing it just hangs there.

  				
Posted one year ago by AgitatedDove14

Can you add the following before the Task.init call:

import os
print(os.environ)
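
Dumping the whole environment into a shared log can expose credentials, so a filtered variant may be preferable. The sketch below is an assumption-laden illustration: the helper name, the prefixes (`CLEARML_`, legacy `TRAINS_`) and the extra variable names were picked as likely suspects for this kind of issue and are not an official list.

```python
import os

# Prefixes and names picked for illustration; extend as needed.
INTERESTING_PREFIXES = ("CLEARML_", "TRAINS_")
INTERESTING_NAMES = (
    "REQUESTS_CA_BUNDLE",
    "SSL_CERT_FILE",
    "HTTPS_PROXY",
    "HTTP_PROXY",
    "NO_PROXY",
)
SECRET_HINTS = ("SECRET", "KEY", "TOKEN", "PASSWORD")


def relevant_env(environ=None):
    """Return the selected variables, masking values whose names look secret."""
    environ = os.environ if environ is None else environ
    selected = {}
    for name, value in sorted(environ.items()):
        if name.startswith(INTERESTING_PREFIXES) or name.upper() in INTERESTING_NAMES:
            masked = any(hint in name.upper() for hint in SECRET_HINTS)
            selected[name] = "***" if masked else value
    return selected


for name, value in relevant_env().items():
    print(f"{name}={value}", flush=True)
```

Comparing this output between a local run and the agent's run shows quickly whether a certificate or proxy setting is present in one environment and missing in the other.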

  				
Posted one year ago by AgitatedDove14

Although it's still really weird how it was failing silently

Totally agree. I think the main issue was that the agent had the correct configuration, but the container / env the agent was spinning up was missing it.
I'll double check why it did not print anything.

  				
Posted one year ago by AgitatedDove14

Can this be reproducible using a simple script that we can also run?

Not really unfortunately - happy to share my code, but I've managed to reproduce this with different codebases.

As a summary of what I've tried:

Agent on the H100 machine, Server on Kube - Fail
Agent on laptop, Server on Kube - Fail
Agent on laptop, Server on Docker Desktop - Pass
So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube rather than an issue with the agents or code. As for which of the 7 containers could be at fault... 🤷 I'm not seeing anything out of the ordinary in the logs. Is there a verbose setting in the agent that could help us diagnose, i.e. each step of what goes on in Task.init?

  				
Posted one year ago by DepravedBee82

Ok so my train.py now looks like this:

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

from clearml import Task

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

And the log looks like this:

Starting Task Execution:
Before import
2024-07-19 09:06:09
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
Before task

So it looks like it's getting stuck at Task.init
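
When a call hangs without printing anything, the standard library's `faulthandler` can dump the stack of every thread after a delay, which shows where inside the call the process is waiting. A minimal sketch: `dump_if_stuck` is a hypothetical helper and the 60 second value in the usage comment is an arbitrary choice.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def dump_if_stuck(timeout_s, file=None):
    """Dump every thread's traceback if the body runs longer than timeout_s.

    The target must be a real file with a file descriptor (sys.stderr by
    default); the process itself keeps running after the dump.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, file=target, exit=False)
    try:
        yield
    finally:
        # Disarm the timer once the body has finished.
        faulthandler.cancel_dump_traceback_later()


# Usage around the suspect call:
# with dump_if_stuck(60):
#     task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
```

If the dump ends in socket or SSL frames, that points at the connection to the server rather than at Hydra or the other loggers.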

  				
Posted one year ago by DepravedBee82

Nope - confirmed to be running on the OS's Python environment,

Okay, so running as root on bare metal is definitely not recommended.
I'm not sure how/why it gets stuck though 😞
Any chance you can run the agent as non-root?
Docker mode might also be preferable, since it makes it easier for you to control the environment of the Task.

  				
Posted one year ago by AgitatedDove14

THAT WORKED! 🎉

  				
Posted one year ago by DepravedBee82

@DepravedBee82 I just realized, the agent is not running in docker mode, correct? (i.e. venv mode)
If this is the case, how come it is running as root? (Could it be that it is running inside a container? How was that container spun up?)

  				
Posted one year ago by AgitatedDove14

Yes the agent is running in venv mode afaik. As for why it’s running as root - I’ll ask our engineer …

  				
Posted one year ago by DepravedBee82

Please let me know what you find 🤞

  				
Posted one year ago by AgitatedDove14

It’s a Dell XE9680 rack server with 8xH100s which is located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML) which work fine (will check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions; 3.9 (Alma default) and 3.11 which you can see in the log above. It’s a really bizarre issue and outside of print statements I’m not really sure where to look.

You mentioned sync argparser & reporting, so I’ll try removing Hydra to rule that out, and other loggers in PL and see from there …

  				
Posted one year ago by DepravedBee82

My understanding is that on remote execution Task.init is supposed to be a no-op right?

  				
Posted one year ago by DepravedBee82

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc.

The only thing that I can think of is that something is not right with the load balancer on the server, so maybe some requests coming from an instance on the cluster are blocked...
Hmm, saying that aloud, that actually could be it! Try adding the following line to the end of the clearml.conf on the machine running the agent:

api.http.default_method: "put"

  				
Posted one year ago by AgitatedDove14

Thank you! Although it's still really weird how it was failing silently - would it be worth changing the logging level for that error somewhere?

  				
Posted one year ago by DepravedBee82

I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞

  				
Posted one year ago by DepravedBee82

Will try non-root and get back to you. I’m also trying to reproduce on a different machine too

  				
Posted one year ago by DepravedBee82

I managed to set up my (Windows) laptop as a worker and reproduce the issue. Would that suggest an issue with ClearML server?

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc. work, but if there are any obvious things we should check, let me know and I can ask our DevOps engineer

  				
Posted one year ago by DepravedBee82

Hi @AgitatedDove14, here's my code with some more prints:

from clearml import Task

print("Before Task.init")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

print("After import")

I've attached the full log (using RC2). Still getting stuck at Task.init - very weird

  				
Posted one year ago by DepravedBee82
