Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Answered

Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Environment setup completed successfully
Starting Task Execution:

Is there any way of figuring out why the remote Task hangs and how would I go about debugging it?

WebApp: 1.15.1-478 • Server: 1.15.1-478 • API: 2.29

  				
Posted 4 months ago by DepravedBee82


Answers 46

Okay I have an idea, it could be a lock that another agent/user is holding on the cache folder or similar
Let me check something

  				
Posted 4 months ago by AgitatedDove14

I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞

  				
Posted 4 months ago by DepravedBee82

Hmm, I'm at a loss, I see no reason why it would get stuck.
Try removing all the auto loggers, which can be done with

Task.init(..., auto_connect_frameworks=False)

which would disconnect all the automatic loggers (Hydra, etc.). Of course, this is for debugging purposes only.
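
If switching everything off changes the behaviour, the next step could be to disable one integration at a time. This is only a sketch: it assumes `auto_connect_frameworks` also accepts a dictionary of per-framework flags, as the ClearML documentation describes, and that the keys listed in `KNOWN_FRAMEWORKS` are valid for the installed version. The helper names are made up for illustration.

```python
# Assumed subset of framework keys; check the Task.init docs for your clearml version.
KNOWN_FRAMEWORKS = ("hydra", "pytorch", "tensorboard", "matplotlib")


def frameworks_with_one_disabled(name, known=KNOWN_FRAMEWORKS):
    """Return a per-framework flag dict in which only `name` is switched off."""
    if name not in known:
        raise ValueError(f"unknown framework key: {name!r}")
    return {framework: framework != name for framework in known}


def init_task(project_name, task_name, disabled_framework):
    """Call Task.init with a single automatic integration disabled."""
    # Imported lazily so the helper above can be used without clearml installed.
    from clearml import Task

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        auto_connect_frameworks=frameworks_with_one_disabled(disabled_framework),
    )
```

For example, `init_task("ClearML Testing", "FMNIST", "hydra")` would keep every listed integration except Hydra.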

  				
Posted 4 months ago by AgitatedDove14

It’s a Dell XE9680 rack server with 8xH100s which is located in our office, running AlmaLinux. We have successfully run training jobs on it inside Docker (without ClearML) which work fine (will check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions: 3.9 (Alma default) and 3.11, which you can see in the log above. It’s a really bizarre issue and outside of print statements I’m not really sure where to look.

You mentioned sync argparser & reporting, so I’ll try removing Hydra to rule that out, and other loggers in PL and see from there …
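
Beyond print statements, one option for locating a hang is the standard-library `faulthandler` module, which can dump the traceback of every thread if the process is still running after a timeout. A minimal sketch, assuming the hang is inside the Python process; the function names and the 60-second default are arbitrary choices, not something from this thread:

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=60.0, file=sys.stderr):
    """Dump all thread tracebacks to `file` if still running after `timeout_s` seconds."""
    # exit=False keeps the process alive after the dump so it can be inspected further.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)


def disarm_hang_watchdog():
    """Cancel the pending dump once the suspicious call has returned."""
    faulthandler.cancel_dump_traceback_later()


# Intended use around the call that appears to hang:
#   arm_hang_watchdog(60)
#   task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
#   disarm_hang_watchdog()
```

If the call does hang, the dump written to the task's console log should show which frame each thread is blocked in.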

  				
Posted 4 months ago by DepravedBee82

My understanding is that on remote execution Task.init is supposed to be a no-op right?

Not really a no-op: it would sync the argparser and the like, start background reporting services, etc.

This is so odd! Literally nothing is printed.
Can you tell me something about the node "mrl-plswh100:0"?
Is this something like a SageMaker node? We have seen similar cases where Python threads / subprocesses are not supported, and instead of Python crashing it just hangs there.
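
A quick way to rule out the threads / subprocesses theory is to run a tiny standalone check on the node with the same Python interpreter the agent uses. This is a rough sketch using only the standard library; the function names and timeouts are illustrative, and passing it does not prove that ClearML's own background services will start.

```python
import subprocess
import sys
import threading


def threads_work(timeout_s=5.0):
    """Return True if a background thread can start and signal back in time."""
    done = threading.Event()
    try:
        threading.Thread(target=done.set, daemon=True).start()
    except RuntimeError:
        return False
    return done.wait(timeout_s)


def subprocesses_work(timeout_s=30.0):
    """Return True if a child Python process can be spawned and returns its output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", "print('ok')"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0 and result.stdout.strip() == "ok"


if __name__ == "__main__":
    print("threads:", threads_work(), flush=True)
    print("subprocesses:", subprocesses_work(), flush=True)
```

Both checks use timeouts so that an unsupported feature shows up as `False` rather than as another silent hang.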

  				
Posted 4 months ago by AgitatedDove14

My understanding is that on remote execution Task.init is supposed to be a no-op right?

  				
Posted 4 months ago by DepravedBee82

Hi @AgitatedDove14, here's my code with some more prints:

from clearml import Task

print("Before Task.init")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

print("After import")

I've attached the full log (using RC2). Still getting stuck at Task.init - very weird

  				
Posted 4 months ago by DepravedBee82

This is so odd.
Could you add prints right after the Task.init?
Also, could you verify it still gets stuck with the latest RC:

clearml==1.16.3rc2
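
When adding such prints, it can help to timestamp and flush them so that a line is not held back in an output buffer when the process stalls. A small sketch; the `checkpoint` helper is hypothetical, and output buffering is only a possible complication here, not something diagnosed in this thread:

```python
import sys
import time


def checkpoint(label, stream=None):
    """Print a timestamped marker and flush it immediately; return the printed line."""
    stream = stream if stream is not None else sys.stdout
    line = f"[{time.strftime('%H:%M:%S')}] {label}"
    print(line, file=stream, flush=True)
    return line


# Intended use:
#   checkpoint("before Task.init")
#   task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
#   checkpoint("after Task.init")
```

The timestamps also show how long each step took if the call eventually returns.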

  				
Posted 4 months ago by AgitatedDove14

Hi @AgitatedDove14, I reordered the imports:

from clearml import Task

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks


for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

And here's the output:

Environment setup completed successfully
Starting Task Execution:
Before task

Still looks like it's getting stuck at Task.init

  				
Posted 4 months ago by DepravedBee82

If there was an SSL issue it should log to console right?

Correct. Also, the agent is able to report, so I'm assuming the configuration is correct.
@DepravedBee82 could you try to put the clearml import + Task.init at the top of your code?

  				
Posted 4 months ago by AgitatedDove14

If there was an SSL issue it should log to console right?

ClearML is hosted on an on-prem kube cluster, and to get it to log locally I needed to append my company cert to the file located at certifi.where(). Do you think the same needs to be done for the Python installation for the worker?
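
For reference, the append step described above could be scripted roughly as follows. This is a sketch under assumptions: the certificate is a PEM file, the function name and file names are placeholders, and it says nothing about whether the worker's environment actually needs it.

```python
from pathlib import Path


def append_cert_to_bundle(cert_path, bundle_path):
    """Append a PEM certificate to a CA bundle unless it is already present.

    Returns True if the bundle was modified, False if the certificate was already there.
    """
    cert = Path(cert_path).read_text().strip()
    bundle = Path(bundle_path)
    existing = bundle.read_text()
    if cert in existing:
        return False
    # Make sure the new certificate starts on its own line.
    separator = "" if existing.endswith("\n") else "\n"
    with bundle.open("a") as handle:
        handle.write(f"{separator}{cert}\n")
    return True
```

It would be called as `append_cert_to_bundle("company-ca.pem", certifi.where())` from the interpreter whose bundle should be updated, where `company-ca.pem` is a placeholder path. Since the agent builds an environment for each Task, that environment may carry its own copy of certifi, so the change might have to be applied there as well.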

  				
Posted 4 months ago by DepravedBee82

Ok so my train.py now looks like this:

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

from clearml import Task

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

And the log looks like this:

Starting Task Execution:
Before import
2024-07-19 09:06:09
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
Before task

So it looks like it's getting stuck at Task.init

  				
Posted 4 months ago by DepravedBee82

Hi @DepravedBee82, can you perhaps add a simple print at the start of your code before any import?

  				
Posted 4 months ago by SuccessfulKoala55

Thanks for the response @AgitatedDove14! The code is a small FMNIST test training job written in PyTorch Lightning. On my local machine (laptop GPU, Windows) it completes in ~5 min. On the server (Linux, H100s) it just hangs at "Starting Task Execution:". Neither of these is running in Docker.

I would expect to see the standard PL progress bars in the console, but since nothing is output, I'm not sure how to go about debugging this. I've attached the full logs for local and remote.

  				
Posted 4 months ago by DepravedBee82

Hi @DepravedBee82
After

Starting Task Execution:

it will literally start the process running your code.
Can you send the full log of the Task? What is the code doing? Which system is running the agent (i.e. Windows/Mac/Linux, Docker, etc.)?

  				
Posted 4 months ago by AgitatedDove14

This is exactly my problem, too, which I described above! If you find a solution, I'd be glad if you could share it. 🙂 Of course, I'll also share mine when I find one.

  				
Posted 4 months ago by CumbersomeSealion22
