Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Environment setup completed successfully
Starting Task Execution:

Is there any way of figuring out why the remote Task hangs, and how would I go about debugging it?

WebApp: 1.15.1-478 • Server: 1.15.1-478 • API: 2.29

  				
Posted 5 months ago by DepravedBee82


Can you add the following before the Task.init call?

import os
print(os.environ)
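
Because the full dump is long, a minimal sketch that prints only the agent-related variables might look like the following. The helper name `clearml_env` and the two prefixes are assumptions made for illustration; they are not part of the ClearML API.

```python
import os


def clearml_env(environ=None, prefixes=("CLEARML_", "TRAINS_")):
    """Return only the variables whose names start with one of the prefixes."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in sorted(environ.items()) if k.startswith(prefixes)}


# flush=True so the line reaches the log even if the process hangs right after
print("Before Task.init", clearml_env(), flush=True)
```

Placing this directly before Task.init narrows down whether the hang happens before or after the environment is read.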

  				
Posted 4 months ago by AgitatedDove14

Hi @DepravedBee82, can you perhaps add a simple print at the start of your code, before any import?

  				
Posted 5 months ago by SuccessfulKoala55

He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?

Edit: it might be a container! Just checking now...

  				
Posted 4 months ago by DepravedBee82

Hmm, I'm out of ideas, I see no reason why it would get stuck.
Try removing all the auto loggers. This can be done with:

Task.init(..., auto_connect_frameworks=False)

which disconnects all the automatic loggers (Hydra etc.). Of course, this is for debugging purposes only.

  				
Posted 4 months ago by AgitatedDove14

Nope - confirmed to be running on the OS's Python environment,

OK, so bare-metal root is definitely not recommended.
I'm not sure how/why it gets stuck though 😞
Any chance you can run the agent as non-root?
Docker mode may also be preferable, since it makes it easier for you to control the environment of the Task.

  				
Posted 4 months ago by AgitatedDove14

Although it's still really weird how it was failing silently

Totally agree. I think the main issue was that the agent had the correct configuration, but the container/env the agent was spinning up was missing it.
I'll double-check why it did not print anything.
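
A quick way to surface a missing configuration inside the spawned environment is to check it explicitly at the top of the script. This is only a sketch: it assumes the agent points the Task at its configuration through the `CLEARML_CONFIG_FILE` variable, and `config_problems` is a hypothetical helper, not a ClearML function.

```python
import os
from pathlib import Path


def config_problems(environ=None):
    """Return a list of human-readable problems with the config file lookup."""
    environ = os.environ if environ is None else environ
    problems = []
    cfg = environ.get("CLEARML_CONFIG_FILE")
    if not cfg:
        # Assumption: without this variable the default location is used instead
        problems.append("CLEARML_CONFIG_FILE is not set")
    elif not Path(cfg).is_file():
        problems.append(f"config file {cfg} does not exist")
    elif not os.access(cfg, os.R_OK):
        problems.append(f"config file {cfg} is not readable by this user")
    return problems


for problem in config_problems():
    print("config check:", problem, flush=True)
```

An empty list does not prove the configuration is correct; it only rules out the file being absent or unreadable.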

  				
Posted 4 months ago by AgitatedDove14

I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞

  				
Posted 4 months ago by DepravedBee82

Please let me know what you find 🤞

  				
Posted 4 months ago by AgitatedDove14

@AgitatedDove14 we've now configured the server to have its own user account to run the agent, so it is no longer running as root, but no luck 😞

Before os.environ
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/clearml', 'LOGNAME': 'clearml', 'USER': 'clearml', 'SHELL': '/bin/bash', 'INVOCATION_ID': 'da8e36a03c7348efbb7db360755e92b3', 'JOURNAL_STREAM': '8:244189055', 'SYSTEMD_EXEC_PID': '1970812', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARML_CONFIG_FILE': '/tmp/.clearml_agent.4ll2u471.cfg', 'TRAINS_CONFIG_FILE': '/tmp/.clearml_agent.4ll2u471.cfg', 'CLEARML_TASK_ID': '4ab4c22b02ed4d1f86ff4fac663828f0', 'TRAINS_TASK_ID': '4ab4c22b02ed4d1f86ff4fac663828f0', 'CLEARML_LOG_LEVEL': 'INFO', 'TRAINS_LOG_LEVEL': 'INFO', 'CLEARML_LOG_TASK_TO_BACKEND': '0', 'TRAINS_LOG_TASK_TO_BACKEND': '0', 'PYTHONPATH': '/home/clearml/.clearml/venvs-builds/3.9/task_repository/ml-queue-test:/home/clearml/.clearml/venvs-builds/3.9/task_repository/ml-queue-test::/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/home/clearml/.clearml/venvs-builds/3.9/lib64/python3.9/site-packages:/home/clearml/.clearml/venvs-builds/3.9/lib/python3.9/site-packages'})
Before Task.init

  				
Posted 4 months ago by DepravedBee82

THAT WORKED! 🎉

  				
Posted 4 months ago by DepravedBee82

This is exactly my problem too, which I described above! If you find a solution, I would be glad if you could share it. 🙂 Of course, I will also share mine when I find one.

  				
Posted 5 months ago by CumbersomeSealion22

@DepravedBee82 I just realized, the agent is not running in docker mode, correct? (i.e. it is in venv mode)
If this is the case, how come it is running as root? (Could it be that it is running inside a container? How was that container spun up?)
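
To answer both questions from inside the Task itself, something like the sketch below could be added at the top of the script. The container checks are common heuristics only (the `/.dockerenv` marker file and cgroup names) and can miss some setups; `runtime_report` is a hypothetical helper name.

```python
import os
from pathlib import Path


def runtime_report():
    """Collect a few hints about the current user and a possible container."""
    try:
        cgroup_text = Path("/proc/1/cgroup").read_text()
    except OSError:
        cgroup_text = ""
    return {
        # geteuid is only available on Unix-like systems
        "is_root": hasattr(os, "geteuid") and os.geteuid() == 0,
        "user": os.environ.get("USER", "<unknown>"),
        # Heuristics only: neither check is guaranteed to detect every container
        "dockerenv_file": Path("/.dockerenv").exists(),
        "container_cgroup": any(
            marker in cgroup_text for marker in ("docker", "kubepods", "containerd")
        ),
    }


print(runtime_report(), flush=True)
```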

  				
Posted 4 months ago by AgitatedDove14


Ok so my train.py now looks like this:

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

from clearml import Task

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

And the log looks like this:

Starting Task Execution:
Before import
2024-07-19 09:06:09
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
Before task

So it looks like it's getting stuck at Task.init
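
One way to see where a silent hang like this sits is the standard-library `faulthandler` module, which can dump every thread's stack after a timeout without killing the process. The sketch below wraps it in a context manager; `hang_watchdog` and the 60-second default are assumptions for illustration, not something ClearML provides.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=60, file=sys.stderr):
    """Dump all thread stacks to `file` if the block runs longer than timeout_s."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        yield
    finally:
        # Cancel the pending dump once the block has finished
        faulthandler.cancel_dump_traceback_later()


# Usage in train.py (commented out here because it needs a ClearML server):
# with hang_watchdog(60):
#     task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
```

If Task.init is still blocked after the timeout, the dump shows the call it is waiting in, which should help tell a network wait from a configuration problem.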

  				
Posted 5 months ago by DepravedBee82

Will try non-root and get back to you. I'm also trying to reproduce on a different machine.

  				
Posted 4 months ago by DepravedBee82

My money is on the Redis container, although comparing the logs between Kube & Docker Desktop, nothing looks out of the ordinary...

  				
Posted 4 months ago by DepravedBee82
