Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Answered

Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Environment setup completed successfully
Starting Task Execution:

Is there any way of figuring out why the remote Task hangs and how would I go about debugging it?

WebApp: 1.15.1-478 • Server: 1.15.1-478 • API: 2.29

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Votes Newest

Answers 46

My money is on the Redis container although comparing the logs between Kube & Docker Desktop, nothing looks out of the ordinary...

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Sorry, on the remote machine (i.e. enqueue it and let the agent run it), this will also log the print 🙂

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

It’s a Dell XE9680 rack server with 8xH100s which is located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML) which work fine (will check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions; 3.9 (Alma default) and 3.11 which you can see in the log above. It’s a really bizarre issue and outside of print statements I’m not really sure where to look.

You mentioned sync argparser & reporting, so I’ll try removing Hydra to rule that out, and other loggers in PL and see from there …

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

This is exactly my problem, too, which I described above! If you find any solution, would be glad if you could share. 🙂 Of course, I also share mine when I get one.

  				
Posted 
	one year ago

					More
				  		
  Report
		
					CumbersomeSealion22
				
					0
					 × 1

I managed to set up my (Windows) laptop as a worker and reproduce the issue. Would that suggest an issue with ClearML server?

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc. work, but if there are any obvious things we should check, let me know and I can ask our DevOps engineer

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Agent on laptop, Server on Kube - Fail

So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube

Yeah that feels like a network config issue...

Is there a verbose setting in the agent that could help us diagnose,

yes running with debug turned on on.
since you managed to reproduce on your latop you can try to run the agent with --debug to test, specifically:

clearml-agent --debug daemon ....

if you are running it in venv mode (which I think the setup) you can also just specify the Task ID and test that (no daemon just execution)

clearml-agent --debug execute --id <task_id_here>

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

confirmed that the change had been added by

Make sure you see them in the Task log in the UI (the agent print it when it starts)

Any insight on how we can reproduce the issue?

Can this be reproducible using a simple script that we can also run?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I think I've found a clue after running with debug:

Before Task.init
Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
Retrying (Retry(total=238, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:07
Retrying (Retry(total=237, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:12
Retrying (Retry(total=236, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:33
Retrying (Retry(total=235, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:21:03
Retrying (Retry(total=234, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:22:08
Retrying (Retry(total=233, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login

So my current theory is that the env the agent is building doesn't have the corporate TLS/SSL certificates. It's weird how it was failing silently without the --debug flag though...

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

@<1523701205467926528:profile|AgitatedDove14> we've now configured the server to have it's own user account to run the agent so it is no longer running as root, but no luck 😞

Before os.environ
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/clearml', 'LOGNAME': 'clearml', 'USER': 'clearml', 'SHELL': '/bin/bash', 'INVOCATION_ID': 'da8e36a03c7348efbb7db360755e92b3', 'JOURNAL_STREAM': '8:244189055', 'SYSTEMD_EXEC_PID': '1970812', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARML_CONFIG_FILE': '/tmp/.clearml_agent.4ll2u471.cfg', 'TRAINS_CONFIG_FILE': '/tmp/.clearml_agent.4ll2u471.cfg', 'CLEARML_TASK_ID': '4ab4c22b02ed4d1f86ff4fac663828f0', 'TRAINS_TASK_ID': '4ab4c22b02ed4d1f86ff4fac663828f0', 'CLEARML_LOG_LEVEL': 'INFO', 'TRAINS_LOG_LEVEL': 'INFO', 'CLEARML_LOG_TASK_TO_BACKEND': '0', 'TRAINS_LOG_TASK_TO_BACKEND': '0', 'PYTHONPATH': '/home/clearml/.clearml/venvs-builds/3.9/task_repository/ml-queue-test:/home/clearml/.clearml/venvs-builds/3.9/task_repository/ml-queue-test::/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/home/clearml/.clearml/venvs-builds/3.9/lib64/python3.9/site-packages:/home/clearml/.clearml/venvs-builds/3.9/lib/python3.9/site-packages'})
Before Task.init

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

I just ran with this in my local task, and all the env vars were printed to console, but in ClearML they are not in the console log. Presumably that's because it's printed before ClearML is logging?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Although it's still really weird how it was failing silently

totally agree, I think the main issue was the agent had the correct configuration, but the container / env the agent was spinning was missing it,
I'll double check how come it did not print anything

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hmm no change after adding that unfortunately (confirmed that the change had been added by clearml-agent config ) 😞

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc.

The only thing that I can think of is that something is not right the the load balancer on the server so maybe some requests coming from an instance on the cluster are blocked ...
Hmm, saying that aloud that actually could be?! Try to add the following line to the end of the clearml.conf on the machine running the agent:

api.http.default_method: "put"

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

@<1724960464275771392:profile|DepravedBee82> I just realized, the agent is Not running in docker mode, correct? (i.e. venv mode)
If this is the case how come it is running as root? (could it be is is running inside a container? how was that container spinned?)

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I managed to set up my (Windows) laptop as a worker and reproduce the issue.

Any insight on how we can reproduce the issue?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Show more results

Write your answer

109K Views

46 Answers

one year ago