Hi, I've been getting the following error when running training code through an agent

Answered

Hi,
I've been getting the following error when running training code through an agent:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

But when I run the code locally as the same user it works, so it isn't a CUDA problem and has something to do with the agent.
Kinda stuck so any help is greatly appreciated!
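One way to narrow down an error like the one above is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for, once in the local environment and once in the environment the agent creates. The sketch below is only a rough check: `parse_arch`, `device_is_covered`, and `report` are hypothetical helper names, and the matching rule (a native `sm_XY` image runs on the same major version with an equal or higher minor version, a `compute_XY` PTX image can be JIT-compiled on any newer device) is a simplification of CUDA's compatibility rules.

```python
import re


def parse_arch(flag):
    """Turn 'sm_86' or 'compute_80' into (kind, (major, minor)); None if unrecognized."""
    match = re.fullmatch(r"(sm|compute)_(\d+)", flag)
    if match is None:
        return None
    number = int(match.group(2))
    return match.group(1), (number // 10, number % 10)


def device_is_covered(capability, arch_list):
    """Rough check: does a build with these arch flags carry a usable kernel image?"""
    for flag in arch_list:
        parsed = parse_arch(flag)
        if parsed is None:
            continue  # skip flags this sketch does not understand
        kind, (major, minor) = parsed
        if kind == "sm" and major == capability[0] and minor <= capability[1]:
            return True  # native binary for this GPU generation
        if kind == "compute" and (major, minor) <= tuple(capability):
            return True  # PTX that the driver can JIT-compile for a newer GPU
    return False


def report():
    """Print what the current environment's torch build supports."""
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available to this build")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device capability:", capability)
    print("compiled arch list:", arch_list)
    print("covered:", device_is_covered(capability, arch_list))


if __name__ == "__main__":
    report()
```

If the two environments print different torch versions or arch lists, the agent is probably installing a different build than the one used locally.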

  				
Posted 2 years ago by CluelessFlamingo93


Answers 9

yes it is

  				
Posted 2 years ago by CluelessFlamingo93

When you run the code locally the package is already installed, right?

  				
Posted 2 years ago by SuccessfulKoala55

@SuccessfulKoala55 But when I use this setting, the packages are downloaded only from the torch repo and not from a local repo, correct? Or does it use the url-extra-link? And is there a way to cancel the automatic CUDA detection?

  				
Posted 2 years ago by CluelessFlamingo93

@CluelessFlamingo93 I believe this is basically pip failing to install the correct version. Can you try setting the agent setting agent.package_manager.pytorch_resolve: direct?

  				
Posted 2 years ago by SuccessfulKoala55

@SuccessfulKoala55 and @CostlyOstrich36 OK, so I found the problem, but it's weird.
When the agent sets up the environment, it installs torch 1.11.0 rather than the version in the requirements, which is torch 1.11.0+cu113.
I've checked the clearml.conf and I do have this flag set:

force_repo_requirements_txt: true

I also have a local whl of torch 1.11.0+cu113, with the path to its location set in the requirements.txt, but the agent isn't installing the local whl; it's using a cached one without CUDA.
I do know that I have a mismatch between the installed CUDA (12.0) and the one stated in the requirements (11.3), and I noticed that the log says the following:

Torch CUDA 118 index page found

And yet, when I run locally, it uses my conda env with torch 1.11.0+cu113 perfectly.
Can an agent running with a higher CUDA version run an application built for a lower version?
Why, when running from the agent, is it not installing my requirements and caching them into an env?
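To confirm the mismatch described above from inside the agent's environment, the installed torch version string can be compared with the pinned one, including the build tag after the `+`. This is a minimal sketch: `split_build_tag` and `describe_mismatch` are hypothetical helper names, and the pinned value is taken from the requirements described in this post.

```python
def split_build_tag(version):
    """Split '1.11.0+cu113' into ('1.11.0', 'cu113'); the tag is None when absent."""
    base, _, tag = version.partition("+")
    return base, (tag or None)


def describe_mismatch(pinned, installed):
    """Return a short message if the installed build differs from the pin, else None."""
    pinned_base, pinned_tag = split_build_tag(pinned)
    installed_base, installed_tag = split_build_tag(installed)
    if pinned_base != installed_base:
        return f"version differs: pinned {pinned_base}, installed {installed_base}"
    if pinned_tag != installed_tag:
        return f"build tag differs: pinned {pinned_tag}, installed {installed_tag}"
    return None


if __name__ == "__main__":
    pinned = "1.11.0+cu113"  # the version named in the requirements above
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        message = describe_mismatch(pinned, torch.__version__)
        print(message or "installed build matches the pin")
```

Running it at the start of the task, or in the environment the agent created, shows whether the agent ended up with a build that lacks the expected `cu113` tag.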

  				
Posted 2 years ago by CluelessFlamingo93

It's running an agent without Docker; we aren't using Docker.

  				
Posted 2 years ago by CluelessFlamingo93

@CluelessFlamingo93 Is this running in the agent's Docker mode? Are you using a Docker container?

  				
Posted 2 years ago by SuccessfulKoala55

Yes, same one

  				
Posted 2 years ago by CluelessFlamingo93

Is the agent running on the same machine as the original code that didn't get any errors?

  				
Posted 2 years ago by CostlyOstrich36
