Hi, I'Ve Been Getting The Following Error When Running Training Code Through An Agent,

Unanswered

@<1523701087100473344:profile|SuccessfulKoala55> and @<1523701070390366208:profile|CostlyOstrich36> Ok so I found the problem but its weird,
when the agent is setting up the enviorment its installing torch=1.11.0 and not installing the one in the requirements which is torch=1.11.0+cu113,
I've checked the clearml.conf and i do have this flag set:

force_repo_requirements_txt: true

and I have a local whl of torch=1.11.0+cu113 with a path set to its location in the requirements.txt but its not installing the local whl but using a cached one without cuda.
i do know that i have a miss match between the installed cuda (12.0) and the one stated in the requirements(11.3) and i noticed in the log that it says the following:

Torch CUDA 118 index page found

and yet when i run locally Its using my conda env with torch1.11.0+cu113 perfectly,
Can an a agent run with a higher version CUDA run a application with a lower version?
Why when running from the agent its not installing my requirements and caching them into a env?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					CluelessFlamingo93
				
					0
					 × 1

149 Views

0 Answers

one year ago