Answered
Hi, I've been getting the following error when running training code through an agent

Hi,
I've been getting the following error when running training code through an agent,

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

But when I run the code locally as the same user, it works, so it doesn't look like a CUDA problem; it seems to have something to do with the agent.
I'm kind of stuck, so any help is greatly appreciated!
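
A quick way to compare the two environments is to print the torch build details both locally and from inside the task the agent runs. The following is a minimal sketch, not an official diagnostic; the helper name `describe_torch_build` is made up for illustration. "No kernel image is available" usually means the installed torch wheel was not compiled for the GPU's compute capability, or is a CPU-only build.

```python
import importlib.util


def describe_torch_build():
    """Return a dict describing the installed torch build, or None if torch is absent.

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    info = {
        "torch_version": torch.__version__,       # e.g. "1.11.0+cu113" or "1.11.0"
        "built_for_cuda": torch.version.cuda,     # None for CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
        "compiled_archs": [],
        "device_capability": None,
    }
    if info["cuda_available"]:
        # Architectures the wheel was compiled for, e.g. ["sm_37", "sm_50", ...]
        info["compiled_archs"] = torch.cuda.get_arch_list()
        major, minor = torch.cuda.get_device_capability(0)
        info["device_capability"] = f"sm_{major}{minor}"
    return info


if __name__ == "__main__":
    print(describe_torch_build())
```

If the output differs between the local run and the agent run (for example, a different `torch_version` suffix or a missing architecture), the agent's environment contains a different torch build than the local one.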

  
  
Posted 9 months ago

Answers 9


@<1523701295830011904:profile|CluelessFlamingo93> I believe this is basically pip failing to install the correct version. Can you try setting agent.package_manager.pytorch_resolve: direct in the agent configuration?

  
  
Posted 8 months ago

Yes, same one

  
  
Posted 9 months ago

@<1523701087100473344:profile|SuccessfulKoala55> and @<1523701070390366208:profile|CostlyOstrich36> OK, so I found the problem, but it's weird.
When the agent sets up the environment, it installs torch=1.11.0 instead of the one in the requirements, which is torch=1.11.0+cu113.
I've checked the clearml.conf and I do have this flag set:

force_repo_requirements_txt: true

I also have a local whl of torch=1.11.0+cu113, with the path to its location set in the requirements.txt, but the agent doesn't install the local whl; it uses a cached one without CUDA.
I do know that there is a mismatch between the installed CUDA (12.0) and the one stated in the requirements (11.3), and I noticed that the log says the following:

Torch CUDA 118 index page found

And yet when I run locally, it uses my conda env with torch 1.11.0+cu113 perfectly.
Can an agent running with a higher CUDA version run an application that requires a lower version?
Why, when running from the agent, does it not install my requirements and cache them into an env?
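
To confirm which build actually ended up in the environment, the installed version can be compared against the pin from requirements.txt. This is a rough sketch using only the standard library; the helper names `cuda_tag` and `check_pin` are made up for illustration, and the pinned version is the one mentioned above.

```python
from importlib import metadata


def cuda_tag(version):
    """Return the local build tag of a version string (e.g. 'cu113'), or None."""
    _, sep, local = version.partition("+")
    return local if sep else None


def check_pin(package, pinned_version):
    """Compare the installed version of `package` with the pinned one."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    if installed == pinned_version:
        return f"{package} {installed} matches the pin"
    return (
        f"{package} mismatch: installed {installed} (tag {cuda_tag(installed)}), "
        f"pinned {pinned_version} (tag {cuda_tag(pinned_version)})"
    )


if __name__ == "__main__":
    # Pin taken from the requirements described in this post.
    print(check_pin("torch", "1.11.0+cu113"))
```

Running this at the start of the task shows whether the agent's environment holds the `+cu113` build or a plain `1.11.0` build without a CUDA tag.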

  
  
Posted 9 months ago

It’s running an agent without docker; we aren’t using docker.

  
  
Posted 9 months ago

Is the agent running on the same machine as the original code that didn't get any errors?

  
  
Posted 9 months ago

@<1523701295830011904:profile|CluelessFlamingo93> Is this running in the agent's docker mode? Are you using a docker container?

  
  
Posted 9 months ago

@<1523701087100473344:profile|SuccessfulKoala55> But with this setting, packages are downloaded only from the torch repo and not from a local repo, correct? Or does it use the url-extra-link? And is there a way to disable the automatic CUDA detection?

  
  
Posted 8 months ago

When you run the code locally the package is already installed, right?

  
  
Posted 8 months ago

Yes, it is.

  
  
Posted 8 months ago