It’s running an agent without docker, we aren’t using docker
@<1523701087100473344:profile|SuccessfulKoala55> and @<1523701070390366208:profile|CostlyOstrich36> Ok so I found the problem, but it's weird:
when the agent is setting up the environment it's installing torch==1.11.0, not the one in the requirements, which is torch==1.11.0+cu113.
I've checked the clearml.conf and I do have this flag set:
force_repo_requirements_txt: true
and I have a local whl of torch==1.11.0+cu113, with a path to its location in the requirements.txt, but it's not installing the local whl; it's using a cached one without CUDA.
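For reference, a minimal sketch of the setup (the wheel filename and directory here are illustrative, not my exact ones):

    # clearml.conf on the agent machine
    agent {
      package_manager {
        # install only what's listed in the repo's requirements.txt
        force_repo_requirements_txt: true
      }
    }

    # requirements.txt in the repo, pointing pip at the local wheel
    ./wheels/torch-1.11.0+cu113-cp38-cp38-linux_x86_64.whl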
I do know that I have a mismatch between the installed CUDA (12.0) and the one stated in the requirements (11.3), and I noticed that the log says the following:
Torch CUDA 118 index page found
and yet when I run locally it's using my conda env with torch==1.11.0+cu113 perfectly.
Can an agent running with a higher CUDA version run an application that requires a lower one?
Why, when running from the agent, is it not installing my requirements and caching them into an env?
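For what it's worth, this is the quick sanity check I run in both environments to see which build actually got installed (plain PyTorch, nothing ClearML-specific):

    # prints e.g. "1.11.0+cu113 11.3 True" in my local conda env
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"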
@<1523701295830011904:profile|CluelessFlamingo93> I believe this is basically pip failing to install the correct version. Can you try setting the agent setting agent.package_manager.pytorch_resolve: direct?
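i.e. something like this in the agent's clearml.conf (sketching the nesting from memory, adjust to your file):

    agent {
      package_manager {
        # resolve a direct link to the matching torch wheel instead of
        # relying on the CUDA-based index lookup
        pytorch_resolve: direct
      }
    }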
@<1523701087100473344:profile|SuccessfulKoala55> But when I use this setting, do the packages download only from the torch repo and not from a local repo? Or does it use the extra-index-url? And is there a way to cancel the automatic CUDA detection?
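(For context, our local repo is configured through the package manager's extra index list, roughly like this; the URL is a placeholder:)

    agent {
      package_manager {
        extra_index_url: ["https://our.internal.pypi/simple"]
      }
    }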
@<1523701295830011904:profile|CluelessFlamingo93> Is this running in the agent's docker mode? Are you using some docker container?
When you run the code locally the package is already installed, right?
Is the agent running on the same machine where the original code ran without errors?