It’s running an agent without docker; we aren’t using docker
@<1523701087100473344:profile|SuccessfulKoala55> and @<1523701070390366208:profile|CostlyOstrich36> Ok so I found the problem, but it's weird:
when the agent is setting up the environment it's installing torch==1.11.0 and not the one in the requirements, which is torch==1.11.0+cu113.
I've checked the clearml.conf and I do have this flag set:
force_repo_requirements_txt: true
and I have a local whl of torch==1.11.0+cu113 with a path to its location in the requirements.txt, but it's not installing the local whl; it's using a cached one without CUDA.
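For reference, here's roughly what that setup looks like (the paths and wheel filename here are placeholders, not my real ones):
```
# clearml.conf -- relevant agent section (sketch)
agent {
    package_manager {
        # install only what's listed in the repo's requirements.txt
        force_repo_requirements_txt: true
    }
}
```
```
# requirements.txt -- pointing pip at the local cu113 wheel (placeholder path)
./wheels/torch-1.11.0+cu113-cp38-cp38-linux_x86_64.whl
```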
I do know that I have a mismatch between the installed CUDA (12.0) and the one stated in the requirements (11.3), and I noticed that the log says the following:
Torch CUDA 118 index page found
and yet when I run locally it's using my conda env with torch==1.11.0+cu113 perfectly.
Can an agent running with a higher CUDA version run an application built against a lower one?
Why, when running from the agent, is it not installing my requirements and caching them into an env?
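(As a quick sanity check on what the agent's env actually ends up with, something like this inside the task prints the installed build — all standard torch attributes:)
```python
import torch

# which wheel actually got installed in the agent-built env
print(torch.__version__)          # expect "1.11.0+cu113", not plain "1.11.0"
print(torch.version.cuda)         # CUDA toolkit the wheel was built against, e.g. "11.3"
print(torch.cuda.is_available())  # True if the driver/runtime can actually use the GPU
```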
When you run the code locally the package is already installed, right?
@<1523701087100473344:profile|SuccessfulKoala55> But when I use this setting, do the packages download only from the torch repo and not from a local repo? Or does it use the extra-index-url? And is there a way to disable the automatic CUDA detection?
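(To be concrete, these are the knobs I'm asking about — a sketch of the relevant clearml.conf settings, with example values:)
```
agent {
    # pinning this should stop the agent from auto-detecting the CUDA version
    # (the default 0 means auto-detect); value format here is my assumption
    cuda_version: "11.3"

    package_manager {
        # extra pip index URLs the agent passes along to pip
        extra_index_url: ["https://download.pytorch.org/whl/cu113"]
    }
}
```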
Is the agent running on the same machine as the original code that didn't get any errors?
@<1523701295830011904:profile|CluelessFlamingo93> I believe this is basically pip failing to install the correct version. Can you try setting agent.package_manager.pytorch_resolve: direct in the agent's configuration?
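(i.e. in clearml.conf, something like:)
```
agent {
    package_manager {
        # "direct" makes the agent resolve the torch requirement to an explicit
        # wheel link matching the detected CUDA version, instead of letting pip pick
        pytorch_resolve: direct
    }
}
```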
@<1523701295830011904:profile|CluelessFlamingo93> Is this running using the agent's docker mode? Are you using some docker container?