Hi @<1523701087100473344:profile|SuccessfulKoala55>, each worker uses its own venvs-builds folder.
But again, the system denies my deletion request because it considers the venvs-builds directory to be in use.
Does this provide any more context? @<1523701087100473344:profile|SuccessfulKoala55>
Hey!
I see in my agent debug logs that the agent keeps dropping its connection to the ClearML Server. I also see my tasks being aborted with "User aborted (3)", right at the point where the (post-requirements) venv is added to the local venv cache. Could the two be connected? If not, does anyone have an idea where I should continue debugging?
![image](https://clearml-web-assets.s3.am...
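One way to check whether the connection drops line up with the aborts is to probe the server from the agent machine on a schedule and log each failure with a timestamp. Below is a minimal stdlib-only sketch, assuming the agent reaches the API server over plain TCP. The helper names `probe` and `watch` are hypothetical, and port 8008 is assumed to be the API server port (the default in a standard self-hosted ClearML Server deployment); adjust both host and port to your setup.

```python
import socket
import time
from datetime import datetime


def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (socket.timeout is an OSError) and DNS failures
        return False


def watch(host: str, port: int, interval: float = 10.0, rounds: int = 6) -> list:
    """Probe host:port `rounds` times, `interval` seconds apart; return timestamps of failures."""
    failures = []
    for i in range(rounds):
        if not probe(host, port):
            stamp = datetime.now().isoformat(timespec="seconds")
            failures.append(stamp)
            print(f"{stamp} connection to {host}:{port} FAILED")
        if i < rounds - 1:
            time.sleep(interval)
    return failures
```

For example, `watch("clearml-server.example.local", 8008, interval=10.0, rounds=360)` (hypothetical host name) would probe every 10 seconds for an hour. Comparing the printed failure timestamps with the abort times in the task log shows whether the two actually coincide.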
@<1523701087100473344:profile|SuccessfulKoala55> Thanks, I'll check around that!
Hey @<1523701087100473344:profile|SuccessfulKoala55>, 1.8.1
It would definitely seem that way, although when I look at the error logs, the failure actually happens when creating the repository folder inside venvs-builds (under the task_repository subfolder):
Repository cloning failed: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\clearml-admin\\.clearml\\venvs-builds\\3.1\\task_repository\\<repo>.git'
Huh, that's weird. Is there any way to force its deletion? It seems it's still being held by some task, even though the server has been restarted several times since.
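WinError 183 means the clone target already exists, so the stale `<repo>.git` folder has to be removed before the agent can clone again. Note that restarting the ClearML Server does not release a file handle held by a process on the agent machine; if the folder is truly "in use", the process holding it (for example a leftover agent, Python, or git process on that machine) has to be stopped first. Once nothing holds it, a common obstacle on Windows is that git writes its object files as read-only, which makes a plain recursive delete fail. Below is a minimal sketch, assuming Python 3 on the agent machine and the agent stopped; `remove_stale_dir` is a hypothetical helper, not part of clearml-agent.

```python
import os
import shutil
import stat
import sys
from pathlib import Path


def _make_writable_and_retry(func, path):
    """Clear the read-only flag on `path` and retry the failed removal call once.

    If the retry still fails (e.g. another process holds an open handle),
    the OSError propagates out of shutil.rmtree to the caller.
    """
    os.chmod(path, stat.S_IWRITE)
    func(path)


def remove_stale_dir(target: Path) -> bool:
    """Recursively delete `target` if it exists; return True if it is gone afterwards."""
    if not target.exists():
        return True
    if sys.version_info >= (3, 12):
        # Python 3.12+ prefers `onexc` (the handler receives the exception instance)
        shutil.rmtree(target, onexc=lambda func, path, exc: _make_writable_and_retry(func, path))
    else:
        # Older versions use `onerror` (the handler receives sys.exc_info())
        shutil.rmtree(target, onerror=lambda func, path, exc_info: _make_writable_and_retry(func, path))
    return not target.exists()
```

Usage would look like `remove_stale_dir(Path(r"C:\Users\clearml-admin\.clearml\venvs-builds\3.1\task_repository") / "<repo>.git")`, with `<repo>` replaced by the actual repository name. If it still raises a "being used by another process" error, some process on the agent machine still holds a handle and needs to be found and stopped.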
Not yet, I tried making it work manually. Might give it a try, thanks!
Hi @<1523701070390366208:profile|CostlyOstrich36> ,
In the clearml-agent Helm chart values, I tried setting requests and limits under the k8sGlue configuration to force the pods to pick up the GPU from the server, while of course choosing a pod image for the k8s jobs that includes CUDA support (we're using nvidia/cuda:12.4.1 for testing).
The job is created but simply can't detect a GPU. Attaching the value overrides I'm using for the chart:
agentk8sglue:
apiServer...
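In Kubernetes, a CUDA base image by itself does not give a container a GPU. The pod generally has to request the `nvidia.com/gpu` extended resource in its limits, and the node needs the NVIDIA device plugin (or the GPU Operator) installed so that this resource exists. Where exactly those limits go in the clearml-agent chart values is not visible here, since the snippet above is truncated. As a first diagnostic, the minimal sketch below can be run inside the job's pod to report what it actually sees. It assumes that `nvidia-smi` is usually mounted into the container by the NVIDIA container runtime rather than shipped in the CUDA image, so its absence is only a hint, not proof; the function names are hypothetical.

```python
import os
import shutil
import subprocess


def parse_gpu_list(nvidia_smi_output: str) -> list:
    """Extract the 'GPU N: ...' lines printed by `nvidia-smi -L`."""
    return [
        line.strip()
        for line in nvidia_smi_output.splitlines()
        if line.strip().startswith("GPU ")
    ]


def visible_gpus() -> list:
    """Return the GPUs that nvidia-smi reports inside this container, or [] if none."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Often means the NVIDIA runtime did not attach a GPU to this container
        return []
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return []
    return parse_gpu_list(result.stdout)


def report() -> dict:
    """Collect the GPU-related environment variable and the list of visible GPUs."""
    return {
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        "gpus": visible_gpus(),
    }
```

If `report()["gpus"]` comes back empty inside the pod, the problem is most likely in how the pod requests the GPU (resource limits, device plugin, node selection) rather than in the image itself.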