Sadly, the teammate who had the problem re-ran the experiments, so I don't have the task IDs, but I do have the CPU and GPU usage of the agent that ran the experiment:
Hi @<1523701070390366208:profile|CostlyOstrich36> ,
But how do I configure this if I'm not hosting the ClearML server?
Where can I find the services.conf file?
Hi @<1523701295830011904:profile|CluelessFlamingo93> , part of the server is a service that kills such tasks. I think this is what you're looking for - None
Oh, I misunderstood. You mean you're using app.clear.ml?
Then by default these should be killed by the ClearML server after a few hours. How long was it stuck?
We had a few experiments that were stuck for a few hours until we noticed, and one that was stuck for 2 days (over the weekend). None of them were auto-aborted.