Unanswered
Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:
Can this be reproducible using a simple script that we can also run?
Not really unfortunately - happy to share my code, but I've managed to reproduce this with different codebases.
As a summary of what I've tried:
- Agent on the H100 machine, Server on Kube - Fail
- Agent on laptop, Server on Kube - Fail
- Agent on laptop, Server on Docker Desktop - Pass
So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube rather than an issue with the agents or code. As for which of the 7 containers could be at fault... :man-shrugging: . I'm not seeing anything out of the ordinary in the logs. Is there a verbose setting in the agent that could help us diagnose, i.e. each step of what goes on inTask.init
?
43 Views
0
Answers
3 months ago
3 months ago