We were able to find an error from the autoscaler agent:
Stuck spun instance dynamic_worker:clearml-agent-autoscale:p2.xlarge:i-015001a93e0910a09 of type clearml-agent-autoscale
2022-04-19 19:16:58,339 - clearml.auto_scaler - INFO - Spinning down stuck worker: 'dynamic_worker:clearml-agent-autoscale:p2.xlarge:i-015001a93e0910a09
Hi CloudySwallow27
This error occurs randomly during training (in other words, training does start successfully).
Which clearml-agent version are you using, and which clearml version?
Worker CLEARML-AGENT version 1.1.2
The autoscaler instance Clearml-AGENT version: 1.2.3
ClearML WebApp: 1.2.0-153 Server: 1.2.0-153 API: 2.16
WonderfulArcticwolf3 and CloudySwallow27, are you running it as a service or via the apps? What's the clearml version (not the agent)?
TimelyPenguin76 not sure what you mean by "as a service or via the apps", but we are self-hosting it. Does that answer the question?
Also, not sure what you mean by which "clearml version". How do we check this? The clearml python package is 1.1.4. Is that what you wanted?
CloudySwallow27 yes, this is what I wanted to know. Can you try with the latest clearml version? pip install clearml==1.3.2
?
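For anyone else wondering how to check the "clearml version" asked about above: a minimal sketch, assuming the packages were installed with pip into the same Python environment the script runs in. It uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is made up for this example and is not part of ClearML.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a pip package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages discussed in this thread.
for name in ("clearml", "clearml-agent"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run this both on the worker and on the autoscaler instance, since the thread shows the two can have different versions installed.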