thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen connecting parameters and creating the task finish in seconds, and other times it takes 4 minutes.
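For reference, a minimal sketch of the kind of timer wrapper that can produce these numbers. It is stdlib only; the `timed` helper and the labels are hypothetical names, and the `time.sleep` calls are stand-ins for the real task-creation and parameter-connect calls.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed wall-clock seconds

@contextmanager
def timed(label):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-ins for the real calls being measured (hypothetical labels).
with timed("task_create"):
    time.sleep(0.05)
with timed("connect_params"):
    time.sleep(0.01)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f}s")
```

Running the same wrapper against each server URL makes the per-step overhead directly comparable.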
Since I see connection timeout errors somewhat regularly, I'm wondering if I'm having networking problems. Is there a way (at the class level) to control the retry logic for connecting to the API server?
My operating theory is that some sort of backoff / timeout (e.g. 10s) is causing the high variance.
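To make that theory concrete, here is a rough sketch of how a few failed attempts could add up. It assumes a generic exponential backoff with a fixed per-attempt timeout; the function name and the default values (10s timeout, factor 1.0, 120s cap) are my assumptions for illustration, not the client's actual retry settings.

```python
def overhead_seconds(failed_attempts, timeout_s=10.0,
                     backoff_factor=1.0, backoff_max=120.0):
    """Time lost before the first successful attempt (assumed model)."""
    total = 0.0
    for n in range(1, failed_attempts + 1):
        total += timeout_s  # the failed attempt waits out the full timeout
        # then sleeps with exponential backoff, capped at backoff_max
        total += min(backoff_factor * 2 ** (n - 1), backoff_max)
    return total

# Overhead as a function of how many attempts fail before one succeeds.
for failures in range(6):
    print(failures, overhead_seconds(failures))
```

Under this model, zero failures costs nothing while a handful of failures costs tens of seconds, so the overhead would depend almost entirely on how many attempts happen to time out.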
To test this, I spun up a local instance and pointed at localhost as well as the normal reverse proxy, and found that localhost had "overhead times" that were completely reasonable - practically none at all.
The only difference between the two screenshots is the URLs in clearml.conf, and the overhead went from 30s down to 2-3s.
(The server has already been destroyed, so I'm not worried about the keys showing.)