Thanks for letting us know. I took a note to add more tests on liveness, ty again!
It turned out that the issue was caused by my network environment. Somehow it was being throttled, which led to the issue. Switching to a better network made it work.
However, when I tried to upload even larger artifacts in a row (around 200MB each), the upload failed because the livenessProbe and readinessProbe of the fileserver pod failed. By default, the timeout of the two probes is 1s. I increased the timeout to 100s and that fixed the issue. @<1523701827080556544:profile|JuicyFox94> @<1523701087100473344:profile|SuccessfulKoala55>
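For reference, the probe timeout change described in this message could be expressed as a Kubernetes strategic-merge patch. This is only a minimal sketch: the helper name `probe_timeout_patch` and the container name `clearml-fileserver` are assumptions for illustration and are not taken from the chart. Only `timeoutSeconds` on `livenessProbe` and `readinessProbe` are standard Kubernetes fields.

```python
import json


def probe_timeout_patch(container_name: str, timeout_seconds: int) -> dict:
    """Build a strategic-merge patch that sets timeoutSeconds on both probes.

    container_name is hypothetical here; use the name of the fileserver
    container in your own deployment.
    """
    if timeout_seconds < 1:
        raise ValueError("timeoutSeconds must be at least 1")
    probe = {"timeoutSeconds": timeout_seconds}
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are matched by name in a strategic merge.
                            "name": container_name,
                            "livenessProbe": dict(probe),
                            "readinessProbe": dict(probe),
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # 100s is the value that worked in this thread; the container name is assumed.
    patch = probe_timeout_patch("clearml-fileserver", 100)
    print(json.dumps(patch, indent=2))
```

The printed JSON could then be applied with `kubectl patch deployment <fileserver-deployment> --patch '<json>'`, where the deployment name depends on your release.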
I know of deployments where people are uploading hundreds of MBs to the fileserver, so I don't think this is related
Is it possible that there is a bug in the fileserver that prevents us from uploading a large file (say around 25MB)? Btw, if I switch the default output URI in the SDK to upload to Azure Blob Storage instead of the fileserver, the functionality works fine.
I don't think this is something we can configure in the fileserver...
Ok, so we can exclude a timeout due to an ingress controller in the middle. It looks more like something related to connection management in the fileserver. @<1523701087100473344:profile|SuccessfulKoala55> Do we have a way to pass some envvar to file manager as extraenv to mitigate or fix this behavior?
Hi @<1523701827080556544:profile|JuicyFox94> , no, I expose the services using NodePort
Maybe @<1523701087100473344:profile|SuccessfulKoala55> or @<1523701827080556544:profile|JuicyFox94> might have some insight into this 🙂
Hi CostlyOstrich36 , I deployed the ClearML server in a k8s cluster using version 5.5.0 of the Helm chart: https://github.com/allegroai/clearml-helm-charts/tree/clearml-5.5.0/charts/clearml , which deployed the v1.9.2 server, I think.
For the SDK, I am using v1.9.1.
Hi NervousRabbit2 , what version of ClearML server are you running? Also, what clearml version are you using?