I know of deployments where people are uploading hundreds of MBs to the fileserver, so I don't think this is related
thanks for letting us know, I took a note to run more tests on liveness, ty again!
Is it possible that there is a bug in the fileserver that prevents us from uploading a large file (say around 25MB)? Btw, if I switch the default output URI in the SDK to upload to Azure blob storage instead of the fileserver, the functionality works fine.
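For reference, here is roughly how I point the SDK at Azure instead of the fileserver. The project/account/container names below are just placeholders, and I'm assuming the Azure storage credentials are already configured in clearml.conf:

```python
from clearml import Task

# Placeholder names; Azure storage credentials are assumed to be set in clearml.conf
task = Task.init(
    project_name="demo",
    task_name="large-upload-test",
    output_uri="azure://myaccount.blob.core.windows.net/my-container/",
)

# A ~25MB artifact uploads fine this way, but fails for me when output_uri
# is left at the default (the ClearML fileserver).
task.upload_artifact(name="big_file", artifact_object="/path/to/25mb_file.bin")
```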
I don't think this is something we can configure in the fileserver...
Hi NervousRabbit2, what version of ClearML server are you running? Also, what clearml version are you using?
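If you're not sure about the SDK side, something like this should print it:

```python
import clearml

# Prints the installed clearml SDK version
print(clearml.__version__)
```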
Hi CostlyOstrich36, I deployed the ClearML server in a k8s cluster using the Helm chart version 5.5.0: https://github.com/allegroai/clearml-helm-charts/tree/clearml-5.5.0/charts/clearml , which I think deployed the v1.9.2 server.
For the SDK, I am using v1.9.1.
Maybe @<1523701087100473344:profile|SuccessfulKoala55> or @<1523701827080556544:profile|JuicyFox94> might have some insight into this 🙂
Hi @<1523701827080556544:profile|JuicyFox94>, no, I expose the services using NodePort
Ok, so we can exclude a timeout due to an ingress controller in the middle. It looks more like something related to connection management in the Fileserver. @<1523701087100473344:profile|SuccessfulKoala55> Do we have a way to pass some env var to the Fileserver as extraEnv to mitigate or fix this behavior?
It turned out that the issue was caused by my network environment: it was somehow being throttled. Switching to a better network made it work.
However, when I tried to upload even larger artifacts in a row (around 200MB each), it failed because the livenessProbe and readinessProbe of the fileserver pod kept failing. By default, the timeout of both probes is 1s; I increased it to 100s and that fixed the issue. @<1523701827080556544:profile|JuicyFox94> @<1523701087100473344:profile|SuccessfulKoala55>
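In case it helps anyone else, this is roughly how I bumped the probe timeouts. The deployment/container/namespace names are just what they look like in my cluster (installed from the helm chart), so adjust them for yours:

```python
from kubernetes import client, config

# Sketch: patch the fileserver deployment so both probes allow a 100s timeout.
# The names below ("clearml-fileserver", namespace "clearml") are assumptions
# based on my helm release; check `kubectl get deploy -n <namespace>` for yours.
config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "clearml-fileserver",
                        "livenessProbe": {"timeoutSeconds": 100},
                        "readinessProbe": {"timeoutSeconds": 100},
                    }
                ]
            }
        }
    }
}

# Containers are merged by name (strategic merge patch), so only the probe
# timeouts change; everything else stays as the chart deployed it.
apps.patch_namespaced_deployment(
    name="clearml-fileserver",
    namespace="clearml",
    body=patch,
)
```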