Maybe @<1523701087100473344:profile|SuccessfulKoala55> or @<1523701827080556544:profile|JuicyFox94> might have some insight into this 🙂
Is it possible that there is a bug in the fileserver that prevents us from uploading a large file (say, around 25MB)? Btw, if I switch the default output URI in the SDK to upload to an Azure blob storage instead of the fileserver, the upload works fine.
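For reference, this is roughly how I point the SDK at Azure instead of the fileserver (the storage account, container and file path below are just placeholders for my setup):
```python
from clearml import Task

# Route task outputs (artifacts, models) to Azure Blob Storage instead of the fileserver.
# The storage account / container names are placeholders - substitute your own.
task = Task.init(
    project_name="my-project",
    task_name="upload-test",
    output_uri="azure://mystorageaccount.blob.core.windows.net/my-container",
)

# The same ~25MB artifact uploads fine when it goes to Azure.
task.upload_artifact(name="large_file", artifact_object="/path/to/large_file.bin")
```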
It turned out that the issue was caused by my network environment. Somehow my network connection was being throttled, which caused the problem. Switching to a better network environment made it work.
However, when I tried to upload several even larger artifacts in a row (around 200MB each), it failed because the livenessProbe and readinessProbe of the fileserver pod failed. By default, the timeout of both probes is 1s. I increased the timeout to 100s and that fixed the issue. @<1523701827080556544:profile|JuicyFox94> @<1523701087100473344:profile|SuccessfulKoala55>
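In case it helps anyone else, this is roughly how I bumped the probe timeouts. The namespace, deployment name and container index are assumptions from my setup (check with `kubectl -n clearml get deploy` first), and a later `helm upgrade` may overwrite a manual patch like this:
```bash
# Increase the liveness/readiness probe timeouts on the fileserver container.
kubectl -n clearml patch deployment clearml-fileserver --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 100},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 100}
]'
```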
I don't think this is something we can configure in the fileserver...
Hi NervousRabbit2, what version of ClearML server are you running? Also, what clearml version are you using?
I know of deployments where people are uploading hundreds of MBs to the fileserver, so I don't think this is related
Thanks for letting us know, I took a note to do more tests on liveness, ty again!
Hi @<1523701827080556544:profile|JuicyFox94> , no, I expose the services using NodePort
Hi CostlyOstrich36, I deployed the ClearML server in a k8s cluster using the Helm chart version 5.5.0: https://github.com/allegroai/clearml-helm-charts/tree/clearml-5.5.0/charts/clearml , which deployed the v1.9.2 server, I think.
For the SDK, I am using v1.9.1.
Ok, so we can exclude a timeout caused by an ingress controller in the middle. It looks more like something related to connection management in the fileserver. @<1523701087100473344:profile|SuccessfulKoala55> Do we have a way to pass some env var to the fileserver as extraEnv to mitigate or fix this behavior?