Hi ManiacalLizard2, this looks like it could be related to server resources or networking, with the API server having a hard time retrieving the data from ES. What resources have you allocated for the API server and ES?
Sporadic failure to retrieve Scalars and Console logs.
Context: self-hosted in Azure, with two separate Azure Container Apps for the UI and the API server.
Elasticsearch and MongoDB run as Azure service subscriptions.
Symptom: for long-running tasks, we sometimes get an error when fetching Scalars and/or Console logs in the WebUI. After refreshing the page enough times, the Scalars/Console logs are retrieved and displayed as normal. The issue happens more often with big tasks (e.g. 12k iterations).
We managed to reproduce the issue with a curl API call, so we don't think the problem is related to the WebUI:
curl -v -X POST \
-d '{"task":"4c9224c6ec82425bbd66256de45c0e23","key":"iter"}' \
--output clearml_out.gz \
-H 'Accept: application/json' \
-H 'Accept-Encoding: gzip, deflate, br, zstd' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Connection: keep-alive' \
-H 'Content-Length: 56' \
-H 'Content-Type: application/json' \
-H 'Cookie: _ga=GA1.2.9817188[REDACTED]' \
-H 'Host: REDACTED.azurecontainerapps.io' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0' \
-H 'X-Allegro-Client: Webapp-2.0.0-613' \
-H 'sec-ch-ua: "Not(A:Brand";v="99", "Microsoft Edge";v="133", "Chromium";v="133"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Windows"'
which yields this error:
} [56 bytes data]
100 56 0 0 100 56 0 46 0:00:01 0:00:01 --:--:-- 46
< HTTP/1.1 200 OK
< server: nginx/1.22.1
< date: Sun, 16 Feb 2025 19:04:39 GMT
< content-type: application/json
< content-length: 79288
< vary: Accept-Encoding
< content-encoding: zstd
<
{ [15180 bytes data]
* transfer closed with 13947 bytes remaining to read
82 79344 82 65341 100 56 35765 30 0:00:02 0:00:01 0:00:01 35814
* Closing connection
} [5 bytes data]
* TLSv1.3 (OUT), TLS alert, close notify (256):
} [2 bytes data]
curl: (18) transfer closed with 13947 bytes remaining to read
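The numbers in the curl output above are consistent with a response cut off mid-body: the server declared content-length 79288, curl had received 65341 bytes, and 79288 - 65341 = 13947 bytes never arrived. To measure how often this happens, the same request could be repeated from a small script. The following is only a sketch using the Python standard library; the function names `missing_bytes` and `probe`, and the URL in the usage comment, are hypothetical and not part of ClearML. Headers such as the Cookie from the curl call would need to be passed in for authentication.

```python
import http.client
import json
import urllib.request


def missing_bytes(content_length: int, received: int) -> int:
    """Bytes the server declared but never delivered (0 means complete)."""
    return max(content_length - received, 0)


def probe(url, payload, headers=None, attempts=20, timeout=60.0):
    """POST the same payload repeatedly (hypothetical helper).

    Returns one entry per attempt: 0 for a complete body, otherwise the
    number of bytes still expected when the connection closed (None if
    the server did not declare a length). HTTP error statuses and
    connection errors are not caught and will propagate.
    """
    body = json.dumps(payload).encode()
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    results = []
    for _ in range(attempts):
        req = urllib.request.Request(
            url, data=body, headers=all_headers, method="POST"
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            try:
                resp.read()
                results.append(0)
            except http.client.IncompleteRead as exc:
                # exc.expected is the number of bytes still outstanding.
                results.append(exc.expected)
    return results


# Figures taken from the curl output: declared length vs. bytes received.
print(missing_bytes(79288, 65341))

# Usage sketch (placeholder URL; substitute the real API endpoint):
# results = probe(
#     "https://example.invalid/api/endpoint",
#     {"task": "4c9224c6ec82425bbd66256de45c0e23", "key": "iter"},
#     headers={"Cookie": "..."},
# )
# print(sum(1 for r in results if r != 0), "of", len(results), "truncated")
```

Running the probe against the API server directly and then through the Container App ingress might help narrow down which hop closes the connection early.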
We did not find any relevant messages in the ES log.
Do you have any tips or hints on how we can diagnose this issue further? SuccessfulKoala55 CostlyOstrich36 Thanks in advance 😉