SlimyDove85 this seems to be some network error with the ClearML Server
The API server does not restart during the process. I'll try to see if I can catch something in its logs. Where should I monitor the networking, i.e., what is the flow? 😅
This is a connection failure from the agent to the apiserver. The flow should be agent pod -> apiserver svc -> apiserver pod. The apiserver may also have something in its logs that can be checked.
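To narrow down which hop fails, one option is to run a small probe from inside the agent pod against the apiserver service and watch for intermittent drops. This is only a sketch: the service name `clearml-apiserver` and port 8008 are assumptions, so substitute whatever your deployment uses; `debug.ping` is the endpoint that shows up in the apiserver log below.
```
# Minimal connectivity probe, meant to be run from inside the agent pod.
# The service name and port are assumptions - adjust to your ClearML deployment.
import time
import requests

APISERVER_URL = "http://clearml-apiserver:8008/debug.ping"  # assumed svc name/port

while True:
    try:
        resp = requests.get(APISERVER_URL, timeout=5)
        # Any HTTP status (even 4xx) means the network path to the pod works
        print(f"{time.strftime('%H:%M:%S')} reachable, status={resp.status_code}")
    except requests.RequestException as exc:
        # Timeouts / connection resets here point at networking, not the apiserver itself
        print(f"{time.strftime('%H:%M:%S')} UNREACHABLE: {exc}")
    time.sleep(5)
```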
Ok, was able to get a crash and log some output from the apiserver:
[2022-08-11 09:21:13,727] [11] [INFO] [clearml.service_repo] Returned 200 for tasks.stopped in 17ms
[2022-08-11 09:21:13,829] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 11ms
[2022-08-11 09:21:13,871] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 8ms
[2022-08-11 09:21:13,986] [11] [WARNING] [clearml.service_repo] Returned 400 for queues.get_by_id in 4ms, msg=Invalid queue id: id=feature_pipelines, company=d1bd92a3b039400cbafc60a7a5b1e52b
[2022-08-11 09:21:14,217] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 10ms
[2022-08-11 09:21:14,491] [11] [INFO] [clearml.service_repo] Returned 200 for tasks.enqueue in 21ms
[2022-08-11 09:21:15,125] [11] [INFO] [clearml.service_repo] Returned 200 for debug.ping in 0ms
[2022-08-11 09:21:15,128] [11] [INFO] [clearml.service_repo] Returned 200 for debug.ping in 0ms
[2022-08-11 09:21:15,677] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms
[2022-08-11 09:21:15,754] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 9ms
[2022-08-11 09:21:17,728] [11] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 22ms
[2022-08-11 09:21:18,845] [11] [WARNING] [clearml.service_repo] Returned 400 for in 0ms, msg=Invalid request path /
[2022-08-11 09:21:18,847] [11] [WARNING] [clearml.service_repo] Returned 400 for in 0ms, msg=Invalid request path /
[2022-08-11 09:21:18,854] [11] [WARNING] [clearml.service_repo] Returned 400 for in 0ms, msg=Invalid request path /
[2022-08-11 09:21:19,152] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 9ms
[2022-08-11 09:21:19,158] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 10ms
[2022-08-11 09:21:20,212] [11] [INFO] [clearml.service_repo] Returned 200 for workers.status_report in 4ms
[2022-08-11 09:21:20,277] [11] [WARNING] [clearml.service_repo] Returned 400 for queues.create in 6ms, msg=Value combination already exists (unique field already contains this value): name=df6a44f0f80648a3a2edc0a970944ba7, company=d1bd92a3b039400cbafc60a7a5b1e52b
We also have CloudWatch configured, so I could probably run some searches there if I knew what to look for.
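If the apiserver container logs end up in CloudWatch, a filtered query over the crash window can surface just the warnings. A rough sketch with boto3; the log group name is a placeholder and the filter pattern is only one suggestion of what to look for:
```
# Sketch: pull apiserver 4xx warnings from CloudWatch Logs around a known time window.
# The log group name is a placeholder - use whatever your EKS log shipper writes to.
from datetime import datetime, timezone
import boto3

logs = boto3.client("logs")

def to_ms(dt):
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

resp = logs.filter_log_events(
    logGroupName="/aws/containerinsights/my-cluster/application",  # placeholder
    filterPattern='"Returned 4"',  # matches the 4xx warnings from clearml.service_repo
    startTime=to_ms(datetime(2022, 8, 11, 9, 15)),
    endTime=to_ms(datetime(2022, 8, 11, 9, 30)),
)
for event in resp["events"]:
    print(event["timestamp"], event["message"].rstrip())
```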
The queue 'feature_pipelines' should exist, and the latter queue is something the agents sometimes want to create for some reason (though it should not be required?).
The latter warning is OK, I guess.
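For the first warning (the invalid queue id), it may be worth confirming what the server actually has registered under that name. A small sketch using the ClearML APIClient, assuming API credentials are already configured in clearml.conf:
```
# Sketch: look up the 'feature_pipelines' queue server-side and print its id.
# Assumes valid API credentials in clearml.conf (or the CLEARML_API_* env vars).
from clearml.backend_api.session.client import APIClient

client = APIClient()
queues = client.queues.get_all(name="feature_pipelines")
if not queues:
    print("No queue named 'feature_pipelines' visible to these credentials")
for q in queues:
    print(f"id={q.id} name={q.name}")
```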
This is clearly a network issue; first I'd check that there are no restarts of the apiserver during that timespan. It's not easy to debug since it looks random, but it could be worth reviewing the k8s networking configuration overall, just to be sure.
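One way to rule out apiserver restarts around those timestamps is to check the pod restart counts, e.g. with the official kubernetes Python client. A sketch; the namespace and label selector are assumptions, so match them to your chart:
```
# Sketch: list apiserver pods with their container restart counts and last start times.
# Namespace and label selector are assumptions - match them to your ClearML deployment.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="clearml",
    label_selector="app.kubernetes.io/name=clearml-apiserver",
)
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        started = cs.state.running.started_at if cs.state.running else None
        print(f"{pod.metadata.name}/{cs.name}: restarts={cs.restart_count}, running since {started}")
```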
On AWS EKS with:
Image: allegroai/clearml-agent-k8s-base
clearml-agent version: 1.2.4rc3
python: 3.6.9