Hey SuccessfulKoala55,
Well I guess the steps I took were a bit messy in hindsight:
1. Remove the 0.17 cluster while keeping storage (and making a backup of the storage to be safe)
2. Install the 1.0.2 helm chart with our existing storage
3. Experience the aforementioned issue with some experiment data on the web UI
4. Hunt this Slack for similar issues
5. Check that the ES shards are still running - they are / seem to be at least
6. Uninstall the 1.0.2 helm chart
7. Install the same chart but with the tag manually changed... (roughly as in the sketch below)
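A minimal sketch of what those last two steps can look like, assuming a Helm 3 setup with the release and namespace both named clearml; <chart-ref>, the values file, and the image-tag value path are placeholders/assumptions that depend on the clearml-helm-charts version in use:

```
# Remove the 1.0.2 release (the persistent storage was kept and backed up separately,
# as described in the steps above)
helm uninstall clearml -n clearml

# Re-install the same chart, manually overriding the server image tag
# (the exact values key is an assumption - check the chart's values.yaml)
helm install clearml <chart-ref> -n clearml \
  -f values.yaml \
  --set apiserver.image.tag=<tag>
```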
I should note that our team decided to keep things as is: being blocked on this upgrade is more of a problem than losing ~20% of our data, which is mostly not mission-critical. Of course, if there is some way to fix this without having to roll back, we're still interested.
OK, I tried to clear MongoDB and then restore it from the backup I made before the migration. After that, I launched ClearML version 1.0.0 and I get the same issue (again, the one where some experiments show the error, but not all).
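A minimal sketch of what such a clear-and-restore can look like, assuming the backup is a mongodump --archive file, MongoDB runs as a pod in the clearml namespace, and the pod, archive, and chart names are placeholders:

```
# Drop the existing collections and restore the pre-migration dump in one pass
# (--drop removes each collection before restoring it; --archive with no filename reads from stdin)
kubectl exec -i -n clearml <mongodb-pod-name> -- \
  mongorestore --archive --drop < clearml-mongo-pre-upgrade.archive

# Then bring the 1.0.0 server back up against the restored storage
# (<chart-ref> and the chart version matching server 1.0.0 are placeholders)
helm install clearml <chart-ref> --version <chart-version-for-1.0.0> -n clearml -f values.yaml
```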
This is the log I get by running kubectl logs <api-server-pod-name> - not sure if that's the "proper" way to get it, but here are the last 30 lines or so after clicking a few experiments:
```
[2021-06-28 13:58:03,358] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 1072ms
[2021-06-28 13:58:04,079] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 8ms
[2021-06-28 13:58:04,173] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_t...
```
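As a side note, kubectl can limit the output directly instead of scrolling the full log; the pod name and namespace are placeholders here, but --tail, -f and --since are standard kubectl flags:

```
# Grab only the last ~30 apiserver lines
kubectl logs -n clearml <api-server-pod-name> --tail=30

# Or follow the log live while clicking experiments in the web UI
kubectl logs -n clearml <api-server-pod-name> -f --since=5m
```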
I used mongodump and subsequently mongorestore.
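For completeness, a minimal sketch of what that dump step can look like, assuming mongodump runs inside the MongoDB pod and the archive is streamed to a local file (pod and file names are placeholders):

```
# Stream a full dump of the ClearML MongoDB to a local archive file
# (--archive with no filename writes the dump to stdout)
kubectl exec -n clearml <mongodb-pod-name> -- \
  mongodump --archive > clearml-mongo-pre-upgrade.archive
```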
alright, here is the complete log
We may just give up and live with our lost data; I'll come back to this thread if we change our minds and/or if we find something.