
Hello, I am also having issues when migrating our ClearML install, specifically from 0.17 to 1.0.0. (I was doing it for 1.0.2 at first, but I've seen in some messages that to avoid issues we should first upgrade to 1.0.0, though it didn't change anything in my case.)

Once ClearML is updated, about 25% of our experiments show an error when we click on them in the Web UI:
Error 101 : Inconsistent data encountered in document: document=Output, field=model

But importantly, many experiments show up fine with no error message.

We are running ClearML on Azure with k8s. For 0.17 we used the lightly modified official cloud k8s manifests from ClearML, and for 1.0.0 a lightly modified cloud-ready Helm chart (both are the official cloud k8s install methods of their time).

If I roll back to our 0.17 install while keeping all of our storage, all experiments display fine, so I don't think it's because of corrupted or missing data.

Is there anything I should do before upgrading to 1.0.0?

Posted 2 years ago

14 Answers


Alright, here is the complete log.

Posted 2 years ago

I used mongodump and subsequently mongorestore.
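
Roughly like this, with the pod name and paths as placeholders (our exact invocation differed a bit):

# dump all databases from the ClearML mongo pod and copy the dump out as a local backup
kubectl exec <mongo-pod-name> -- mongodump --out /tmp/mongodump
kubectl cp <mongo-pod-name>:/tmp/mongodump ./mongodump-backup

# restore: copy the dump back in and replay it, dropping the current collections first
kubectl cp ./mongodump-backup <mongo-pod-name>:/tmp/mongodump
kubectl exec <mongo-pod-name> -- mongorestore --drop /tmp/mongodump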

Posted 2 years ago

Also, how do you restore the database exactly?

Posted 2 years ago

OK

Posted 2 years ago

Hey SuccessfulKoala55,
Well, I guess the steps I took were a bit messy in hindsight:
1. Remove the 0.17 cluster while keeping storage (and making a backup of the storage to be safe)
2. Install the 1.0.2 Helm chart with our existing storage
3. Experience the aforementioned issue with some experiment data on the Web UI
4. Hunt this Slack for similar issues
5. Check that the ES shards are still running (they are, or at least seem to be; I've pasted the check I used below)
6. Uninstall the 1.0.2 Helm chart
7. Install the same chart, but with the tag manually changed to 1.0.0 for the ClearML images
8. Problem persists
9. Roll back to 0.17 with no obvious / apparent issues AFAIK
Is there any way to force ClearML to run this migration step manually?
Thanks!
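
For reference, the shard check in step 5 was basically this (the Elasticsearch service name is a placeholder for whatever your chart creates):

# reach ES from outside the cluster
kubectl port-forward svc/<elasticsearch-service> 9200:9200 &

# overall cluster status: anything but "red" means all primary shards are up
curl -s "http://localhost:9200/_cluster/health?pretty"

# per-shard view: look for shards stuck in UNASSIGNED state
curl -s "http://localhost:9200/_cat/shards?v"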

Posted 2 years ago

Hi CleanSheep2,
It seems like the automatic migration might have failed when running the 1.0.0 server for the first time.
Can you explain exactly what steps you took to upgrade?

Posted 2 years ago

OK, I tried clearing MongoDB, then restoring it from the backup I made before the migration. After that, I launched ClearML version 1.0.0 and got the same issue (again, the one where some experiments show the error, but not all).

Posted 2 years ago

Well, the main point is that in order for the new server to correctly migrate the data, it must be started for the first time with the data from the previous version present in MongoDB.
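
On k8s, that boils down to something like the following sketch (deployment and pod names here are just examples, adjust to your chart):

# stop the apiserver so nothing writes to the DB during the restore
kubectl scale deployment <clearml-apiserver-deployment> --replicas=0

# put the 0.17 dump back into mongo
kubectl cp ./mongodump-backup <mongo-pod-name>:/tmp/mongodump
kubectl exec <mongo-pod-name> -- mongorestore --drop /tmp/mongodump

# the first boot of the 1.0.0 apiserver should now detect the old data and migrate it
kubectl scale deployment <clearml-apiserver-deployment> --replicas=1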

Posted 2 years ago

This is the log I get by using kubectl logs <api-server-pod-name> (not sure if that's the "proper" way to get it); here are the last 30 lines or so after clicking a few experiments:

[2021-06-28 13:58:03,358] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 1072ms
[2021-06-28 13:58:04,079] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 8ms
[2021-06-28 13:58:04,173] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_task_tags in 1ms
[2021-06-28 13:58:04,182] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_task_parents in 6ms
[2021-06-28 13:58:04,184] [8] [INFO] [clearml.service_repo] Returned 200 for users.get_all_ex in 18ms
[2021-06-28 13:58:04,218] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_types in 4ms
[2021-06-28 13:58:04,227] [8] [WARNING] [clearml.service_repo] Returned 400 for tasks.get_all_ex in 0ms, msg=Validation error (invalid task field): path=hyperparams.General.n_estimators.value
[2021-06-28 13:58:04,262] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms
[2021-06-28 13:58:05,461] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 1193ms
[2021-06-28 13:58:06,205] [8] [INFO] [clearml.service_repo] Returned 200 for debug.ping in 0ms
[2021-06-28 13:58:06,544] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 10ms
[2021-06-28 13:58:06,566] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 28ms
[2021-06-28 13:58:07,230] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 10ms
[2021-06-28 13:58:07,230] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 6ms
[2021-06-28 13:58:07,249] [8] [INFO] [clearml.service_repo] Returned 200 for users.get_all_ex in 7ms
[2021-06-28 13:58:07,257] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_task_parents in 3ms
[2021-06-28 13:58:07,262] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_task_tags in 1ms
[2021-06-28 13:58:07,270] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_types in 3ms
[2021-06-28 13:58:07,315] [8] [WARNING] [clearml.service_repo] Returned 400 for tasks.get_all_ex in 1ms, msg=Validation error (invalid task field): path=hyperparams.General.n_estimators.value
[2021-06-28 13:58:07,917] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 4ms
[2021-06-28 13:58:07,934] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 15ms
[2021-06-28 13:58:08,429] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 3ms
[2021-06-28 13:58:09,057] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 11ms
[2021-06-28 13:58:09,078] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 26ms
[2021-06-28 13:58:09,541] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 9ms
[2021-06-28 13:58:09,550] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 23ms
[2021-06-28 13:58:09,561] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 1052ms
[2021-06-28 13:58:10,096] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 11ms
[2021-06-28 13:58:10,103] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 28ms
[2021-06-28 13:58:10,476] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms
[2021-06-28 13:58:11,014] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 28ms
[2021-06-28 13:58:11,027] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 37ms
[2021-06-28 13:58:11,161] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 14ms
[2021-06-28 13:58:11,172] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 5ms
[2021-06-28 13:58:11,176] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_task_parents in 6ms
[2021-06-28 13:58:11,179] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_types in 4ms
[2021-06-28 13:58:11,186] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_task_tags in 2ms
[2021-06-28 13:58:11,200] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 12ms
[2021-06-28 13:58:11,201] [8] [INFO] [clearml.service_repo] Returned 200 for users.get_all_ex in 21ms
[2021-06-28 13:58:11,623] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 1140ms
[2021-06-28 13:58:12,126] [8] [ERROR] [clearml.service_repo] Returned 500 for tasks.get_by_id_ex in 10ms, msg=Inconsistent data encountered in document: document=Output, field=model
[2021-06-28 13:58:12,129] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.get_by_id_ex in 6ms
[2021-06-28 13:58:12,337] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 18ms
[2021-06-28 13:58:12,369] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 17ms
[2021-06-28 13:58:13,347] [8] [INFO] [clearml.service_repo] Returned 200 for debug.ping in 0ms
[2021-06-28 13:58:14,656] [8] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 3ms
[2021-06-28 13:58:15,025] [8] [INFO] [clearml.service_repo] Returned 200 for projects.get_all_ex in 9ms

Posted 2 years ago

Well, I actually need to see the start of the log to find out how the server booted up 🙂
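
For example (same pod name placeholder as before):

# start of the current container's log, rather than the tail
kubectl logs <api-server-pod-name> | head -n 100

# if the container restarted at some point, the boot log is in the previous instance
kubectl logs <api-server-pod-name> --previous | head -n 100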

Posted 2 years ago

Can you share the apiserver log?

Posted 2 years ago

[2021-06-28 13:32:18,683] [8] [INFO] [clearml.apiserver.mongo.initialize.migration] Started mongodb migrations
[2021-06-28 13:32:18,689] [8] [INFO] [clearml.apiserver.mongo.initialize.migration] Finished mongodb migrations

This basically means the server did not detect a database belonging to a previous version... To properly debug this, you'll need to repeat the process of restoring the database from backup and starting 1.0.0 with it, and send me the complete log from that point 🙂
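
Something along these lines should capture the whole thing from boot (pod name is an example):

# follow the apiserver log from startup and keep a copy to share
kubectl logs -f <api-server-pod-name> | tee apiserver-boot.log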

Posted 2 years ago

I should note that our team has decided to keep things as they are, since being blocked on this upgrade is more problematic than losing ~20% of our data, which is mostly non-mission-critical. Of course, if there is some way to fix this without having to roll back, we're still interested.

Posted 2 years ago

We may just give up and live with our lost data. I'll come back to this thread if we change our minds and/or if we find something.

Posted 2 years ago