Yes they are. With mongo I had a problem with azurefiles: mongo could not initialize when azurefiles was mounted under /data/db. The solution was to mount azurefiles under a different path and then specify a command for mongo with the path to the data so that it could initialize properly. However, when I deleted the kubernetes cluster, created a new one and redeployed clearml, the issues no longer came from mongo but from the apiserver, which was failing on migrations. Then I noticed the problem with elasticsearch, which also had azurefiles mounted as a volume. So I am wondering whether my errors are specific to my setup and there should be no problems with reinitializing the clearml environment using the data from the previous environment, or is there something obvious that I am not following 😉
Are your backups from the same ClearML server version?
As for the clearml server version by latest tag I meant v 0.17.0
Hey SuccessfulKoala55 Thank you for your answers, I really appreciate it. As for elasticsearch, it was indeed the index error: the index had been created before. The reason is that I was trying to set up a backup for elasticsearch and mongodb using azurefiles. So the scenario is that I'm using persistent volumes on k8s backed by azure file shares as storage. Then I can rebuild my cluster and use the exact same storage, so that the data is persistent and I can restore my application from the last state. I noticed a while ago that when trying to set up clearml from scratch using the backup data, the apiserver failed on mongodb migrations. Now, after the issues with elasticsearch, I know it also concerns elasticsearch. Therefore my question is: is there any way to reinitialize clearml using the backup data for the services? Thank you in advance
Shouldn't I be using the latest tag for clearml?
What do you mean by the latest tag? What deployment are you using exactly?
I had some problems previously when changing something in apiserver forced me to redeploy everything in order for clearml to work properly. And I am wondering whether you have maybe some guidelines for that.
Can you elaborate on that? Basically, the apiserver can be restarted while other components (including mongodb, elastic and redis) stay up
If this is indeed performed by the server, the issue is most likely a worker_stats index that was somehow created before the mapping could be applied - the quickest solution is to manually delete the index.
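For anyone following along, manually deleting the index can be sketched with the standard Elasticsearch REST API (GET /_cat/indices to list, DELETE /<index> to remove). This is only a minimal sketch: the helper names are made up for illustration, the ES URL is an assumption taken from the host and port in the log sample in this thread, and the exact index name is not given here, so list the indices first and copy the name from there. Deleting an index discards its data.

```python
import json
import urllib.request


def list_indices_request(es_url):
    """Build a GET request that lists all indices as JSON (_cat/indices API)."""
    return urllib.request.Request(
        f"{es_url.rstrip('/')}/_cat/indices?format=json", method="GET"
    )


def delete_index_request(es_url, index_name):
    """Build a DELETE request for exactly one index.

    Wildcards, comma-separated lists and names starting with '_' are refused,
    so a typo cannot remove more than the single index that was intended.
    """
    if (
        not index_name
        or index_name.startswith("_")
        or any(ch in index_name for ch in "*,/ ")
    ):
        raise ValueError(f"refusing suspicious index name: {index_name!r}")
    return urllib.request.Request(
        f"{es_url.rstrip('/')}/{index_name}", method="DELETE"
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be something like send(list_indices_request("http://127.0.0.1:9200")) to see the index names, then send(delete_index_request("http://127.0.0.1:9200", name)) with the name copied from that listing, followed by an apiserver restart so it can apply the mappings again.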
To check the server does try to perform this initialization on startup, check the apiserver pod/container log for log lines similar to this:
[2021-04-23 12:44:35,493] [31795] [INFO] [trains.initialize] Applying mappings to ES host: [ConfigTree([('host', '127.0.0.1'), ('port', 9200)])]
[2021-04-23 12:44:36,193] [31795] [INFO] [trains.initialize] [{'mapping': 'events', 'result': {'acknowledged': True}}, {'mapping': 'events_training_debug_image', 'result': {'acknowledged': True}}, {'mapping': 'events_plot', 'result': {'acknowledged': True}}, {'mapping': 'events_log', 'result': {'acknowledged': True}}]
[2021-04-23 12:44:36,193] [31795] [INFO] [trains.initialize] Applying mappings to ES host: [ConfigTree([('host', '127.0.0.1'), ('port', 9200)])]
[2021-04-23 12:44:36,805] [31795] [INFO] [trains.initialize] [{'mapping': 'queue_metrics', 'result': {'acknowledged': True}}, {'mapping': 'worker_stats', 'result': {'acknowledged': True}}]
Note: this should be around the start of the log
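If the log is long, a small script can pick out the mapping results instead of reading it by eye. This is a rough sketch that assumes the log lines have the same shape as the sample quoted above; the function names are hypothetical, and the log text would need to be saved from the apiserver pod/container first.

```python
import re

# Matches entries such as:
#   {'mapping': 'worker_stats', 'result': {'acknowledged': True}}
# The pattern is derived only from the sample log lines quoted in this thread.
MAPPING_RESULT = re.compile(
    r"\{'mapping': '([^']+)', 'result': \{'acknowledged': (True|False)\}\}"
)


def applied_mappings(log_text):
    """Return {mapping_name: acknowledged} for every mapping result in the log."""
    return {name: ack == "True" for name, ack in MAPPING_RESULT.findall(log_text)}


def mapping_was_applied(log_text, mapping="worker_stats"):
    """True only if the log reports the given mapping as acknowledged."""
    return applied_mappings(log_text).get(mapping, False)
```

If mapping_was_applied(log_text) comes back False for a log that covers the server startup, that would suggest the server skipped or failed the initialization described here.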
Hi again GreasyPenguin66 🙂
For some reason, it looks like the mapping for the Elastic index containing the worker (agent) statistics was not initialized correctly - this normally happens automatically when the ClearML server starts up. The server might skip this auto-initialization if it suspects the ES data originates from an un-migrated pre-v16 Trains Server deployment (I'm not sure this is the case here)
Another question also came to my mind. When changing any deployment for clearml, like apiserver or mongo or elasticsearch etc., do I have to redeploy everything from scratch? I had some problems previously when changing something in apiserver forced me to redeploy everything in order for clearml to work properly. And I am wondering whether you have maybe some guidelines for that.
Unfortunately the problem was resolved neither by changing the vm memory settings back to 2 gb nor by going back from azurefiles persistent volumes to hostPath. Seems odd, as I did not have any of these issues before. I thought it might come from the changes in PV and elasticsearch settings, but going back to the original settings did not resolve the issue. Shouldn't I be using the latest tag for clearml?
Hi SuccessfulKoala55 Thanks for the response. For elastic I am using the image http://docker.elastic.co/elasticsearch/elasticsearch:7.6.2, the one that is in the manifests in the clearml repo. As for the clearml images, I am using the latest tags everywhere. Let me restore the vm settings for elastic and I'll let you know ;)
Also, what clearml-server version are you running, and what is the Elastic image version?
It might be caused by a memory issue - I'd suggest restoring the VM memory setting, just to make sure it's not the cause of the issue
Hi GreasyPenguin66 , this looks like some sort of a mapping issue in Elastic...