Answered
Hi, We Have Clearml Version Webapp: 1.3.0-165 • Server: 1.3.0-165 • Api: 2.17 Suddenly All Experiments Were Removed. After Checking Logs We See That Before 6 Days Elasticsearch Container Was Terminated Due To Out Of Memory And New One Was Created Instead

Hi,
We have ClearML version WebApp: 1.3.0-165 • Server: 1.3.0-165 • API: 2.17

Suddenly all experiments were removed.
After checking the logs, we see that 6 days ago the Elasticsearch container was terminated due to Out Of Memory and a new one was created in its place.
In the ClearML API logs from yesterday we see the following:
[root@ip-10-156-91-102 ~]# grep delete clearml_api.out
[2022-11-21 17:31:41,509] [8] [WARNING] [elasticsearch] POST [status:N/A request:60.057s]
[2022-11-21 17:37:03,230] [8] [INFO] [clearml.service_repo] Returned 200 for tasks.delete_many in 53657ms
Do you know why the data was lost, and is there any restore mechanism?

  
  
Posted 2 years ago

Answers 10


Hi LackadaisicalHedgehong78 . It seems that someone/something sent a command to delete a bunch of tasks. Do you have backups?

  
  
Posted 2 years ago

Elastic only holds part of the task data. Mongo is what actually stores the task objects. Can you look inside to see what's there?

  
  
Posted 2 years ago

Please check the task__trash collection in the mongodb backend database. If you find all your tasks there, then someone did indeed delete them.

  
  
Posted 2 years ago

Hi CostlyOstrich36
We do indeed see tasks in the task__trash collection in the mongodb backend database.
Is there any way to restore them?

Also, can we see in the logs who triggered the command?

  
  
Posted 2 years ago

You can restore these tasks by copying or moving them from task__trash into the task collection, but the events for these tasks cannot be restored. As for the user who deleted them: unfortunately ClearML does not record this info in Mongo, and without logging to ES there is no place to retrieve it from (I can suggest using Kibana to monitor ES). You can try inspecting the mongo collection url_to_delete. It contains all the links from the deleted tasks that should be removed from the fileserver. If you see any documents there that correspond to files from the deleted tasks, then the user recorded in these documents is the one who performed the delete.
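The copy step described above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not an official ClearML procedure: the function name `restore_from_trash` is hypothetical, and the two arguments are assumed to behave like pymongo `Collection` objects (only `find` and `replace_one` are used). The connection URL in the commented usage is also an assumption; the database name `backend` is taken from the thread.

```python
def restore_from_trash(trash, tasks, task_ids=None):
    """Copy documents from the trash collection back into the task collection.

    `trash` and `tasks` are assumed to be pymongo-style collections.
    If `task_ids` is given, only those documents are copied.
    Returns the number of documents copied. Events stored in
    Elasticsearch are not brought back by this.
    """
    query = {} if task_ids is None else {"_id": {"$in": list(task_ids)}}
    restored = 0
    for doc in trash.find(query):
        # Upsert so that re-running the restore does not fail on duplicates.
        tasks.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        restored += 1
    return restored


# Hypothetical usage (needs pymongo and access to the server's MongoDB):
# from pymongo import MongoClient
# db = MongoClient("mongodb://localhost:27017")["backend"]
# print(restore_from_trash(db["task__trash"], db["task"]))
```

Copying rather than moving leaves the trash collection untouched, so the operation can be repeated or checked afterwards. Taking a database backup before trying anything like this on a production server is advisable.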

  
  
Posted 2 years ago

Thanks CostlyOstrich36
Actually, I was able to find the IP of the machine that triggered the API call in the web logs, and from it the user who ran the delete action.
The user tried to remove only the archived experiments in his project (he tried several times and got some errors), and that is what we see in the API call. Somehow ClearML removed all experiments on the server 🤔
Any idea why this might have happened if the user only ran "delete archived experiments of his project" in the web UI?
xx.xxx.xxx.xx - - [21/Nov/2022:17:39:14 +0000] "POST /api/v2.17/tasks.delete_many HTTP/1.1" 499 0 " " "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36" "-"
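For anyone repeating this search, a small Python sketch of scanning a web access log for calls to a given endpoint is shown here. It assumes the log lines follow the same layout as the line quoted above (client IP, timestamp in brackets, quoted request, status code); the names `LOG_LINE` and `find_calls` are hypothetical, and the log path in the commented usage is a placeholder.

```python
import re

# Matches lines shaped like the access-log entry quoted in this thread.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def find_calls(lines, endpoint="tasks.delete_many"):
    """Return (ip, timestamp, status) for every request to `endpoint`."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path").endswith("/" + endpoint):
            hits.append((m.group("ip"), m.group("time"), int(m.group("status"))))
    return hits


# Hypothetical usage; the file name is a placeholder:
# with open("access.log") as fh:
#     for ip, when, status in find_calls(fh):
#         print(ip, when, status)
```

If the web log comes from nginx, status 499 means the client closed the connection before the server replied. That would fit a delete request that kept running on the server for close to a minute.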

  
  
Posted 2 years ago

We tried the copy on a test machine and it worked (delete and then restore tasks in the DB, and they appear again in the UI).
When we did the same on prod, nothing happened.
We also see that the data in the fileserver has not been removed and Mongo shows 17000 tasks, so it looks like the tasks were removed from the UI but still exist in MongoDB and on the local file system.
CostlyOstrich36
`> db.task.count()
17262
> db.task__trash.count()
383
> db.task__trash__trash.count()
15
> db.task__trash.aggregate([ {$merge: "task"}])
> db.task__trash__trash.count()
15
> db.task__trash.count()
385
> db.task.count()
17647`

  
  
Posted 2 years ago

When you go to the UI, open developer tools (F12) and navigate to 'all experiments'. Check what is called and what is returned for tasks.get_all_ex.

  
  
Posted 2 years ago

CostlyOstrich36
No errors in developer tools, and the result code is 200:
{"meta":{"id":"f131cde7b77545a5b4802e73f1b5e78e","trx":"f131cde7b77545a5b4802e73f1b5e78e","endpoint":{"name":"tasks.get_all_ex","requested_version":"2.17","actual_version":"1.0"},"result_code":200,"result_subcode":0,"result_msg":"OK","error_stack":"","error_data":{}},"data":{"tasks":[{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"4b2fcf54203e4930b7a9a7b511e31ca3","last_change":"2022-11-21T20:40:53.107000+00:00","last_iteration":274900,"last_update":"2022-11-21T20:40:53.107000+00:00","name":"mae_arch_masked_75_random_fix-eval","project":{"id":"316e67462401437e9f17971564a3e5e2","name":"beit_v2/ablation"},"started":"2022-11-21T17:50:09.835000+00:00","status":"completed","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"6badc9f25ae44e1ea50c3907c54304f9","last_change":"2022-11-21T17:57:09.176000+00:00","last_iteration":10,"last_update":"2022-11-21T17:57:09.176000+00:00","name":"bb_fitter_cpu_21","project":{"id":"ed78ce96b0654383be5de08b4a49a437","name":"box_fitter"},"started":"2022-11-21T17:51:00.789000+00:00","status":"stopped","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"3ac0f27555fe4c0fb90d608d88dd781f","last_change":"2022-11-22T07:43:08.636000+00:00","last_iteration":27,"last_update":"2022-11-22T07:43:08.636000+00:00","name":"cp-141-filter-below-32-pts-evaluate","project":{"id":"aa617227670a4c65b314def279557ddd","name":"fusion-ptk/centerpoint"},"started":"2022-11-21T18:01:53.991000+00:00","status":"completed","system_tags":["development"],"tags":["adamkapl"],"type":"testing","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default 
user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"c0b950b59d8e430d95a67c3e515bcbe9","last_change":"2022-11-21T20:08:04.351000+00:00","last_iteration":746,"last_update":"2022-11-21T20:08:04.351000+00:00","name":"LRG-hm-only-10epochs-7frames-evaluate","project":{"id":"60ff76d2a0b44527ab54778aae0125cc","name":"fusion-ptk/LidarRoadGeometryGen2"},"started":"2022-11-21T18:19:06.004000+00:00","status":"completed","system_tags":["development"],"tags":["aryehn"],"type":"testing","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"952d80049c5a49e6bd9abb5e422db034","last_change":"2022-11-21T20:29:23.826000+00:00","last_iteration":3,"last_update":"2022-11-21T20:29:23.826000+00:00","name":"bb_fitter_cpu_11","project":{"id":"ed78ce96b0654383be5de08b4a49a437","name":"box_fitter"},"started":"2022-11-21T18:21:46.538000+00:00","status":"stopped","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"b274410fe2df4999acceb9a4552512af","last_change":"2022-11-22T01:32:54.475000+00:00","last_iteration":26480,"last_update":"2022-11-22T01:32:54.475000+00:00","name":"cp-142-fltr-post-agmnt-front","project":{"id":"aa617227670a4c65b314def279557ddd","name":"fusion-ptk/centerpoint"},"started":"2022-11-21T18:31:39.042000+00:00","status":"completed","system_tags":["development"],"tags":["adamkapl"],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default 
user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"a41e7e90f4f947febae4e465b7c573b4","last_change":"2022-11-22T08:56:25.553000+00:00","last_iteration":145,"last_update":"2022-11-22T08:56:25.553000+00:00","name":"bb_fitter_cpu1_2","project":{"id":"ed78ce96b0654383be5de08b4a49a437","name":"box_fitter"},"started":"2022-11-21T18:32:50.456000+00:00","status":"stopped","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"45040a9e5ad34dc7a19efe5261af3555","last_change":"2022-11-23T13:05:53.042000+00:00","last_iteration":240000,"last_update":"2022-11-23T13:05:53.042000+00:00","name":"cloud-head-no-dups-mask","project":{"id":"6249944a4b6a401b8c5b429ce6e49232","name":"mae/tsr-downstream-classification"},"started":"2022-11-21T18:33:00.953000+00:00","status":"stopped","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"60961a23b45c4dbe81ff693d53cc7873","last_change":"2022-11-22T01:09:12.390000+00:00","last_iteration":26000,"last_update":"2022-11-22T01:09:12.390000+00:00","name":"cp-143-fltr-post-agmnt-right","project":{"id":"aa617227670a4c65b314def279557ddd","name":"fusion-ptk/centerpoint"},"started":"2022-11-21T18:34:04.889000+00:00","status":"completed","system_tags":["development"],"tags":["adamkapl"],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default 
user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"344cee521d854601a2d23a07d7c5af3c","last_change":"2022-11-23T11:29:34.993000+00:00","last_iteration":250000,"last_update":"2022-11-23T11:29:34.993000+00:00","name":"cloud-head-no-dups-w-mean-class","project":{"id":"6249944a4b6a401b8c5b429ce6e49232","name":"mae/tsr-downstream-classification"},"started":"2022-11-21T18:36:50.327000+00:00","status":"stopped","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"f926d65de1384aef89c3f25f8baa6fd0","last_change":"2022-11-23T14:29:36.020000+00:00","last_iteration":250000,"last_update":"2022-11-23T14:29:36.020000+00:00","name":"cloud-head-no-dups-w-mean-feature","project":{"id":"6249944a4b6a401b8c5b429ce6e49232","name":"mae/tsr-downstream-classification"},"started":"2022-11-21T18:42:55.661000+00:00","status":"stopped","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"00bf5ba428ea4e9db2d773fb26ecab13","last_change":"2022-11-22T01:45:16.395000+00:00","last_iteration":39,"last_update":"2022-11-22T01:45:16.395000+00:00","name":"cp-142-fltr-post-agmnt-front-evaluate","project":{"id":"aa617227670a4c65b314def279557ddd","name":"fusion-ptk/centerpoint"},"started":"2022-11-21T18:49:16.200000+00:00","status":"completed","system_tags":["development"],"tags":["adamkapl"],"type":"testing","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default 
user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"10eb8c3cbe7f4d188c5c30073da077d8","last_change":"2022-11-22T01:19:58.806000+00:00","last_iteration":39,"last_update":"2022-11-22T01:19:58.806000+00:00","name":"cp-143-fltr-post-agmnt-right-evaluate","project":{"id":"aa617227670a4c65b314def279557ddd","name":"fusion-ptk/centerpoint"},"started":"2022-11-21T18:51:21.202000+00:00","status":"completed","system_tags":["development"],"tags":["adamkapl"],"type":"testing","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"367b275d60b646aa8781e1f421c4eec8","last_change":"2022-11-22T13:28:26.071000+00:00","last_iteration":107300,"last_update":"2022-11-22T13:28:26.071000+00:00","name":"bf_main_aug_wd001_5a","project":{"id":"ed78ce96b0654383be5de08b4a49a437","name":"box_fitter"},"started":"2022-11-21T19:43:48.115000+00:00","status":"completed","system_tags":["development"],"tags":[],"type":"training","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}},{"company":{"id":"d1bd92a3b039400cbafc60a7a5b1e52b"},"id":"54fde807ef204a1a87b1825fc6b7f91e","last_change":"2022-11-23T16:14:36.763000+00:00","last_iteration":2,"last_update":"2022-11-23T16:14:36.763000+00:00","name":"cp-128-fix-negloss","project":{"id":"aa617227670a4c65b314def279557ddd","name":"fusion-ptk/centerpoint"},"started":"2022-11-21T20:37:05.014000+00:00","status":"stopped","system_tags":["development"],"tags":["nivk","shazut"],"type":"testing","user":{"id":"0283ba889ae7105ba1db4e91bdada228","name":"Trains default user"}}],"scroll_id":"ac0595892c3f4ca59c02864df56f3adf"}}
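A response like the one above is hard to read by eye. As a rough aid, the Python sketch below summarizes a saved tasks.get_all_ex response body. The function name `summarize_get_all_ex` is hypothetical; the field names (`meta.result_code`, `data.tasks`, `project.name`) are taken from the response pasted in this thread, and other server versions may differ.

```python
import json
from collections import Counter


def summarize_get_all_ex(response_text):
    """Summarize a tasks.get_all_ex response body given as a JSON string."""
    payload = json.loads(response_text)
    tasks = payload.get("data", {}).get("tasks", [])
    return {
        "result_code": payload.get("meta", {}).get("result_code"),
        "task_count": len(tasks),
        # How many of the returned tasks belong to each project.
        "by_project": Counter(t.get("project", {}).get("name") for t in tasks),
    }


# Hypothetical usage with a response body saved from the browser's network tab:
# with open("get_all_ex.json") as fh:
#     print(summarize_get_all_ex(fh.read()))
```

Note that the response carries a `scroll_id`, so a single response may be only one page of results, and the count reflects that page alone.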

  
  
Posted 2 years ago

FYI CostlyOstrich36
After a ClearML restart, all experiments appear again 😃

  
  
Posted 2 years ago