Answered
Hi, We Have Recurring Disk Space Issues On Our ClearML Server (Drop Of Many GB In A Few Days). After Some Analysis, We Noted

Hi, we have recurring disk space issues on our ClearML server (a drop of many GB in a few days). After some analysis, we found /opt/clearml/data/elastic_7 to be the culprit. Our ClearML version is 1.1.1-135, 1.1.1-2.14.
Is this common? What can we do to limit it? It looks like the index and translog data under the elastic_7 folder have the worst impact so far.
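For anyone reproducing this kind of check, a quick way to see which ClearML data folder is growing (paths as in the default docker-compose layout; this is only an illustrative sketch) is:

  # Show per-subfolder disk usage of the ClearML data directory, largest last
  du -sh /opt/clearml/data/* | sort -h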

  
  
Posted 2 years ago

Answers 11


Well, some indices contain experiment data (metrics), which you can clean up by deleting (or resetting) experiments.
Other indices, which are indeed added over time, hold historical data and can be deleted.
You can start by running curl http://localhost:9200/_cat/indices?v=true to see the list of indices - you can post it here if you'd like 🙂
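For example (a sketch assuming the Elasticsearch container is reachable on the default port 9200 from the server host), the _cat API can also sort the list by on-disk size and show only the interesting columns:

  # List indices sorted by store size, largest first
  curl "http://localhost:9200/_cat/indices?v=true&s=store.size:desc&h=health,index,docs.count,store.size"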

  
  
Posted 2 years ago

SubstantialElk6 it basically depends on the amount of data you store there... There's no server-side process that should suddenly impact the ES storage. I would start by listing the ES indices and deleting any old ones that are not needed any more (for example, old queue metrics and worker stats)
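As a sketch, deleting a single old index is one curl call (the index name below is taken from the listing posted later in this thread - substitute names from your own output):

  # Remove one old monthly queue-metrics index
  curl -X DELETE "http://localhost:9200/queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-03"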

  
  
Posted 2 years ago

I would suggest first looking at the indices list and deciding. In general - if that data is related to experiments, and you do not want to delete them (which makes sense), then yes - more disk space.

  
  
Posted 2 years ago

What do you mean by a drop of many GB? Can you please elaborate on what happens exactly?

I know that Elasticsearch can sometimes suffer disk corruption and requires regular backups...

  
  
Posted 2 years ago

Thanks SuccessfulKoala55, how might I do this cleanup? Does this grow with more use of ClearML? And to add, we save all artifacts to a remote S3 server.

  
  
Posted 2 years ago

OK, thanks. This would mean that increasing the disk space for my ClearML server is the only option, as we are not at liberty to delete.

  
  
Posted 2 years ago

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 rDH57uOvTOCoRpUv53Ub2g 1 1 12288020 0 768.8mb 768.8mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 lBQrjobDSf-7peKdOX8tlw 1 1 11067622 0 681.9mb 681.9mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 2U6CliNdTiqd0VaSiTdBLQ 1 1 10634974 0 650.8mb 650.8mb
yellow open events-plot- rJWReTYsSTKpFkps1AB1qA 1 1 161 0 362.8kb 362.8kb
red    open events-log-d1bd92a3b039400cbafc60a7a5b1e52b PSIKjKrKR9OsCVJ4IFd78w 1 1
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 GlKMe1iSTa-s0L1HibtHcQ 1 1 10357452 0 630.5mb 630.5mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 2GLFKyrrR3O0ikq2eSw9Pw 1 1 6172245 0 373.5mb 373.5mb
yellow open events-log- iAbKcLsrQ1ecVlD4vfeIFg 1 1 1387 0 314.7kb 314.7kb
yellow open events-training_debug_image- ZlQoHuAfSh2nlCm00PmZEg 1 1 196 0 124.8kb 124.8kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b RJ81T9MsTZ-pNOZNisg1oQ 1 1 317444 10 4gb 4gb
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b lsf6IJ95RbasjxoLbdNAgw 1 1 106971071 2107378 13.9gb 13.9gb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-03 Vgr_NJ07RYGTDog_l1Lsaw 1 1 12914 0 676.8kb 676.8kb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-04 7DvSIRnpRguKwIAOMh7I7A 1 1 715087 0 37.4mb 37.4mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 XzYvsbNxReuikWH_aPj92A 1 1 26028975 0 1.9gb 1.9gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 8IhLc__BTDCUSamUOikGOA 1 1 20578483 0 1.4gb 1.4gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 x6rpG5c3S4uBsnYGIrlfmg 1 1 20925616 1 1.5gb 1.5gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 DO1iDZ2EQCOtBITunz1hRw 1 1 9236665 0 629mb 629mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 qxyRBfWoSTScLi_maCW-JA 1 1 22121020 0 1.6gb 1.6gb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-07 _sbQX0HAT3WHj5CCOHwjGw 1 1 5404825 0 322.8mb 322.8mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-08 7RAQ1VtDSR2H-ZmnBI0WUg 1 1 8850700 0 534.5mb 534.5mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 PGlYuKRJQSeaXYPy46KmXw 1 1 1219959 0 67.6mb 67.6mb
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b _-NjgfrjQVGt6Xu12VjF6w 1 1 933336 27243 169.4mb 169.4mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06 3s4gCWojToqKTXe1TamzAA 1 1 1531239 0 88.8mb 88.8mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-06 21l9rDA2TkuMK0vtj2YUfg 1 1 3620978 0 259.5mb 259.5mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-07 GysxIqZTRNanaprJFY_xLA 1 1 13820105 0 1gb 1gb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-09 eGmfj0rCT6aajIc5rZ88jw 1 1 12272473 0 765.7mb 765.7mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-08 QXBh1RSGTguL2_h59YLOvw 1 1 19336383 0 1.4gb 1.4gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-09 zdjlsiBmTqarhYexRLC9aQ 1 1 23450059 0 1.6gb 1.6gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-03 Lq01ZCD1QQi8ABCQNHJ4yQ 1 1 13008 0 786.6kb 786.6kb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-04 LxztXhOlTkyVb7eXOkP3bA 1 1 947917 0 61.6mb 61.6mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 Ke8b8Gd9Sy6xlw6T5vyhpA 1 1 1553156 0 99.7mb 99.7mb
yellow open events-training_stats_scalar- HSpWH1c9T52EsbnVSCYL_w 1 1 3312 0 455.9kb 455.9kb

This is the output we extracted from the Elasticsearch docker image. Can I ask which indices are safe to delete?

  
  
Posted 2 years ago

Thank you SuccessfulKoala55. Is there a flag in docker-compose that we can include so Elasticsearch stores only 2-3 months of indices and clears anything older than 3 months?

  
  
Posted 2 years ago

I'm afraid Elasticsearch doesn't have this option, but it can be handled by a small daily (or monthly) maintenance cron script using a few simple curl commands.
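A minimal sketch of such a script, assuming the monthly worker_stats_* / queue_metrics_* naming shown above, the default port 9200, and a 3-month retention chosen only for illustration (the script name, retention value, and 24-month lookback are all assumptions - adjust before use):

  #!/bin/bash
  # retention-cleanup.sh (hypothetical): delete ClearML worker/queue monitoring
  # indices older than RETENTION_MONTHS. Experiment data (events-*) is not touched.
  ES_HOST="http://localhost:9200"
  RETENTION_MONTHS=3

  for i in $(seq "$RETENTION_MONTHS" 24); do
      # Month label i months back, e.g. 2021-11 (GNU date syntax)
      month=$(date -d "-${i} month" +%Y-%m)
      for prefix in worker_stats queue_metrics; do
          # The wildcard covers the company id embedded in the index name.
          # Wildcard deletes need action.destructive_requires_name=false (the ES 7.x default).
          curl -s -X DELETE "${ES_HOST}/${prefix}_*_${month}" > /dev/null
      done
  done

Scheduling it once a day from cron (e.g. 0 3 * * * /opt/clearml/retention-cleanup.sh) keeps only the most recent months.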

  
  
Posted 2 years ago

Hi SuccessfulKoala55, can I check - is it possible to remove events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b? What does it actually store?

  
  
Posted 2 years ago

Basically, deleting worker_stats_* and queue_metrics_* is perfectly safe. I think you'll solve your space issues by deleting those 🙂
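For reference, a sketch of removing all of them at once (same assumptions as above - default port, ES 7.x allowing wildcard deletes):

  # Deletes every worker-stats and queue-metrics index in one call;
  # experiment data (events-*) is left untouched.
  curl -X DELETE "http://localhost:9200/worker_stats_*,queue_metrics_*"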

  
  
Posted 2 years ago