Any Ideas Why This Is Happening? It Was Fine Yesterday

Answered

Any ideas why this is happening? It was fine yesterday

  				
Posted 4 years ago by TenseOstrich47


Answers 14

TenseOstrich47 this sounds like a good idea.
When you have a script, please feel free to share, I think it will be useful for other users as well 🙂

  				
Posted 4 years ago by AgitatedDove14

From what I can tell, Docker is leaking disk space here: temporary files are not removed correctly, so disk usage builds up over time.
See the following for more details:
https://stackoverflow.com/questions/46672001/is-it-safe-to-clean-docker-overlay2
https://forums.docker.com/t/some-way-to-clean-up-identify-contents-of-var-lib-docker-overlay/30604
https://docs.docker.com/storage/storagedriver/overlayfs-driver/

I'm going to write a cleanup script and add it to cron. I don't believe there is an easy way around this issue, as Docker trades disk space for simplicity.
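
For anyone looking for a starting point, here is a minimal sketch of what such a cleanup script could look like in Python. It is not the script mentioned above, only an illustration: the function names, the 24h age cutoff, and the dry-run default are assumptions. It wraps the standard `docker system prune` command and deliberately leaves volumes alone.

```python
import subprocess


def build_prune_command(older_than="24h", all_images=False):
    """Build a `docker system prune` command line (hypothetical helper).

    --force skips the confirmation prompt, and the `until` filter limits
    pruning to objects created more than `older_than` ago.
    """
    cmd = ["docker", "system", "prune", "--force", "--filter", f"until={older_than}"]
    if all_images:
        # Also remove every unused image, not only dangling ones.
        cmd.append("--all")
    return cmd


def prune_docker(older_than="24h", all_images=False, dry_run=True):
    """Run the prune command, or only return it when dry_run is True."""
    cmd = build_prune_command(older_than, all_images)
    if dry_run:
        return cmd, None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return cmd, result


if __name__ == "__main__":
    # Dry run by default: print the command instead of executing it.
    command, _ = prune_docker(dry_run=True)
    print("Would run:", " ".join(command))
```

Calling `prune_docker(dry_run=False)` from a script scheduled in cron would do the actual cleanup. It is worth checking what the command would remove on your own host before automating it.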

  				
Posted 4 years ago by TenseOstrich47

After some additional inspection, it seems the issue is Docker-related.
The directory /var/lib/docker/overlay2/ takes up 7.7G and is what is causing the device storage issues.
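
If you want to reproduce this kind of measurement from a script, the sketch below sums file sizes under a directory. The helper names are made up for illustration, and the result is only an approximation: it adds up apparent file sizes, so it can differ from what `du` reports (hard links, sparse files, and mounted overlay layers are not treated specially). Reading /var/lib/docker usually requires root.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; ignore it.
                pass
    return total


def human_readable(num_bytes):
    """Format a byte count with binary units, e.g. 1536 -> '1.5K'."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


if __name__ == "__main__":
    path = "/var/lib/docker/overlay2"
    print(human_readable(directory_size_bytes(path)), path)
```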

  				
Posted 4 years ago by TenseOstrich47

ES can't use S3 for storage, and neither can MongoDB.

  				
Posted 4 years ago by SuccessfulKoala55

Thanks Jake, I will have a look. Is there a reason a lot of disk space would be used on the server instance? Is there something in the config I can change to ensure that minimal memory is used on that server, and that S3 is mostly used for storage?

  				
Posted 4 years ago by TenseOstrich47

Also here: https://allegro.ai/clearml/docs/docs/faq/faq.html#elastic_watermark

  				
Posted 4 years ago by SuccessfulKoala55

TenseOstrich47 see here: https://github.com/allegroai/clearml/issues/316#issuecomment-788995387

  				
Posted 4 years ago by SuccessfulKoala55

TenseOstrich47 this looks like Elasticsearch is out of space...

  				
Posted 4 years ago by AgitatedDove14

TenseOstrich47 The storage in question here is what's available on the machine hosting the ClearML server's docker containers (specifically, the ES one).

  				
Posted 4 years ago by FrothyDog40

That should be the case; we have default_output_uri set to an S3 bucket.

  				
Posted 4 years ago by RoundCat60

I thought nothing should be stored locally on the agent? Shouldn't all files be logged to the storage rather than the instance itself?

  				
Posted 4 years ago by TenseOstrich47

RoundCat60 Hey Alex. Could you take a look at this when you're free later on, please?

  				
Posted 4 years ago by TenseOstrich47

TenseOstrich47 This typically indicates insufficient server disk space, which causes ES to go into read-only mode or to turn active shards into inactive or unassigned ones (see FAQ).

The disk watermarks controlling the ES free-disk constraints are defined by default as a percentage of the disk space (so it might look to you like you still have plenty of space, but ES thinks otherwise). You can configure different ES settings in the docker-compose.yml file (see here; there are 3 settings, and all can be identical).

If you don't have enough free disk space, clean up files to create more, or resize your partition (or increase your disk size if on a cloud instance).
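
To make the percentage-based watermarks concrete, here is a small Python sketch. It assumes Elasticsearch's stock defaults (low 85%, high 90%, flood stage 95% of disk used); the ClearML server's docker-compose.yml may set different values, so treat the thresholds and the function names as illustrative. For example, a 500 GB disk that is 96% full still has 20 GB free, yet it is already past the flood-stage watermark.

```python
import shutil

# Assumed thresholds: Elasticsearch's stock defaults, expressed as the
# fraction of the disk that is used. Your deployment may override them.
DEFAULT_WATERMARKS = {"low": 0.85, "high": 0.90, "flood_stage": 0.95}


def watermark_status(used_bytes, total_bytes, watermarks=DEFAULT_WATERMARKS):
    """Return which watermark (if any) the given disk usage has crossed."""
    used_fraction = used_bytes / total_bytes
    if used_fraction >= watermarks["flood_stage"]:
        return "flood_stage"  # ES blocks writes to affected indices
    if used_fraction >= watermarks["high"]:
        return "high"  # ES tries to move shards off this node
    if used_fraction >= watermarks["low"]:
        return "low"  # ES stops allocating new shards to this node
    return "ok"


def check_path(path="/"):
    """Classify the disk holding `path` (an approximation of ES's own check)."""
    usage = shutil.disk_usage(path)
    return watermark_status(usage.used, usage.total)


if __name__ == "__main__":
    print(check_path("/"))
```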

  				
Posted 4 years ago by FrothyDog40

I literally cannot reset a single task

  				
Posted 4 years ago by TenseOstrich47
