Hello, Does Anyone Else Have Trouble With Deleting Experiments? Sometimes When Deleting 10 Or So Experiments Some Errors Pop Out And The Entire System Becomes Unstable (Workers Do Not Show Up, Cannot Reset Experiments Etc) This Behaviour Was Not Fixed W

Answered

Hello,

Does anyone else have trouble with deleting experiments? Sometimes, when deleting 10 or so experiments, some errors pop up and the entire system becomes unstable (workers do not show up, experiments cannot be reset, etc.).

This behaviour was not fixed with 1.5.0. Skimming the logs showed quite a lot of timeouts.

  				
Posted 3 years ago by RotundHedgehog76


Answers 26

Okay, thank you for the suggestions, we'll try it out

  				
Posted 3 years ago by RotundHedgehog76

I would suggest (assuming the machine has enough RAM) setting it to at least -Xms4g -Xmx4g, and maybe more. You'll need at least twice that amount free for ES alone (so make sure your machine has at least 16GB RAM).

  				
Posted 3 years ago by SuccessfulKoala55
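
To make the sizing advice above concrete, here is a minimal sketch that reads the heap flags out of a JVM options string (the kind of value usually passed to Elasticsearch through the `ES_JAVA_OPTS` environment variable) and applies the "at least twice that free for ES alone" rule of thumb from this thread. The function names are hypothetical, and the factor of two comes from the advice above, not from any official sizing formula.

```python
import re

# Multipliers for the common size suffixes accepted by the JVM -Xms/-Xmx flags.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_heap_opts(java_opts: str) -> dict:
    """Return the -Xms/-Xmx values found in a JVM options string, in bytes."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)", java_opts):
        sizes[flag] = int(number) * _UNITS[unit.lower()]
    return sizes


def es_free_ram_needed(java_opts: str) -> int:
    """Bytes of free RAM suggested for ES: twice the maximum heap (-Xmx).

    The factor of two is the rule of thumb quoted in this thread (assumption).
    """
    sizes = parse_heap_opts(java_opts)
    if "Xmx" not in sizes:
        raise ValueError("no -Xmx flag found in options string")
    return 2 * sizes["Xmx"]


if __name__ == "__main__":
    gib = 1024**3
    for opts in ("-Xms2g -Xmx2g", "-Xms4g -Xmx4g"):
        print(opts, "->", es_free_ram_needed(opts) // gib, "GiB free for ES")
```

With `-Xms4g -Xmx4g` this suggests 8 GiB free for ES alone, which is consistent with the recommendation of a machine with at least 16GB RAM once the other services are accounted for.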

So currently it's -Xms2g -Xmx2g which means 2GB

  				
Posted 3 years ago by SuccessfulKoala55

sure

  				
Posted 3 years ago by RotundHedgehog76

Can you send what you have there now?

  				
Posted 3 years ago by SuccessfulKoala55

We didn't change anything from the defaults in your GitHub 😄 so it's 500M?

  				
Posted 3 years ago by RotundHedgehog76

it's in the default env vars for elasticsearch in the docker compose

  				
Posted 3 years ago by SuccessfulKoala55

how much memory do you have assigned to ES?

  				
Posted 3 years ago by SuccessfulKoala55

i think you're right, the default elastic values do not seem to work for us

  				
Posted 3 years ago by RotundHedgehog76

the entire index is 35G

  				
Posted 3 years ago by RotundHedgehog76

How large are your ES indices? Maybe this is ES being inefficient?

  				
Posted 3 years ago by SuccessfulKoala55

I guess I'll let you know the next time this happens haha

  				
Posted 3 years ago by RotundHedgehog76

No errors in logs, but that's because I restarted the deployment :(

  				
Posted 3 years ago by RotundHedgehog76

This was actually a reset (of a single experiment), not a delete

  				
Posted 3 years ago by RotundHedgehog76

Any error in the apiserver log? (sudo docker logs clearml-apiserver)

  				
Posted 3 years ago by SuccessfulKoala55

And you deleted a single experiment? Or many?

  				
Posted 3 years ago by SuccessfulKoala55

Yes, that's right. We deployed it on a GCP instance

  				
Posted 3 years ago by RotundHedgehog76

So this seems to be purely a load issue. Can you remind me what deployment type you are using? docker-compose, right?

  				
Posted 3 years ago by SuccessfulKoala55

Nothing at all. There are only 2 log entries from this day, and both were at 2am

  				
Posted 3 years ago by RotundHedgehog76

Can you try to get the ES log using docker logs clearml-elastic?

  				
Posted 3 years ago by SuccessfulKoala55
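
Since the logs were lost after a restart in one of the replies above, it may help to capture them before bringing the deployment down. The following is a rough sketch, not an official ClearML tool: it shells out to `docker logs` for the two container names mentioned in this thread and keeps only lines that look like errors or timeouts. The keyword list and function names are illustrative assumptions, and depending on the host you may need to run it with sufficient privileges (the thread uses `sudo`).

```python
import subprocess

# Substrings treated as signs of trouble; an illustrative choice, adjust as needed.
KEYWORDS = ("error", "timeout", "timed out", "exception")


def suspicious_lines(log_text: str, keywords=KEYWORDS) -> list:
    """Return the log lines that contain any keyword, case-insensitively."""
    return [
        line
        for line in log_text.splitlines()
        if any(k in line.lower() for k in keywords)
    ]


def container_log(name: str, tail: int = 500) -> str:
    """Fetch the last `tail` log lines of a container via `docker logs`."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), name],
        capture_output=True,
        text=True,
        check=True,
    )
    # Container stdout and stderr arrive on separate streams; keep both.
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Container names as mentioned in this thread.
    for name in ("clearml-apiserver", "clearml-elastic"):
        for line in suspicious_lines(container_log(name)):
            print(f"[{name}] {line}")
```

Running it right after an error dialog or a 504 appears, and before `docker compose down`, keeps the relevant lines available for the next round of debugging.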

Hello, a similar thing happened today. In the developer's console there was this line

https://server/api/v2.19/tasks.reset_many 504 (Gateway time-out)

  				
Posted 3 years ago by RotundHedgehog76

For now, docker compose down && docker compose up -d helps

  				
Posted 3 years ago by RotundHedgehog76

I haven't looked, I'll let you know next time it happens

  				
Posted 3 years ago by RotundHedgehog76

Anything you can see in the browser's JS console or in the Developer Tools Network section?

  				
Posted 3 years ago by SuccessfulKoala55

Errors pop up occasionally in the Web UI. All we see is a dialog with the text "Error"

  				
Posted 3 years ago by RotundHedgehog76

Hi RotundHedgehog76 ,
Where exactly do you see errors?

  				
Posted 3 years ago by SuccessfulKoala55

