Hi, I Would Like To Check What The Recommended Hardware Specs Are For A Server Hosting ClearML Server. I Had One Configured With 32 CPU Cores And 64GB RAM, And I Noticed That If We Have A Surge In Remote Task Creation, The Following Delays Occur.


Hi, I would like to check what the recommended hardware specs are for a server hosting ClearML Server.

I had one configured with 32 CPU cores and 64GB RAM, and I noticed that if we have a surge in remote task creation, the following delays occur.
1. Each individual task creation can be delayed for quite a while, compared to no delay when only one or two tasks are created.

task = Task.init(...)
task.set_base_docker("...")
task.execute_remotely(..., exit_process=True)

In the following logs on the client, the execution could 'hang' for up to 20 seconds on any line.

ClearML Task: created new task id=7563485622
ClearML results page: https://...../output/log
clearml.Task - INFO - Waiting for repository detection and full package analysis
clearml - WARNING - Switching to remote execution, outout log page http://.../output/log
clearml - WARNING - Terminating local execution process

2. The ClearML web interface starts to lag significantly.

  				
Posted 4 years ago by SubstantialElk6


Answers 10

We are using k8s glue to spawn the job. ...

I think this is actual network latency and has nothing to do with the jobs. Could it be that the server is very far away?
What happens when you manually start a Task from your machine?
Is the latency fixed? Is it just when starting a new Task?
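
To check the network-latency hypothesis, one rough approach is to time a few TCP connections from the client machine to the API server. The sketch below is a minimal illustration using only the Python standard library. The function name `measure_connect_latency` and the host and port in the usage comment are assumptions for illustration, not values from this thread. It measures only the TCP handshake, so it will not capture time spent inside the server handling a request.

```python
import socket
import time


def measure_connect_latency(host, port, attempts=5, timeout=5.0):
    """Return TCP connect times, in milliseconds, for `attempts` connections."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Open and immediately close a connection; this times the handshake only.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Example usage (hypothetical host and port, replace with your API server address):
#   samples = measure_connect_latency("clearml.example.com", 8008)
#   print(f"min={min(samples):.1f} ms  max={max(samples):.1f} ms")
```

Running this once while the system is idle and once during a surge of task creation would show whether connect times grow with load or stay flat.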

  				
Posted 4 years ago by AgitatedDove14

Wait, I might be completely off.
Is this the line that "hangs"?

task.execute_remotely(..., exit_process=True)

  				
Posted 4 years ago by AgitatedDove14

We are running on a 1 Gbps backend.

  				
Posted 4 years ago by SubstantialElk6

We are using k8s glue to spawn the job. Would you be able to describe in detail the steps that take place when the above code executes?

  				
Posted 4 years ago by SubstantialElk6

SubstantialElk6, is this the issue?

  				
Posted 4 years ago by AgitatedDove14

Hi SubstantialElk6

32 CPU cores, 64GB RAM

That should be plenty. This sounds like a network bottleneck issue; I can't imagine the server is actually CPU bound.

  				
Posted 4 years ago by AgitatedDove14

Hi, I will have to get back to you on this. I need to check every client's repo to test your hypothesis.

  				
Posted 4 years ago by SubstantialElk6

If the only issue is this line:

task.execute_remotely(..., exit_process=True)

then it has to finish the static analysis of the entire repository (which usually happens in the background, but here we have to wait for it). If the repo is large, this could actually take 20 seconds (depending on the CPU and drive of the machine itself).
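
One way to see which call is actually slow is to time each step on the client. The following is a minimal sketch, not part of ClearML: the `timed` helper, the `create_remote_task` wrapper, and its parameters (`project_name`, `task_name`, `docker_image`, `queue_name`) are hypothetical names introduced here to stand in for the arguments elided in the question.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record and print the wall-clock duration of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        # Flush immediately so the line is visible even if the process exits soon after.
        print(f"{label}: {results[label]:.2f}s", flush=True)


def create_remote_task(project_name, task_name, docker_image, queue_name):
    """Run the three calls from the question, timing each one (hypothetical wrapper)."""
    from clearml import Task  # imported here so `timed` works without clearml installed

    results = {}
    with timed("Task.init", results):
        task = Task.init(project_name=project_name, task_name=task_name)
    with timed("set_base_docker", results):
        task.set_base_docker(docker_image)
    with timed("execute_remotely", results):
        # With exit_process=True the process ends inside this call. The duration is
        # printed only if the exit is raised as SystemExit, which is an assumption
        # that has not been verified against the library.
        task.execute_remotely(queue_name=queue_name, exit_process=True)
    return results
```

If `execute_remotely` dominates on clients with large repositories and is short on small ones, that would support the repository-analysis explanation rather than a server-side bottleneck.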

  				
Posted 4 years ago by AgitatedDove14

The server is running only the ClearML components. Could you advise on the ELB part? How should we optimise it?

  				
Posted 4 years ago by SubstantialElk6

no worries

  				
Posted 4 years ago by AgitatedDove14
