ElatedRaven55, what if you manually start the agent on the manually spun-up machine and then push the experiment for execution from there?
CostlyOstrich36
If I understand correctly, then I believe that's exactly what I did in the logs provided in my previous comment.
I've also run experiments where I execute the task manually in the VM, without a clearml-agent listening to the queue, and in those cases the results and connect calls were fast as well.
Out of the 3 types of execution I've tested, the only problematic case with the long connect times is the one where I push tasks to the queue and the autoscaler listens and spins up VMs automatically, and these time ratios are consistent on every test.
ElatedRaven55, can you please try again, but this time set CLEARML_API_VERBOSE=DEBUG?
CostlyOstrich36
We created a very basic task to demonstrate the difference in run times between a task executed on an autoscaler-spun-up instance VS a manually spun-up instance with clearml-agent.
The task code is as follows:
from clearml import Task
import time
mydict = {"a": 1, "b": 2}
task = Task.init(project_name="test", task_name="test_small_dict")
task.execute_remotely(queue_name="tomer_queue")
# measure the time the function executes
start = time.time()
task.connect(mydict)
end = time.time()
print("Time elapsed: ", end - start)
The instance from the autoscaler took 13.647857427597046 seconds
VS
the manual clearml-agent, which took 1.5556252002716064 seconds.
That's roughly 9 times slower!
I'm providing the full logs of both experiments.
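For what it's worth, a variant along these lines would split the timing between Task.init() and task.connect(), so the logs would show which of the two calls actually carries the delay (just a sketch with a made-up task name, same queue):
from clearml import Task
import time

mydict = {"a": 1, "b": 2}

# time Task.init() separately from task.connect()
t0 = time.perf_counter()
task = Task.init(project_name="test", task_name="test_small_dict_timed")  # task name made up for this sketch
t1 = time.perf_counter()

task.execute_remotely(queue_name="tomer_queue")

# when already running on the agent, execute_remotely() just returns and execution continues here
t2 = time.perf_counter()
task.connect(mydict)
t3 = time.perf_counter()

print("Task.init elapsed:    ", t1 - t0)
print("task.connect elapsed: ", t3 - t2)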
CostlyOstrich36 Thanks for the reply!
When spinning up a machine manually with the same base image and running the task without the autoscaler, the issue does not happen; it only happens with instances that the autoscaler creates
Hi DangerousBee35, it sounds like some sort of network lag. I assume you are using app.clear.ml?
I'd check network latency from the instances starting in GCP to the server.
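A quick way to get a rough number would be to time a few HTTPS round trips from inside the GCP VM, e.g. something like this (just a sketch; using api.clear.ml as the API host here is my assumption):
# rough round-trip timing from inside the VM to the ClearML hosts
import time
import urllib.error
import urllib.request

for url in ("https://app.clear.ml", "https://api.clear.ml"):
    samples = []
    for _ in range(5):
        start = time.perf_counter()
        try:
            urllib.request.urlopen(url, timeout=10).read()
        except urllib.error.HTTPError:
            pass  # a non-2xx reply still means the round trip completed
        samples.append(time.perf_counter() - start)
    print(url, "-> average over", len(samples), "requests:", round(sum(samples) / len(samples), 3), "seconds")
If the averages from an autoscaler-created VM are much higher than from a manually created one, that would point at the network setup rather than the SDK.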
CostlyOstrich36
Same problem here. I migrated my autoscaler workloads from AWS EC2 instances to GCP VMs with the same base images and Docker images on top, and running the exact same tasks takes around 45 extra minutes in the task.connect() step when triggered from the GCP VMs, compared to the AWS instances, which pass this step in less than a minute.
Using the managed clearml server (app.clear.ml).
#MeToo
Can you provide a full log of the VM when spun up manually vs. when spun up by the autoscaler? Also, I'd try manually spinning up a VM, then running an agent on it manually, and seeing if the issue reproduces
CostlyOstrich36 thanks for the reply!
Yes, I'm using app.clear.ml.
The VM is initialized via the ClearML autoscalers. With the AWS autoscaler I didn't have to do any network configuration there, so I assume it should be the same for the GCP VMs.
Can you point me to tests that would reveal latency issues?
You're sharing the same workspace, so it makes sense that you'd encounter the same issue, being on the same network 🙂
ElatedRaven55, if you manually spin up the machines, does the issue reproduce? Did you try running the exact same VM setup manually?
DangerousBee35, I'd ask your DevOps team to check whether something might be slowing down communication from your new network in GCP to the app.clear.ml server
Hi ElatedRaven55, in order to get more visibility into the API calls that seem to take longer, can you please try to run both experiments with the env var CLEARML_API_VERBOSE=true
set for the container running the experiment? (the easiest way would probably be to add -e CLEARML_API_VERBOSE=true
to the extra arguments in the task's container settings)
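If editing the container settings turns out to be inconvenient, setting the variable from the task script itself should have a similar effect. Something along these lines (a sketch, assuming the SDK picks the variable up when Task.init() creates the session):
import os

# assumption: the SDK reads this env var when the session is created,
# so it must be set before Task.init() is called (and before the import, to be safe)
os.environ["CLEARML_API_VERBOSE"] = "true"

from clearml import Task

task = Task.init(project_name="test", task_name="test_small_dict")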
CostlyOstrich36 SuccessfulKoala55
Is there anything else I can provide so you can proceed with investigating the issue?
SuccessfulKoala55 I tried it as you asked, but it just makes the tasks fail; apparently 'DEBUG' is not a valid value for the CLEARML_API_VERBOSE field, and only true/false are valid values.
I did find another valid option which might be what you meant, though:
"-e=CLEARML_LOG_LEVEL=DEBUG"
I'm providing the logs for the new tests with this variable set, but I'm pretty sure it makes no difference in the logs, especially not for anything related to our issue with the task.connect() function (there are no added prints/logs around the execution of task.connect()).
I'm providing logs of the autoscaler run that took ~21.8 seconds for a simple small dict connect
VS
the manually spun-up clearml-agent listener on a manually created VM, which took ~1.4 seconds
SuccessfulKoala55 I added -e CLEARML_API_VERBOSE=true to the configuration as you asked, although I'm not sure it made any difference to the actual logs.
I'm providing logs of the autoscaler run that took ~20.8 seconds for a simple small dict connect
VS
the manually spun-up clearml-agent listener on a manually created VM, which took ~1.5 seconds