Answered
Hey, I'm Using ClearML GCP Autoscaler And It Seems That

hey, I'm using the ClearML GCP autoscaler and it seems that task.connect is very slow compared to the same setup with the ClearML AWS autoscaler. It can sometimes take up to an hour to complete the function call. Is there any reason for that?

  
  
Posted 22 days ago

Answers 14


ElatedRaven55 , what if you manually spin up the agent on the manually spun machine and then push the experiment for execution from there?

  
  
Posted 22 days ago

CostlyOstrich36
If I understand correctly, then I believe that's exactly what I did in the previous comment where I provided the logs.
I have also run experiments where I start the task manually on the VM, without a clearml-agent listening to the queue, and in those runs the results and connects were fast as well.
Of all 3 types of execution I've tested, the only problematic case with long connect times is the one where I push tasks to the queue and the autoscaler is listening and spinning up VMs automatically, and these time ratios are consistent on every test.

  
  
Posted 22 days ago

ElatedRaven55 can you please try again, but this time set CLEARML_API_VERBOSE=DEBUG?

  
  
Posted 10 days ago

CostlyOstrich36
We created a very basic task to demonstrate the difference in times between a task running on an autoscaler-spun instance VS a manually spun instance with clearml-agent.
The task code is as follows:

from clearml import Task
import time

mydict = {"a": 1, "b": 2}
task = Task.init(project_name="test", task_name="test_small_dict")
# enqueue the task and stop the local run; the lines below execute on the worker
task.execute_remotely(queue_name="tomer_queue")

# measure how long the task.connect() call takes
start = time.time()
task.connect(mydict)
end = time.time()
print("Time elapsed: ", end - start)

the instance from the autoscaler took 13.647857427597046 seconds
VS
the manual clearml-agent took 1.5556252002716064 seconds

that's roughly 9 times slower!

I'm providing the full logs of both experiments.

  
  
Posted 22 days ago

CostlyOstrich36 Thanks for the reply!
When spinning up a machine manually with the same base image and running the task without the autoscaler, the issue does not happen; it only occurs with instances that the autoscaler creates

  
  
Posted 22 days ago

Hi DangerousBee35 , it sounds like some sort of network lag. I assume you are using app.clear.ml?

I'd check network latency from the instances starting in GCP to the server.
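For example, something rough like the sketch below could be run on both a GCP VM and an AWS instance and compared; it just times a few TCP connections, and assumes api.clear.ml:443 is the relevant endpoint:

import socket
import time

# Rough latency probe: time a few TCP connections to the hosted ClearML server.
# api.clear.ml:443 is assumed to be the relevant endpoint here.
HOST, PORT = "api.clear.ml", 443

samples = []
for _ in range(5):
    start = time.time()
    # create_connection opens a TCP socket; the context manager closes it right away
    with socket.create_connection((HOST, PORT), timeout=10):
        pass
    samples.append(time.time() - start)

print("TCP connect times (s):", [round(s, 3) for s in samples])
print("average: %.3f s" % (sum(samples) / len(samples)))

If the GCP numbers are in the same ballpark as the AWS ones, raw connectivity is probably not the bottleneck.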

  
  
Posted 22 days ago

CostlyOstrich36
Same problem here. I migrated my autoscaler workloads from AWS EC2 instances to GCP VMs with the same base images and docker images on top; running the exact same tasks results in around 45 extra minutes for the task.connect() step when triggered from the GCP VMs, compared to AWS instances that pass this step in less than a minute.
Using the managed ClearML server (app.clear.ml).
Me too.

  
  
Posted 22 days ago

Can you provide a full log of the VM when spun up manually vs when spun up by the autoscaler? Also, I'd try manually spinning up a VM, running an agent on it manually, and seeing if the issue reproduces

  
  
Posted 22 days ago

CostlyOstrich36 thanks for the reply!

Yes, I'm using app.clear.ml.
The VM is initialized via the ClearML autoscalers; in the AWS autoscaler I didn't have to do any network configuration there, so I assume it should be the same for the GCP VMs.

Can you direct me to tests that should reveal lagging issues?

  
  
Posted 22 days ago

Sharing the same workspace so it makes sense that you'd encounter the same issue being on the same network 🙂

ElatedRaven55, if you manually spin up the machines, does the issue reproduce? Did you try running the exact same VM setup manually?
DangerousBee35 , I'd ask the DevOps to check if there might be something slowing communication from your new network in GCP to the app.clear.ml server

  
  
Posted 22 days ago

Hi ElatedRaven55 , in order to get more visibility for the API calls that seem to take longer, can you please try to run both experiments with the env var CLEARML_API_VERBOSE=true set for the container running the experiment? (the easiest way would probably be to add -e CLEARML_API_VERBOSE=true to the extra arguments for the task's container settings)
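If editing the container settings is inconvenient, a sketch of an in-script alternative is to set the variable at the very top of the task script, before the clearml import; whether the SDK picks it up that way is an assumption on my part, and the task name below is just a placeholder:

import os

# Assumption: CLEARML_API_VERBOSE is read when the API session is created,
# so setting it before the clearml import should behave like the "-e" argument.
os.environ["CLEARML_API_VERBOSE"] = "true"

from clearml import Task

# placeholder task name, just for the verbose test run
task = Task.init(project_name="test", task_name="test_small_dict_verbose")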

  
  
Posted 18 days ago

CostlyOstrich36 SuccessfulKoala55
Is there anything else I can provide to help you proceed with understanding the issue?

  
  
Posted 12 days ago

SuccessfulKoala55 I tried it as you asked, but it just makes the tasks fail; apparently 'DEBUG' is not a valid value for the CLEARML_API_VERBOSE field, and only true/false are valid values.
I did find another option which is valid and might be what you meant, though:
"-e=CLEARML_LOG_LEVEL=DEBUG"
I am providing the logs for the new tests with this variable set, but I am fairly sure it makes no difference in the logs, especially not in anything related to our issue with the task.connect() function (there are no added prints/logs around the execution of task.connect()).
I'm providing logs of the autoscaler run, which took ~21.8 seconds for a simple small dict connect,
VS
the manually spun clearml-agent listener on a manually created VM, which took ~1.4 seconds
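For the next round I can also try a variation of the test script with plain Python DEBUG logging enabled, in case the HTTP-level messages show where the time goes; whether the SDK's and urllib3's messages actually reach the root handler with this setup is an assumption:

import logging
import time

# Root logger at DEBUG; messages from the underlying HTTP libraries (e.g. urllib3)
# normally propagate here, which can expose slow or retried requests.
logging.basicConfig(level=logging.DEBUG)

from clearml import Task

mydict = {"a": 1, "b": 2}
# placeholder task name; same queue as the earlier test
task = Task.init(project_name="test", task_name="test_small_dict_debug")
task.execute_remotely(queue_name="tomer_queue")

start = time.time()
task.connect(mydict)
print("connect took %.2f s" % (time.time() - start))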

  
  
Posted 9 days ago

SuccessfulKoala55 I added the -e CLEARML_API_VERBOSE=true to the configuration as you asked, although I am not sure it made any changes to the actual logs.
I'm providing logs of the autoscaler run, which took ~20.8 seconds for a simple small dict connect,
VS
the manually spun clearml-agent listener on a manually created VM, which took ~1.5 seconds

  
  
Posted 18 days ago