ElatedRaven55, what if you manually start the agent on the manually spun-up machine and then push the experiment for execution from there?
CostlyOstrich36
If I understand correctly, then I believe that's exactly what I did in the logs provided in my previous comment.
I've also run experiments where I execute the task manually in the VM, without a clearml-agent listening to the queue, and in those cases the results and connect calls were fast as well.
Out of the 3 types of execution I've tested, the only problematic case with the long connect times is the one where I push tasks to the queue and the autoscaler listens and spins up VMs automatically, and these time ratios are consistent on every test.
ElatedRaven55, can you please try again, but this time set CLEARML_API_VERBOSE=DEBUG?
CostlyOstrich36
We created a very basic task to demonstrate the difference in run times between a task executed on an autoscaler-spun-up instance VS a manually spun-up instance with clearml-agent.
The task code is as follows:
from clearml import Task
import time
mydict = {"a": 1, "b": 2}
task = Task.init(project_name="test", task_name="test_small_dict")
task.execute_remotely(queue_name="tomer_queue")
# measure the time the function executes
start = time.time()
task.connect(mydict)
end = time.time()
print("Time elapsed: ", end - start)
The instance from the autoscaler took 13.647857427597046 seconds
VS
the manual clearml-agent, which took 1.5556252002716064 seconds.
That's roughly 9 times slower!
I'm providing the full logs of both experiments.
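For what it's worth, a variant along these lines would split the timing between Task.init() and task.connect(), so the logs would show which of the two calls actually carries the delay (just a sketch with a made-up task name, same queue):
from clearml import Task
import time

mydict = {"a": 1, "b": 2}

# time Task.init() separately from task.connect()
t0 = time.perf_counter()
task = Task.init(project_name="test", task_name="test_small_dict_timed")  # task name made up for this sketch
t1 = time.perf_counter()

task.execute_remotely(queue_name="tomer_queue")

# when already running on the agent, execute_remotely() just returns and execution continues here
t2 = time.perf_counter()
task.connect(mydict)
t3 = time.perf_counter()

print("Task.init elapsed:    ", t1 - t0)
print("task.connect elapsed: ", t3 - t2)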
CostlyOstrich36 Thanks for the reply!
When spinning up a machine manually with the same base image and running the task without the autoscaler, the issue does not happen; it only happens with instances that the autoscaler creates
Hi DangerousBee35, it sounds like some sort of network lag. I assume you are using app.clear.ml?
I'd check network latency from the instances starting in GCP to the server.
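A quick way to get a rough number would be to time a few HTTPS round trips from inside the GCP VM, e.g. something like this (just a sketch; using api.clear.ml as the API host here is my assumption):
# rough round-trip timing from inside the VM to the ClearML hosts
import time
import urllib.error
import urllib.request

for url in ("https://app.clear.ml", "https://api.clear.ml"):
    samples = []
    for _ in range(5):
        start = time.perf_counter()
        try:
            urllib.request.urlopen(url, timeout=10).read()
        except urllib.error.HTTPError:
            pass  # a non-2xx reply still means the round trip completed
        samples.append(time.perf_counter() - start)
    print(url, "-> average over", len(samples), "requests:", round(sum(samples) / len(samples), 3), "seconds")
If the averages from an autoscaler-created VM are much higher than from a manually created one, that would point at the network setup rather than the SDK.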
CostlyOstrich36
Same problem here. I migrated my autoscaler workloads from AWS EC2 instances to GCP VMs with the same base images and Docker images on top, and running the exact same tasks takes around 45 extra minutes in the task.connect() step when triggered from the GCP VMs, compared to the AWS instances, which pass this step in less than a minute.
Using the managed clearml server (app.clear.ml).
#MeToo
Can you provide a full log of the VM when spun up manually vs. when spun up by the autoscaler? Also, I'd try manually spinning up a VM, then running an agent on it manually, and seeing if the issue reproduces
CostlyOstrich36 thanks for the reply!
Yes, I'm using app.clear.ml.
The VM is initialized via the ClearML autoscalers. With the AWS autoscaler I didn't have to do any network configuration there, so I assume it should be the same for the GCP VMs.
Can you point me to tests that would reveal latency issues?
You're sharing the same workspace, so it makes sense that you'd encounter the same issue, being on the same network 🙂
ElatedRaven55, if you manually spin up the machines, does the issue reproduce? Did you try running the exact same VM setup manually?
DangerousBee35, I'd ask your DevOps team to check whether something might be slowing down communication from your new network in GCP to the app.clear.ml server
Hi ElatedRaven55, in order to get more visibility into the API calls that seem to take longer, can you please try to run both experiments with the env var CLEARML_API_VERBOSE=true
set for the container running the experiment? (the easiest way would probably be to add -e CLEARML_API_VERBOSE=true
to the extra arguments in the task's container settings)
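If editing the container settings turns out to be inconvenient, setting the variable from the task script itself should have a similar effect. Something along these lines (a sketch, assuming the SDK picks the variable up when Task.init() creates the session):
import os

# assumption: the SDK reads this env var when the session is created,
# so it must be set before Task.init() is called (and before the import, to be safe)
os.environ["CLEARML_API_VERBOSE"] = "true"

from clearml import Task

task = Task.init(project_name="test", task_name="test_small_dict")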
CostlyOstrich36 SuccessfulKoala55
Is there anything else I can provide so you can proceed with investigating the issue?
SuccessfulKoala55 I tried it as you asked, but it just makes the tasks fail; apparently 'DEBUG' is not a valid value for the CLEARML_API_VERBOSE field, and only true/false are valid values.
I did find another valid option which might be what you meant, though:
"-e=CLEARML_LOG_LEVEL=DEBUG"
I'm providing the logs for the new tests with this variable set, but I'm pretty sure it makes no difference in the logs, especially not for anything related to our issue with the task.connect() function (there are no added prints/logs around the execution of task.connect()).
I'm providing logs of the autoscaler run that took ~21.8 seconds for a simple small dict connect
VS
the manually spun-up clearml-agent listener on a manually created VM, which took ~1.4 seconds
SuccessfulKoala55 I added -e CLEARML_API_VERBOSE=true to the configuration as you asked, although I'm not sure it made any difference to the actual logs.
I'm providing logs of the autoscaler run that took ~20.8 seconds for a simple small dict connect
VS
the manually spun-up clearml-agent listener on a manually created VM, which took ~1.5 seconds