
@<1523701070390366208:profile|CostlyOstrich36> Thanks for the reply!
When spinning up a machine manually with the same base image and running the task without the autoscaler, the issue does not happen; it only occurs on instances that the autoscaler creates.
@<1523701070390366208:profile|CostlyOstrich36>
Same problem here. I migrated my autoscaler workloads from AWS EC2 instances to GCP VMs with the same base images and docker images on top. Running the exact same tasks, the task.connect() step takes around 45 minutes extra when triggered from the GCP VMs, compared to the AWS instances, which pass this step in less than a minute.
Using the managed ClearML server (app.clear.ml).
#MeToo
@<1523701070390366208:profile|CostlyOstrich36> @<1523701087100473344:profile|SuccessfulKoala55>
Is there anything else I can provide you to proceed with understanding the issues?
@<1523701070390366208:profile|CostlyOstrich36>
If I understand correctly, then I believe that's exactly what I did in the logs provided in the previous comment.
I have also run experiments where I run the task manually on the VM, without a clearml-agent listening to the queue, and in those cases the results and connects were also fast.
Of all 3 types of execution I've tested, the only problematic case with the long connect times is the one where I push tasks to ...
@<1523701087100473344:profile|SuccessfulKoala55> I tried it as you asked; it just makes the tasks fail. Apparently 'DEBUG' is not a valid value for the 'CLEARML_API_VERBOSE' field, and only true/false are valid values.
I did find another option which is valid and might be what you meant, though:
"-e=CLEARML_LOG_LEVEL=DEBUG"
I am providing the logs for the new tests with this variable set, but I am pretty sure it makes no difference in the logs, especially not with anythin...
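(In case it helps anyone following along: the same verbosity can also be forced from inside the task script itself instead of via docker "-e" arguments. This is just a minimal sketch, assuming the CLEARML_LOG_LEVEL and CLEARML_API_VERBOSE environment variables are read by the SDK at import/init time, and that the autoscaler propagates nothing else:)

import os

# set before importing clearml so the SDK picks these up (assumption: read at import/init time)
os.environ["CLEARML_LOG_LEVEL"] = "DEBUG"   # SDK console log verbosity
os.environ["CLEARML_API_VERBOSE"] = "true"  # verbose API request logging (boolean values only)

from clearml import Task

task = Task.init(project_name="test", task_name="debug_logging_check")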
@<1523701070390366208:profile|CostlyOstrich36>
We created a very basic and simple task to demonstrate the difference in times between a task running on an instance spun up by the autoscaler VS a manually spun-up instance with clearml-agent.
The task code is as follows:
from clearml import Task
import time
mydict = {"a": 1, "b": 2}
task = Task.init(project_name="test", task_name="test_small_dict")
task.execute_remotely(queue_name="tomer_queue")
# measure the time the function executes
star...
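(The snippet above is cut off; here is a rough reconstruction of the full test, assuming the truncated part simply wraps task.connect(mydict) in time.time() calls. The variable names after the cut are my guess:)

from clearml import Task
import time

mydict = {"a": 1, "b": 2}
task = Task.init(project_name="test", task_name="test_small_dict")
task.execute_remotely(queue_name="tomer_queue")
# measure the time the connect call takes
start = time.time()
task.connect(mydict)
print("task.connect() took {:.2f} seconds".format(time.time() - start))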
@<1523701087100473344:profile|SuccessfulKoala55> I added the -e CLEARML_API_VERBOSE=true to the configuration as you asked, although I am not sure it made any changes to the actual logs.
I'm providing logs of the autoscaler instance, which took ~20.8 seconds for a simple small-dict connect
VS
the clearml-agent listener on a manually created VM, which took ~1.5 seconds