Setting ignore_remote_overrides = True helps solve the issue, but obviously we can't use it as a solution. What could make the lookup of parameter overrides in the backend take so much time? Is it a network issue? Do we perhaps need to change the machine's network configuration?
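One way to narrow this down is to time the same connect call with and without the flag on the affected machine. The following is only a minimal sketch: `timed_connect` is a hypothetical helper (not part of ClearML), and the commented usage assumes `task` comes from `Task.init(...)` and that the installed clearml version accepts `ignore_remote_overrides` in `task.connect`.

```python
import time


def timed_connect(connect_fn, config, name, **kwargs):
    """Call connect_fn(config, name=name, **kwargs) and report how long it took.

    connect_fn is expected to behave like task.connect; any extra keyword
    arguments (e.g. ignore_remote_overrides=True) are passed through as-is.
    """
    start = time.perf_counter()
    result = connect_fn(config, name=name, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Connected config: {name} in {elapsed:.2f} seconds")
    return result, elapsed


# Usage with ClearML (commented out so the sketch runs without a server;
# the config dict is a placeholder):
# from clearml import Task
# task = Task.init(project_name="...", task_name="...")
# timed_connect(task.connect, {"lr": 0.1}, "trainer")
# timed_connect(task.connect, {"lr": 0.1}, "trainer", ignore_remote_overrides=True)
```

If the second call is fast and the first is slow on the autoscaler machines only, that points at the override lookup against the backend rather than at the configuration objects themselves.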
Hi, We Are Migrating From Aws To Gcp Machines And We Experience Issues With
Hi, we are migrating from AWS to GCP machines and are experiencing issues with the task.connect function. On GCP machines spawned by the autoscaler (ClearML GCP autoscaler), task.connect takes a very long time to complete, for example:
Connected config: experiment_globals in 37.61 seconds
Connected config: data in 74.52 seconds
Connected config: augmentations in 50.19 seconds
Connected config: model in 64.26 seconds
Connected config: losses in 26.28 seconds
Connected config: trainer in 155.20 seconds
Connected config: datasets_config in 70.18 seconds
Connected config: model_architecture in 4191.67 seconds
Connected config: losses_config in 1254.73 seconds
Connected config: trainer_config in 58.90 seconds
As you can see, it takes hours to connect our configurations.
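The timings above can be totaled with a short script to see where the time goes. This is just a sketch that parses the log lines exactly as posted; the regular expression and the helper name `parse_timings` are illustrative choices, not anything provided by ClearML.

```python
import re

# Log lines copied verbatim from the output above.
LOG = """\
Connected config: experiment_globals in 37.61 seconds
Connected config: data in 74.52 seconds
Connected config: augmentations in 50.19 seconds
Connected config: model in 64.26 seconds
Connected config: losses in 26.28 seconds
Connected config: trainer in 155.20 seconds
Connected config: datasets_config in 70.18 seconds
Connected config: model_architecture in 4191.67 seconds
Connected config: losses_config in 1254.73 seconds
Connected config: trainer_config in 58.90 seconds
"""

PATTERN = re.compile(r"Connected config: (\S+) in ([\d.]+) seconds")


def parse_timings(log_text):
    """Return a {config_name: seconds} dict from 'Connected config' lines."""
    return {name: float(sec) for name, sec in PATTERN.findall(log_text)}


timings = parse_timings(LOG)
total = sum(timings.values())
slowest = max(timings, key=timings.get)

print(f"total: {total:.2f} s ({total / 3600:.2f} h)")
print(f"slowest: {slowest} ({timings[slowest]:.2f} s)")
```

The ten calls add up to roughly 1.66 hours, and model_architecture and losses_config alone account for about 91% of that.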
Using a GCP VM that we spawn manually, with the same machine image and not running in remote mode, we don't have this issue, and task.connect completes in a few seconds.
We would love to get some ideas on what might cause this. What should we look at, and where?
The full log file is attached for your convenience.