Hey, can anyone explain this? My max number of instances is 5, but it's showing 8 instances, which does not make sense.

Same config, no change.

  				
Posted 2 years ago by FloppySwallow46

I am on GCP right now, not working with AWS.

  				
Posted 2 years ago by FloppySwallow46

2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - --- Cloud instances (8) ---
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1110204948425405426, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1364006518840029853, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4551653386764087872, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4704932875408438200, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4875556593045271512, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 8497507852406420461, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 8578622755082531742, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 8829624801389113929, regular
2023-05-22 11:14:25,698 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-05-22 16:15:06
2023-05-22 11:14:35,070 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8829624801389113929/8829624801389113929.txt (429): 
2023-05-22 11:14:35,070 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8829624801389113929/8829624801389113929.txt (429): )
2023-05-22 11:14:35,070 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 11:14:40,149 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/1110204948425405426/1110204948425405426.txt (429): 
2023-05-22 11:14:40,149 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/1110204948425405426/1110204948425405426.txt (429): )
2023-05-22 11:14:40,150 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 16:16:14
2023-05-22 11:15:15,200 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8578622755082531742/8578622755082531742.txt (429): 
2023-05-22 11:15:15,200 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8578622755082531742/8578622755082531742.txt (429): )
2023-05-22 11:15:15,201 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 11:15:18,434 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-05-22 11:15:20,248 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4875556593045271512/4875556593045271512.txt (429): 
2023-05-22 11:15:20,248 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4875556593045271512/4875556593045271512.txt (429): )
2023-05-22 11:15:20,249 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 16:17:26
2023-05-22 11:16:25,335 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4704932875408438200/4704932875408438200.txt (429): 
2023-05-22 11:16:25,336 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4704932875408438200/4704932875408438200.txt (429): )
2023-05-22 11:16:25,336 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 16:18:01
2023-05-22 11:17:35,401 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4551653386764087872/4551653386764087872.txt (429): 
2023-05-22 11:17:35,401 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4551653386764087872/4551653386764087872.txt (429): )
2023-05-22 11:17:35,402 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 16:18:27
2023-05-22 11:18:24,727 - usage_reporter - INFO - Sending usage report for 186 usage seconds, 1 units
2023-05-22 16:19:07
2023-05-22 11:18:50,496 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4551653386764087872/4551653386764087872.txt (429): 
2023-05-22 11:18:50,496 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/4551653386764087872/4551653386764087872.txt (429): )
2023-05-22 11:18:50,497 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 16:20:14
2023-05-22 11:19:15,562 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8497507852406420461/8497507852406420461.txt (429): 
2023-05-22 11:19:15,562 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8497507852406420461/8497507852406420461.txt (429): )
2023-05-22 11:19:15,563 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
2023-05-22 11:19:24,756 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-05-22 16:21:27
2023-05-22 11:20:25,636 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8829624801389113929/8829624801389113929.txt (429): 
2023-05-22 11:20:25,636 - clearml.metrics - WARNING - Failed uploading to

 (Failed uploading object /_ApplicationInstances/gcp-autoscaler/autoscaler_pt1.7.b66b711cc90547bda5caf6aa0b508d77/artifacts/8829624801389113929/8829624801389113929.txt (429): )
2023-05-22 11:20:25,637 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed

  				
Posted 2 years ago by FloppySwallow46

@<1570583227918192640:profile|FloppySwallow46> Can you check the instances in the AWS dashboard? Is it possible they are stuck and the autoscaler cannot communicate with them?

  				
Posted 2 years ago by SuccessfulKoala55

We limit the allowed calls per IP - to make sure the server is not blocked accidentally. We enabled over 1000 calls per minute.

  				
Posted 2 years ago by CumbersomeCormorant74
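
For context on the `(429)` upload errors in the autoscaler log: HTTP 429 is the standard "Too Many Requests" status, which fits the per-IP call limit described in this reply. Below is a minimal sketch of how a client could retry with exponential backoff when it hits such a limit. `RateLimited`, `upload_with_backoff`, and the `upload` callable are hypothetical names used only for illustration; they are not part of the ClearML SDK, and this is not a description of how ClearML itself retries.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error: raised by the caller's upload function on HTTP 429."""


def upload_with_backoff(upload, max_attempts=5, base_delay=1.0,
                        max_delay=30.0, sleep=time.sleep):
    """Call upload() and retry with exponential backoff while it raises RateLimited.

    The wait before retry k (0-based) is min(max_delay, base_delay * 2**k)
    plus random jitter of up to half that value, so that several clients
    behind the same IP do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return upload()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so the retry schedule can be checked without actually waiting.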

How did the tasks fail?

  				
Posted 2 years ago by CostlyOstrich36

As for the limit issue, can you please include the complete autoscaler log?

  				
Posted 2 years ago by SuccessfulKoala55

I removed the credentials from the config for security purposes and replaced them with XYZ.

{
    "gcp_project_id": "XYZ",
    "gcp_zone": "us-central1-b",
    "gcp_credentials": "XYZ",
    "git_user": "mkerrig",
    "git_pass": "XYZ",
    "default_docker_image": "pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",
    "instance_queue_list": [
        {
            "resource_name": "v100",
            "machine_type": "n1-highmem-4",
            "cpu_only": false,
            "gpu_type": "nvidia-tesla-v100",
            "gpu_count": 1,
            "preemptible": false,
            "regular_instance_rollback": false,
            "regular_instance_rollback_timeout": 10,
            "spot_instance_blackout_period": 0,
            "num_instances": 5,
            "queue_name": "Training-V100-16",
            "source_image": "projects/ml-images/global/images/c2-deeplearning-pytorch-1-13-cu113-v20230412-debian-10-py37",
            "disk_size_gb": 100,
            "service_account_email": "default"
        }
    ],
    "name": "autoscaler_pt1.7",
    "max_idle_time_min": 5,
    "workers_prefix": "dynamic_gcp",
    "polling_interval_time_min": "1",
    "exclude_bashrc": false,
    "custom_script": "sudo apt update\nsudo apt install apt-transport-https curl gnupg-agent ca-certificates software-properties-common -y\ncurl -fsSL

 | sudo apt-key add -\nsudo add-apt-repository \"deb [arch=amd64]

 focal stable\"\nsudo apt install docker-ce docker-ce-cli containerd.io -y\nnewgrp docker\nsudo usermod -aG docker $USER\nsudo /opt/deeplearning/install-driver.sh\npip install --upgrade pip",
    "extra_clearml_conf": null
}

  				
Posted 2 years ago by FloppySwallow46
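
One way to check the mismatch discussed in this thread is to compare the configured `num_instances` with the count the autoscaler prints in its `--- Cloud instances (N) ---` log header. The sketch below assumes the effective cap is the sum of `num_instances` over all entries in `instance_queue_list`; that is an interpretation of the config, not documented ClearML behaviour, and the helper names are hypothetical.

```python
import json
import re

# Matches the header line the autoscaler prints before listing instance IDs.
HEADER = re.compile(r"--- Cloud instances \((\d+)\) ---")


def configured_limit(config):
    """Sum num_instances over all resource entries (assumed to be the cap)."""
    return sum(int(entry["num_instances"]) for entry in config["instance_queue_list"])


def reported_counts(log_text):
    """Return every instance count reported in the log, in order of appearance."""
    return [int(match.group(1)) for match in HEADER.finditer(log_text)]


def counts_over_limit(config, log_text):
    """Return the reported counts that exceed the configured limit."""
    limit = configured_limit(config)
    return [count for count in reported_counts(log_text) if count > limit]


# Trimmed copy of the config and one log line from this thread.
config = json.loads(
    '{"instance_queue_list": [{"resource_name": "v100", "num_instances": 5}]}'
)
log_text = (
    "2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - "
    "--- Cloud instances (8) ---"
)
excess = counts_over_limit(config, log_text)
```

With the values posted here, the configured limit is 5 and the log reports 8, so `excess` contains that one over-limit count. Running the same check over the complete autoscaler log would show when the count first went above the limit.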

Strangely, I reset my tasks and queued them again, and it's all working.

  				
Posted 2 years ago by FloppySwallow46

Isn't it strange that one of them worked and the others failed? Also, when the config says 5 instances, why did it spin up 8 instances? Any idea about it?

  				
Posted 2 years ago by FloppySwallow46

Also, 4 of my tasks failed, but the 5th one runs completely fine.

  				
Posted 2 years ago by FloppySwallow46

I enqueued 5 tasks to this autoscaler; 4 of them failed, but the 5th one is working as expected. The enqueued tasks are clones of a completed task.

  				
Posted 2 years ago by FloppySwallow46

Hey, what's the rate limit?

  				
Posted 2 years ago by FloppySwallow46

Hi @<1570583227918192640:profile|FloppySwallow46> , can you please share the autoscaler configuration?

  				
Posted 2 years ago by CostlyOstrich36

Can you also please share logs of the autoscaler?

  				
Posted 2 years ago by CostlyOstrich36

I see lots of errors in the logs

  				
Posted 2 years ago by SuccessfulKoala55

Oh, sorry, GCP dashboard then 🙂

  				
Posted 2 years ago by SuccessfulKoala55

Hi @<1570583227918192640:profile|FloppySwallow46>. We've updated the rate limits. Can you please check if the issue is still occurring?

  				
Posted 2 years ago by CumbersomeCormorant74

You can get to the underlying task log by going to the autoscaler configuration in the UI, clicking the "mode details" link at the bottom, and downloading the console log from the task window that opens.

  				
Posted 2 years ago by SuccessfulKoala55

It spins up 8 instances even though my config says 5.

  				
Posted 2 years ago by FloppySwallow46

The reason they failed can probably be identified in their logs

  				
Posted 2 years ago by SuccessfulKoala55

(also, did you include the complete log?)

  				
Posted 2 years ago by SuccessfulKoala55

Answers 22