
Reputation
Badges 1
25 × Eureka!it spins up 8 instances even though my config says 5
I am on GCP rn, not working with AWS
I remove credentials from the config for security purpose, and replaced it with XYZ
{
"gcp_project_id": "XYZ",
"gcp_zone": "us-central1-b",
"gcp_credentials": "XYZ",
"git_user": "mkerrig",
"git_pass": "XYZ",
"default_docker_image": "pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",
"instance_queue_list": [
{
"resource_name": "v100",
"machine_type": "n1-highmem-4",
"cpu_only": false,
"gpu_type": "nvidia-tes...
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - --- Cloud instances (8) ---
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1110204948425405426, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1364006518840029853, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4551653386764087872, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4704932875408438200, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 48755565930452715...
and also 4 of my task failed but the 5th one runs completely fine,
I enqueued 5 task to this auto scaler, 4 of them failed but the 5th one is working as expected..those enqued task are the clone of completed task
I am such a newbie, Sorry again.
Hey, whats the rate limit?
Sorry dear, needed urgent help.
same config no change.
isn't it strange that one of them working and other got failed? and also when the config says 5 instances why it spun up 8 instances? Any idea about it
I cloned one of my task which was successfully completed, and I am trying to run it through the autoscaler by enqueing it. No idea why its been crashing again and again,
it's odd to me as well, what's strange is that, the error was messaged was related to Nvidia and that wasn't the case as well. Anywho, its solved I am happy!!
bash: line 1: 1031 Aborted (core dumped) NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id b7af4abc3c174a32a5612c5c27a7b523
2023-05-24 17:10:00
Process failed, exit code 134
Any who I solved by commenting my code and created bunch of clone task and uncomment the code line by line, the issue was related to task.connect being called multiple times I guess.
I can see my process being crashed, I need to know why..if this is the clone of a completed task, then why it produces such error?
I used the same config 1 month ago, but its not working today, I have run 100s of experiment on the auto scaler, but its not working now, no idea why.
its GCP..I spin up a box through auto scaller
@<1571670393394040832:profile|LittleDolphin60>
strangely, I reset my task and qued them again, and its all working..
@<1571670393394040832:profile|LittleDolphin60>
ideally, how much should I wait, once I put my task in some que?