Anyhow, I solved it by commenting out my code, creating a bunch of clone tasks, and uncommenting the code line by line. The issue was related to task.connect being called multiple times, I guess.
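Roughly what I ended up with, as a minimal sketch (the project name, task name, and parameter dict here are placeholders for illustration, not my actual script):

from clearml import Task

# Placeholder project/task names, for illustration only
task = Task.init(project_name="examples", task_name="autoscaler-debug")

# Placeholder hyperparameters
params = {"lr": 0.001, "batch_size": 32}

# Connect the configuration once; previously task.connect was being called
# multiple times on overlapping objects, which is what seemed to break things
task.connect(params)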
I removed the credentials from the config for security purposes and replaced them with XYZ.
{
    "gcp_project_id": "XYZ",
    "gcp_zone": "us-central1-b",
    "gcp_credentials": "XYZ",
    "git_user": "mkerrig",
    "git_pass": "XYZ",
    "default_docker_image": "pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",
    "instance_queue_list": [
        {
            "resource_name": "v100",
            "machine_type": "n1-highmem-4",
            "cpu_only": false,
            "gpu_type": "nvidia-tes...
I enqueued 5 tasks to this autoscaler; 4 of them failed, but the 5th one is working as expected. Those enqueued tasks are clones of a completed task.
Same config, no change.
Sorry, I needed urgent help.
It spins up 8 instances even though my config says 5.
Isn't it strange that one of them is working while the others failed? And when the config says 5 instances, why did it spin up 8? Any idea about it?
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - --- Cloud instances (8) ---
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1110204948425405426, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1364006518840029853, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4551653386764087872, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4704932875408438200, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 48755565930452715...
And also, 4 of my tasks failed but the 5th one runs completely fine.
@<1571670393394040832:profile|LittleDolphin60>
I am such a newbie, sorry again.
It's GCP. I spin up a box through the autoscaler.
Strangely, I reset my tasks and queued them again, and it's all working.
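For the record, this is roughly what "reset and queue again" looked like from my side, sketched with placeholder values (the task ID isn't my real one, and the queue name is only assumed to match the v100 resource in my config):

from clearml import Task

# Placeholder task ID, not the real one
task = Task.get_task(task_id="<failed-task-id>")

# Put the task back into draft state (force=True may be needed depending on its status)
task.reset()

# Re-enqueue it; queue name assumed to match the autoscaler resource
Task.enqueue(task=task, queue_name="v100")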
@<1571670393394040832:profile|LittleDolphin60>
I used the same config a month ago, but it's not working today. I have run hundreds of experiments on the autoscaler, but it's not working now, no idea why.
Ideally, how long should I wait once I put my task in a queue?
I cloned one of my tasks which had completed successfully, and I am trying to run it through the autoscaler by enqueuing it. No idea why it's been crashing again and again.
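For reference, the clone-and-enqueue step itself is just this, sketched with a placeholder task ID and an assumed queue name:

from clearml import Task

# Placeholder ID of the task that completed successfully
completed = Task.get_task(task_id="<completed-task-id>")

# Clone it into a new draft task and push it to the autoscaler's queue
cloned = Task.clone(source_task=completed, name=completed.name + " (rerun)")
Task.enqueue(task=cloned, queue_name="v100")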
bash: line 1: 1031 Aborted (core dumped) NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id b7af4abc3c174a32a5612c5c27a7b523
2023-05-24 17:10:00
Process failed, exit code 134
It's odd to me as well. What's strange is that the error message was related to NVIDIA, and that wasn't the actual cause either. Anyhow, it's solved, I am happy!!
I can see my process crashing, and I need to know why. If this is the clone of a completed task, then why does it produce such an error?
Hey, what's the rate limit?
I am on GCP right now, not working with AWS.