
Hey, what's the rate limit?
It's odd to me as well. What's strange is that the error message was related to Nvidia, and that wasn't the case either. Anyhow, it's solved, I'm happy!!
I am such a newbie, sorry again.
And also, 4 of my tasks failed but the 5th one runs completely fine.
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - --- Cloud instances (8) ---
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1110204948425405426, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1364006518840029853, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4551653386764087872, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4704932875408438200, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 48755565930452715...
bash: line 1: 1031 Aborted (core dumped) NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id b7af4abc3c174a32a5612c5c27a7b523
2023-05-24 17:10:00
Process failed, exit code 134
It spins up 8 instances even though my config says 5.
I removed the credentials from the config for security purposes and replaced them with XYZ:
{
    "gcp_project_id": "XYZ",
    "gcp_zone": "us-central1-b",
    "gcp_credentials": "XYZ",
    "git_user": "mkerrig",
    "git_pass": "XYZ",
    "default_docker_image": "pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",
    "instance_queue_list": [
        {
            "resource_name": "v100",
            "machine_type": "n1-highmem-4",
            "cpu_only": false,
            "gpu_type": "nvidia-tes...
Same config, no change.
I cloned one of my tasks which had completed successfully, and I am trying to run it through the autoscaler by enqueuing it. No idea why it's been crashing again and again.
It's GCP. I spin up a box through the autoscaler.
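For context, this is roughly what I'm doing to clone and enqueue the task (a minimal sketch using the standard clearml SDK; the task ID and queue name are just placeholders, not my real values):

from clearml import Task

# Fetch the previously completed task (placeholder ID), clone it,
# and enqueue the clone on the queue the autoscaler listens to (placeholder name).
completed = Task.get_task(task_id="<completed-task-id>")
cloned = Task.clone(source_task=completed, name="clone of completed task")
Task.enqueue(cloned, queue_name="<autoscaler-queue>")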
@<1571670393394040832:profile|LittleDolphin60>
I used the same config a month ago, but it's not working today. I have run hundreds of experiments on the autoscaler, but it's not working now, no idea why.
I am on GCP rn, not working with AWS
I can see my process crashing, and I need to know why. If this is a clone of a completed task, why does it produce such an error?
I enqueued 5 tasks to this autoscaler; 4 of them failed but the 5th one is working as expected. Those enqueued tasks are clones of completed tasks.
Anyhow, I solved it by commenting out my code, creating a bunch of cloned tasks, and uncommenting the code line by line. The issue was related to task.connect being called multiple times, I guess.
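For anyone hitting the same thing, the pattern that works for me is roughly this (a minimal sketch; the project/task names and the config dict are just examples): connect the configuration once, on a single dict, instead of calling task.connect repeatedly throughout the code.

from clearml import Task

# Initialize the task once (names here are just examples).
task = Task.init(project_name="examples", task_name="autoscaler-test")

# Connect the configuration a single time, on one dict,
# rather than calling task.connect() multiple times across the code.
config = {"lr": 0.001, "batch_size": 32}
config = task.connect(config)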
Ideally, how long should I wait once I put my task in some queue?
Isn't it strange that one of them is working and the others failed? And also, when the config says 5 instances, why did it spin up 8? Any idea about it?
Strangely, I reset my tasks and queued them again, and it's all working now.
Sorry dear, I needed urgent help.
@<1571670393394040832:profile|LittleDolphin60>