
bash: line 1: 1031 Aborted (core dumped) NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id b7af4abc3c174a32a5612c5c27a7b523
2023-05-24 17:10:00
Process failed, exit code 134
I removed the credentials from the config for security purposes and replaced them with XYZ:
{
    "gcp_project_id": "XYZ",
    "gcp_zone": "us-central1-b",
    "gcp_credentials": "XYZ",
    "git_user": "mkerrig",
    "git_pass": "XYZ",
    "default_docker_image": "pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",
    "instance_queue_list": [
        {
            "resource_name": "v100",
            "machine_type": "n1-highmem-4",
            "cpu_only": false,
            "gpu_type": "nvidia-tes...
I enqueued 5 tasks to this autoscaler; 4 of them failed but the 5th one is working as expected. Those enqueued tasks are clones of a completed task.
I used the same config a month ago, but it's not working today. I have run hundreds of experiments on the autoscaler, but it's not working now, no idea why.
And also, 4 of my tasks failed but the 5th one runs completely fine.
Anyway, I solved it by commenting out my code, creating a bunch of clone tasks, and uncommenting the code line by line. The issue was related to task.connect being called multiple times, I guess.
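For reference, a minimal sketch of the "connect once" pattern with the ClearML SDK: the hyperparameter dict is connected to the task a single time and every later read goes through the returned dict, rather than calling task.connect repeatedly (which is what the message above suspects was causing conflicts in cloned tasks). The project and task names here are placeholders, not from the thread.

```python
"""Hedged sketch: connect a parameter dict to a ClearML task exactly once.

Assumes the ClearML SDK is available; `project_name`/`task_name` are
illustrative placeholders. The import is guarded so the sketch still
loads as a plain script when clearml is not installed.
"""
params = {"batch_size": 32, "learning_rate": 1e-3}

try:
    from clearml import Task

    task = Task.init(project_name="autoscaler-demo", task_name="connect-once")
    # Single connect call; any UI/clone overrides are applied to the
    # returned dict, so downstream code reads only from `params`.
    params = task.connect(params)
except ImportError:
    # clearml not installed in this sketch; keep the local defaults
    pass

print(params)
```

The point of the design is that repeated `task.connect` calls on the same keys give the server multiple chances to reconcile values, which is exactly the situation cloned tasks seem to have tripped over here.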
Strangely, I reset my tasks and queued them again, and it's all working.
It spins up 8 instances even though my config says 5.
Isn't it strange that one of them is working and the others failed? And also, when the config says 5 instances, why did it spin up 8? Any idea about it?
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - --- Cloud instances (8) ---
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1110204948425405426, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1364006518840029853, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4551653386764087872, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4704932875408438200, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 48755565930452715...
It's odd to me as well. What's strange is that the error message was related to NVIDIA, and that wasn't actually the case either. Anyway, it's solved, I am happy!!
I cloned one of my tasks which completed successfully, and I am trying to run it through the autoscaler by enqueuing it. No idea why it's been crashing again and again.
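The clone-and-enqueue flow described above can be sketched with the ClearML SDK roughly as follows. The task ID and queue name are placeholders, and the import sits inside the function so the sketch loads without clearml installed.

```python
"""Hedged sketch: clone a completed ClearML task and enqueue the clone.

Assumes the ClearML SDK and a configured server; the ID and queue name
passed at the bottom are illustrative placeholders only.
"""

def clone_and_enqueue(source_task_id, queue_name):
    # Imported here so the sketch can be loaded without clearml installed.
    from clearml import Task

    source = Task.get_task(task_id=source_task_id)
    # Clone copies the task definition (code ref, params, requirements)
    # into a new draft task without its outputs.
    cloned = Task.clone(source_task=source, name=source.name + " (clone)")
    # Enqueuing hands the draft to whatever agent/autoscaler serves the queue.
    Task.enqueue(cloned, queue_name=queue_name)
    return cloned.id

# Usage (requires a configured ClearML server and an existing task):
# clone_and_enqueue("<completed-task-id>", "v100")
```

Since a clone is only the task definition, a clone of a completed task can still crash at runtime if the environment the agent builds (docker image, drivers, installed packages) differs from the one the original ran in.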
Sorry dear, I needed urgent help.
I can see my process crashing, and I need to know why. If this is the clone of a completed task, then why does it produce such an error?
I am such a newbie, sorry again.
Hey, what's the rate limit?
I am on GCP right now, not working with AWS.
Same config, no change.
@<1571670393394040832:profile|LittleDolphin60>
Ideally, how long should I wait once I put my task in a queue?
It's GCP. I spin up a box through the autoscaler.
@<1571670393394040832:profile|LittleDolphin60>