Sorry dear, needed urgent help.
I am such a newbie, sorry again.
Anyhow, I solved it by commenting out my code, creating a bunch of cloned tasks, and uncommenting the code line by line; the issue was related to task.connect being called multiple times, I guess.
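For reference, a minimal sketch of the pattern that avoided the crash for me, assuming a standard ClearML SDK setup (project, task, and parameter names below are just placeholders):

from clearml import Task

# Initialize the task once at the top of the script.
task = Task.init(project_name="examples", task_name="autoscaler-debug")

# Gather all hyperparameters into one dict and connect it a single time,
# instead of calling task.connect() repeatedly throughout the code.
config = {"learning_rate": 1e-3, "batch_size": 32, "epochs": 10}
task.connect(config)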
bash: line 1: 1031 Aborted (core dumped) NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id b7af4abc3c174a32a5612c5c27a7b523
2023-05-24 17:10:00
Process failed, exit code 134
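Side note on the exit code: by shell convention 134 is 128 plus signal 6 (SIGABRT), which matches the "Aborted (core dumped)" line above. A quick sanity check:

import signal

# Shell convention: exit code = 128 + signal number when a process is
# terminated by a signal. SIGABRT is 6, so 128 + 6 = 134.
print(128 + signal.SIGABRT)  # prints 134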
Ideally, how long should I wait once I put my task in a queue?
I cloned one of my tasks that completed successfully, and I am trying to run it through the autoscaler by enqueuing it. No idea why it keeps crashing again and again.
I used the same config a month ago, but it's not working today. I have run hundreds of experiments on the autoscaler, but it's not working now, no idea why.
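For anyone following along, the clone-and-enqueue step can also be done from the SDK; a minimal sketch (the task id and queue name below are placeholders):

from clearml import Task

# Clone a previously completed task and put the clone on the queue the
# autoscaler is watching. Task id and queue name are placeholders.
source = Task.get_task(task_id="<completed_task_id>")
cloned = Task.clone(source_task=source, name="clone of completed run")
Task.enqueue(cloned, queue_name="<autoscaler_queue>")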
It's odd to me as well. What's strange is that the error message was related to Nvidia, and that wasn't actually the cause. Anyhow, it's solved, I am happy!!
@<1571670393394040832:profile|LittleDolphin60>
And also, 4 of my tasks failed but the 5th one runs completely fine.
it spins up 8 instances even though my config says 5
I enqueued 5 tasks to this autoscaler; 4 of them failed but the 5th one is working as expected. Those enqueued tasks are clones of a completed task.
I am on GCP right now, not working with AWS.
Hey, what's the rate limit?
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - --- Cloud instances (8) ---
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1110204948425405426, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 1364006518840029853, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4551653386764087872, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 4704932875408438200, regular
2023-05-22 11:14:25,238 - clearml.Auto-Scaler - INFO - 48755565930452715...
Strangely, I reset my tasks and queued them again, and it's all working...
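In case it helps anyone else, the reset-and-requeue step can be done from the SDK as well; a minimal sketch (task ids and queue name are placeholders):

from clearml import Task

# Reset each failed task and put it back on the autoscaler's queue.
# Task ids and queue name are placeholders.
for task_id in ["<failed_task_id_1>", "<failed_task_id_2>"]:
    t = Task.get_task(task_id=task_id)
    t.reset()  # clears the previous run state so the task can be re-executed
    Task.enqueue(t, queue_name="<autoscaler_queue>")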
I removed the credentials from the config for security purposes and replaced them with XYZ.
{
  "gcp_project_id": "XYZ",
  "gcp_zone": "us-central1-b",
  "gcp_credentials": "XYZ",
  "git_user": "mkerrig",
  "git_pass": "XYZ",
  "default_docker_image": "pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",
  "instance_queue_list": [
    {
      "resource_name": "v100",
      "machine_type": "n1-highmem-4",
      "cpu_only": false,
      "gpu_type": "nvidia-tes...
Isn't it strange that one of them works and the others failed? And also, when the config says 5 instances, why did it spin up 8? Any idea about it?
Same config, no change.
@<1571670393394040832:profile|LittleDolphin60>
It's GCP. I spin up a box through the autoscaler.
I can see my process crashing, and I need to know why. If this is a clone of a completed task, then why does it produce such an error?