Hi FierceHamster54, is this an old autoscaler instance? What is the version? You can see the version in the application by clicking 'More' at the top left of the text area
This is an instance that I launched last week and it was running fine until now. The version is v1.6.0-335
And the config is:{ "gcp_project_id": "XXXX", "gcp_zone": "europe-west1-b", "gcp_credentials": "XXXX", "git_user": "XXXX", "git_pass": "XXXXX", "default_docker_image": "XXXX", "instance_queue_list": [ { "resource_name": "gcp4cpu", "machine_type": "c2-standard-4", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "regular_instance_rollback": false, "regular_instance_rollback_timeout": 10, "spot_instance_blackout_period": 0, "num_instances": 4, "queue_name": "Quad_VCPU_16GB", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20220902", "disk_size_gb": 100, "service_account_email": "default" } ], "name": "GCP 4VCPU 16GB Autoscaler v9", "max_idle_time_min": 15, "workers_prefix": "dynamic_clearml_cpu", "polling_interval_time_min": "1", "exclude_bashrc": false, "custom_script": "#!/bin/sh\n\nsudo apt-get update -y\n\nsudo apt-get install -y \\\n ca-certificates \\\n curl \\\n gnupg \\\n lsb-release\n\nsudo mkdir -p /etc/apt/keyrings\n\ncurl -fsSL
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg\n\necho \\\n \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg]
\\\n $(lsb_release -cs) stable\" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null\n\nsudo apt-get update -y\n\nsudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin\n\nexport CLEARML_API_ACCESS_KEY=XXXX\nexport CLEARML_API_SECRET_KEY=XXXX\n\necho \"AWS_KEY_ID=XXXX\" > ./xhr_env.list\necho \"AWS_KEY_SECRET=XXXX\" >> ./xhr_env.list", "extra_clearml_conf": "sdk.aws.s3 = {\n region: \"aws-global\"\n key: \"XXXXX\"\n secret: \"XXXX\"\n use_credentials_chain: false\n extra_args: {}\n credentials: []\n}\nagent.package_manager.pip_version: \"<21\"\nagent.venvs_cache.path: ~/.clearml/venvs-cache" }
Thanks for the info! This happened when you had 2 spot instances running something, correct?
Do you know what was the state of the experiments at the time?
I was launching a pipeline run, but I don't remember having set the autoscaler to use spot instances (I believe the GCP terminology for spot instance is "preemptible" and I set it to false)
And there were still 2 instances running from the last pipeline run
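For anyone who wants to double-check the spot/preemptible question against their own config, here is a minimal sketch. It assumes the autoscaler configuration is available as JSON shaped like the one posted above (`instance_queue_list` entries carrying `resource_name` and `preemptible`). The helper name `preemptible_resources` is made up for illustration and is not part of ClearML.

```python
import json
from typing import List


def preemptible_resources(config: dict) -> List[str]:
    """Return the resource names whose queue entry has preemptible set to true."""
    return [
        entry.get("resource_name", "<unnamed>")
        for entry in config.get("instance_queue_list", [])
        if entry.get("preemptible")
    ]


# Trimmed copy of the config posted above; only the fields the check needs.
config = json.loads("""
{
  "instance_queue_list": [
    {
      "resource_name": "gcp4cpu",
      "machine_type": "c2-standard-4",
      "preemptible": false,
      "num_instances": 4,
      "queue_name": "Quad_VCPU_16GB"
    }
  ]
}
""")

# An empty list means no queue entry requests preemptible instances.
print(preemptible_resources(config))
```

With the posted config this reports no preemptible resources, which matches the autoscaler log listing both instances as `regular`.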
Can you please add a larger chunk of the autoscaler log?
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: (Reading database ... #015(Reading database ... 5%#015(Reading database ... 10%#015(Reading database ... 15%#015(Reading database ... 20%#015(Reading database ... 25%#015(Reading database ... 30%#015(Reading database ... 35%#015(Reading database ... 40%#015(Reading database ... 45%#015(Reading database ... 50%#015(Reading database ... 55%#015(Reading database ... 60%#015(Reading database ... 65%#015(Reading database ... 70%#015(Reading database ... 75%#015(Reading database ... 80%#015(Reading database ... 85%#015(Reading database ... 90%#015(Reading database ... 95%#015(Reading database ... 100%#015(Reading database ... 70853 files and directories currently installed.) Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../0-pigz_2.6-1_amd64.deb ... Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking pigz (2.6-1) ... Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Selecting previously unselected package contain 2022-10-24 14:13:04 erd.io. Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../1-containerd.io_1.6.8-1_amd64.deb ... Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking containerd.io (1.6.8-1) ... Oct 24 12:12:53 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Selecting previously unselected package docker-ce-cli. 
Oct 24 12:12:53 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../2-docker-ce-cli_5%3a20.10.20~3-0~ubuntu-jammy_amd64.deb ... Oct 24 12:12:53 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking docker-ce-cli (5:20.10.20~3-0~ubuntu-jammy) ... Oct 24 12:12:56 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Selecting previously unselected package docker-ce. Oct 24 12:12:56 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../3-docker-ce_5%3a20.10.20~3-0~ubuntu-jammy_amd64.deb ... Oct 24 12:12:56 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking docker-ce (5:20.10.20~3-0~ubuntu-jammy) ... 2022-10-24 14:13:21 2022-10-24 12:13:18,602 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units 2022-10-24 14:13:47 2022-10-24 12:13:45,873 - clearml.Auto-Scaler - INFO - Adding 'dynamic_clearml_cpu:gcp4cpu:c2-standard-4:9144964963922296814' to previous workers 2022-10-24 12:13:45,908 - clearml.Auto-Scaler - INFO - Found 0 tasks in queue 'Quad_VCPU_16GB' 2022-10-24 12:13:46,084 - clearml.Auto-Scaler - INFO - --- Cloud instances (2) --- 2022-10-24 12:13:46,085 - clearml.Auto-Scaler - INFO - gcp4cpu, 8203432257746845348, regular 2022-10-24 12:13:46,085 - clearml.Auto-Scaler - INFO - gcp4cpu, 9144964963922296814, regular 2022-10-24 12:13:46,188 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds 2022-10-24 14:13:52 2022-10-24 12:13:51,589 - clearml.Auto-Scaler - INFO - monitor_spots: 8203432257746845348 is alive 2022-10-24 12:13:51,760 - clearml.Auto-Scaler - INFO - monitor_spots: 9144964963922296814 is alive 2022-10-24 14:13:57 malloc(): unsorted double linked list corrupted 2022-10-24 14:13:57 Process failed, exit code -6
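A note on reading the last line of that log: assuming `exit code -6` follows the usual Python `subprocess` convention (a negative return code means the process was killed by that signal number), it can be decoded with the standard `signal` module. The helper name `describe_exit_code` is only for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable description."""
    if code < 0:
        # Negative codes are interpreted as "terminated by signal -code".
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"unknown signal {-code}"
        return f"terminated by {name}"
    return f"exited with status {code}"


# On Linux, signal 6 is SIGABRT.
print(describe_exit_code(-6))
```

On Linux this resolves to SIGABRT, which is what glibc raises through `abort()` when `malloc()` detects heap corruption, so the exit code and the `malloc(): unsorted double linked list corrupted` message are consistent with each other.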
Hey CostlyOstrich36, I got another occurrence of the autoscaler crash with a similar backtrace. Any updates on this issue? 2022-11-04 11:46:55 2022-11-04 10:46:51,644 - clearml.Auto-Scaler - INFO - 5839398111025911016 console log: Starting Cleanup of Temporary Directories... Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Starting Cleanup of Temporary Directories... Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully. [ OK ] Finished Cleanup of Temporary Directories. Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Finished Cleanup of Temporary Directories. 2022-11-04 11:47:46 2022-11-04 10:47:41,480 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units 2022-11-04 10:47:43,266 - clearml.Auto-Scaler - INFO - Found 0 tasks in queue 'Quad_VCPU_16GB' 2022-11-04 10:47:43,444 - clearml.Auto-Scaler - INFO - --- Cloud instances (2) --- 2022-11-04 10:47:43,444 - clearml.Auto-Scaler - INFO - gcp4cpu, 5839398111025911016, regular 2022-11-04 10:47:43,444 - clearml.Auto-Scaler - INFO - gcp4cpu, 6043599831443265530, regular 2022-11-04 10:47:43,645 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds 2022-11-04 10:47:45,831 - clearml.Auto-Scaler - INFO - monitor_spots: 6043599831443265530 is alive 2022-11-04 10:47:46,012 - clearml.Auto-Scaler - INFO - monitor_spots: 5839398111025911016 is alive 2022-11-04 11:47:56 2022-11-04 10:47:52,353 - clearml.Auto-Scaler - WARNING - Can not get console logs from instance 6043599831443265530.Reason: [SSL: BLOCK_CIPHER_PAD_IS_WRONG] block cipher pad is wrong (_ssl.c:2635) malloc(): unsorted double linked list corrupted 2022-11-04 11:47:56 Process failed, exit code -6
Another crash on the same autoscaler instance: 2022-11-04 15:53:54 2022-11-04 14:53:50,393 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units 2022-11-04 14:53:51,092 - clearml.Auto-Scaler - INFO - 2415066998557416558 console log: Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd[1]: var-lib-docker-overlay2-b04bca4c99cf94c31a3644236d70727aaa417fa4122e1b6c012e0ad908af24ef\x2dinit-merged.mount: Deactivated successfully. Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.357552] docker0: port 1(vetha6fafde) entered blocking state Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.357557] docker0: port 1(vetha6fafde) entered disabled state Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.357708] device vetha6fafde entered promiscuous mode Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: vetha6fafde: Link UP Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-udevd[7712]: Using default interface naming scheme 'v249'. Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-udevd[7714]: Using default interface naming scheme 'v249'. Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b networkd-dispatcher[618]: WARNING:Unknown index 6 seen, reloading interface list Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995721256Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1 Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995779219Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." 
runtime=io.containerd.runc.v2 type=io.containerd.internal.v1 Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995788130Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1 Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995942417Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/38ba08db210d005bf36348dc4702ba54b068dd16054fb47248b8e3d8e34da95e pid=7737 runtime=io.containerd.runc.v2 [ OK ] Started libcontainer conta…8dd16054fb47248b8e3d8e34da95e. Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd[1]: Started libcontainer container 38ba08db210d005bf36348dc4702ba54b068dd16054fb47248b8e3d8e34da95e. Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.590266] eth0: renamed from veth79464f9 Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: vetha6fafde: Gained carrier Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: docker0: Gained carrier Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.610001] IPv6: ADDRCONF(NETDEV_CHANGE): vetha6fafde: link becomes ready Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.610041] docker0: port 1(vetha6fafde) entered blocking state Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.610044] docker0: port 1(vetha6fafde) entered forwarding state Nov 4 14:53:31 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: vetha6fafde: Gained IPv6LL 2022-11-04 15:53:59 2022-11-04 14:53:58,099 - clearml.Auto-Scaler - INFO - monitor_spots: 4423912289108270944 is alive 2022-11-04 14:53:58,309 - clearml.Auto-Scaler - INFO - monitor_spots: 2415066998557416558 is alive 2022-11-04 15:54:44 
2022-11-04 14:54:43,724 - clearml.Auto-Scaler - INFO - Found 0 tasks in queue 'Quad_VCPU_16GB' 2022-11-04 14:54:43,901 - clearml.Auto-Scaler - INFO - --- Cloud instances (2) --- 2022-11-04 14:54:43,901 - clearml.Auto-Scaler - INFO - gcp4cpu, 2415066998557416558, regular 2022-11-04 14:54:43,901 - clearml.Auto-Scaler - INFO - gcp4cpu, 4423912289108270944, regular 2022-11-04 14:54:43,998 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds 2022-11-04 15:54:54 2022-11-04 14:54:50,428 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units 2022-11-04 14:54:51,165 - clearml.Auto-Scaler - WARNING - Can not get console logs from instance 4423912289108270944.Reason: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2635) malloc(): unsorted double linked list corrupted 2022-11-04 15:54:54 Process failed, exit code -6
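Two of the three crash logs above show an SSL warning from the console-log fetch immediately before the `malloc()` abort. To check whether other crashes follow the same pattern, a rough sketch like the following could scan a saved autoscaler log. The regex is based only on the warning text as it appears in this thread, the function name and the `window` size are arbitrary choices for illustration, and a match is a correlation to report rather than a diagnosis.

```python
import re
from typing import List, Tuple

# Warning format copied from the logs in this thread (note: no space before "Reason").
WARNING_RE = re.compile(
    r"Can not get console logs from instance (\d+)\.\s*Reason: \[SSL: ([A-Z_]+)\]"
)
CRASH_MARKER = "malloc(): unsorted double linked list corrupted"


def ssl_errors_before_crash(log_text: str, window: int = 400) -> List[Tuple[str, str]]:
    """Return (instance_id, ssl_error) pairs found shortly before each crash marker."""
    findings = []
    for crash in re.finditer(re.escape(CRASH_MARKER), log_text):
        # Only look at the last `window` characters preceding the crash message.
        start = max(0, crash.start() - window)
        findings.extend(WARNING_RE.findall(log_text[start:crash.start()]))
    return findings


sample = (
    "2022-11-04 14:54:51,165 - clearml.Auto-Scaler - WARNING - Can not get console "
    "logs from instance 4423912289108270944.Reason: [SSL: "
    "DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac "
    "(_ssl.c:2635) malloc(): unsorted double linked list corrupted "
    "2022-11-04 15:54:54 Process failed, exit code -6"
)
print(ssl_errors_before_crash(sample))
```

The first crash (Oct 24) has no such warning in the pasted excerpt, so an empty result for a given crash would not rule out the same underlying cause.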
I got the same issue as well last night.