I got the same issue as well last night.
Another crash on the same autoscaler instance:
2022-11-04 15:53:54 2022-11-04 14:53:50,393 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2022-11-04 14:53:51,092 - clearml.Auto-Scaler - INFO - 2415066998557416558 console log: Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd[1]: var-lib-docker-overlay2-b04bca4c99cf94c31a3644236d70727aaa417fa4122e1b6c012e0ad908af24ef\x2dinit-merged.mount: Deactivated successfully.
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.357552] docker0: port 1(vetha6fafde) entered blocking state
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.357557] docker0: port 1(vetha6fafde) entered disabled state
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.357708] device vetha6fafde entered promiscuous mode
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: vetha6fafde: Link UP
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-udevd[7712]: Using default interface naming scheme 'v249'.
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-udevd[7714]: Using default interface naming scheme 'v249'.
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b networkd-dispatcher[618]: WARNING:Unknown index 6 seen, reloading interface list
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995721256Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995779219Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995788130Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b containerd[4030]: time="2022-11-04T14:53:29.995942417Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/38ba08db210d005bf36348dc4702ba54b068dd16054fb47248b8e3d8e34da95e pid=7737 runtime=io.containerd.runc.v2
[ OK ] Started libcontainer conta…8dd16054fb47248b8e3d8e34da95e.
Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd[1]: Started libcontainer container 38ba08db210d005bf36348dc4702ba54b068dd16054fb47248b8e3d8e34da95e.
Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.590266] eth0: renamed from veth79464f9
Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: vetha6fafde: Gained carrier
Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: docker0: Gained carrier
Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.610001] IPv6: ADDRCONF(NETDEV_CHANGE): vetha6fafde: link becomes ready
Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.610041] docker0: port 1(vetha6fafde) entered blocking state
Nov 4 14:53:30 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b kernel: [ 1245.610044] docker0: port 1(vetha6fafde) entered forwarding state
Nov 4 14:53:31 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd-networkd[470]: vetha6fafde: Gained IPv6LL
2022-11-04 15:53:59 2022-11-04 14:53:58,099 - clearml.Auto-Scaler - INFO - monitor_spots: 4423912289108270944 is alive
2022-11-04 14:53:58,309 - clearml.Auto-Scaler - INFO - monitor_spots: 2415066998557416558 is alive
2022-11-04 15:54:44 2022-11-04 14:54:43,724 - clearml.Auto-Scaler - INFO - Found 0 tasks in queue 'Quad_VCPU_16GB'
2022-11-04 14:54:43,901 - clearml.Auto-Scaler - INFO - --- Cloud instances (2) ---
2022-11-04 14:54:43,901 - clearml.Auto-Scaler - INFO - gcp4cpu, 2415066998557416558, regular
2022-11-04 14:54:43,901 - clearml.Auto-Scaler - INFO - gcp4cpu, 4423912289108270944, regular
2022-11-04 14:54:43,998 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2022-11-04 15:54:54 2022-11-04 14:54:50,428 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2022-11-04 14:54:51,165 - clearml.Auto-Scaler - WARNING - Can not get console logs from instance 4423912289108270944.Reason: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2635)
malloc(): unsorted double linked list corrupted
2022-11-04 15:54:54 Process failed, exit code -6
Hey CostlyOstrich36, I got another occurrence of the autoscaler crash with a similar backtrace. Any updates on this issue?
2022-11-04 11:46:55 2022-11-04 10:46:51,644 - clearml.Auto-Scaler - INFO - 5839398111025911016 console log: Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
[ OK ] Finished Cleanup of Temporary Directories.
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Finished Cleanup of Temporary Directories.
2022-11-04 11:47:46 2022-11-04 10:47:41,480 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2022-11-04 10:47:43,266 - clearml.Auto-Scaler - INFO - Found 0 tasks in queue 'Quad_VCPU_16GB'
2022-11-04 10:47:43,444 - clearml.Auto-Scaler - INFO - --- Cloud instances (2) ---
2022-11-04 10:47:43,444 - clearml.Auto-Scaler - INFO - gcp4cpu, 5839398111025911016, regular
2022-11-04 10:47:43,444 - clearml.Auto-Scaler - INFO - gcp4cpu, 6043599831443265530, regular
2022-11-04 10:47:43,645 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2022-11-04 10:47:45,831 - clearml.Auto-Scaler - INFO - monitor_spots: 6043599831443265530 is alive
2022-11-04 10:47:46,012 - clearml.Auto-Scaler - INFO - monitor_spots: 5839398111025911016 is alive
2022-11-04 11:47:56 2022-11-04 10:47:52,353 - clearml.Auto-Scaler - WARNING - Can not get console logs from instance 6043599831443265530.Reason: [SSL: BLOCK_CIPHER_PAD_IS_WRONG] block cipher pad is wrong (_ssl.c:2635)
malloc(): unsorted double linked list corrupted
2022-11-04 11:47:56 Process failed, exit code -6
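Side note for anyone comparing crash logs: in both 2022-11-04 dumps, the last thing logged before the `malloc()` abort is a "Can not get console logs" warning carrying an SSL error. A rough sketch for pulling that pair out of a saved log follows. The function name and the regex are my own, written against the log lines quoted in this thread, and are not part of ClearML.

```python
import re

# Abort message exactly as it appears in the logs quoted in this thread.
ABORT_MARKER = "malloc(): unsorted double linked list corrupted"

# Matches the autoscaler warning and captures the instance id and SSL error name.
SSL_WARNING = re.compile(
    r"Can not get console logs from instance (\d+)\.\s*Reason: \[SSL: (\w+)\]"
)


def find_ssl_before_abort(log_text):
    """Return (instance_id, ssl_error) for the last SSL warning before the abort.

    Returns None if the log has no abort marker or no SSL warning before it.
    """
    abort_at = log_text.find(ABORT_MARKER)
    if abort_at == -1:
        return None
    # Only search the part of the log that precedes the abort message.
    matches = list(SSL_WARNING.finditer(log_text, 0, abort_at))
    if not matches:
        return None
    last = matches[-1]
    return last.group(1), last.group(2)


sample = (
    "2022-11-04 10:47:52,353 - clearml.Auto-Scaler - WARNING - Can not get "
    "console logs from instance 6043599831443265530.Reason: "
    "[SSL: BLOCK_CIPHER_PAD_IS_WRONG] block cipher pad is wrong (_ssl.c:2635)\n"
    "malloc(): unsorted double linked list corrupted\n"
)
print(find_ssl_before_abort(sample))
```

Whether the SSL failure and the heap corruption share a cause is not established here; the sketch only makes the co-occurrence easy to check across more logs.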
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: (Reading database ... #015(Reading database ... 5%#015(Reading database ... 10%#015(Reading database ... 15%#015(Reading database ... 20%#015(Reading database ... 25%#015(Reading database ... 30%#015(Reading database ... 35%#015(Reading database ... 40%#015(Reading database ... 45%#015(Reading database ... 50%#015(Reading database ... 55%#015(Reading database ... 60%#015(Reading database ... 65%#015(Reading database ... 70%#015(Reading database ... 75%#015(Reading database ... 80%#015(Reading database ... 85%#015(Reading database ... 90%#015(Reading database ... 95%#015(Reading database ... 100%#015(Reading database ... 70853 files and directories currently installed.)
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../0-pigz_2.6-1_amd64.deb ...
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking pigz (2.6-1) ...
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Selecting previously unselected package contain 2022-10-24 14:13:04 erd.io.
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../1-containerd.io_1.6.8-1_amd64.deb ...
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking containerd.io (1.6.8-1) ...
Oct 24 12:12:53 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Selecting previously unselected package docker-ce-cli.
Oct 24 12:12:53 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../2-docker-ce-cli_5%3a20.10.20~3-0~ubuntu-jammy_amd64.deb ...
Oct 24 12:12:53 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking docker-ce-cli (5:20.10.20~3-0~ubuntu-jammy) ...
Oct 24 12:12:56 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Selecting previously unselected package docker-ce.
Oct 24 12:12:56 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Preparing to unpack .../3-docker-ce_5%3a20.10.20~3-0~ubuntu-jammy_amd64.deb ...
Oct 24 12:12:56 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: Unpacking docker-ce (5:20.10.20~3-0~ubuntu-jammy) ...
2022-10-24 14:13:21 2022-10-24 12:13:18,602 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2022-10-24 14:13:47 2022-10-24 12:13:45,873 - clearml.Auto-Scaler - INFO - Adding 'dynamic_clearml_cpu:gcp4cpu:c2-standard-4:9144964963922296814' to previous workers
2022-10-24 12:13:45,908 - clearml.Auto-Scaler - INFO - Found 0 tasks in queue 'Quad_VCPU_16GB'
2022-10-24 12:13:46,084 - clearml.Auto-Scaler - INFO - --- Cloud instances (2) ---
2022-10-24 12:13:46,085 - clearml.Auto-Scaler - INFO - gcp4cpu, 8203432257746845348, regular
2022-10-24 12:13:46,085 - clearml.Auto-Scaler - INFO - gcp4cpu, 9144964963922296814, regular
2022-10-24 12:13:46,188 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2022-10-24 14:13:52 2022-10-24 12:13:51,589 - clearml.Auto-Scaler - INFO - monitor_spots: 8203432257746845348 is alive
2022-10-24 12:13:51,760 - clearml.Auto-Scaler - INFO - monitor_spots: 9144964963922296814 is alive
2022-10-24 14:13:57 malloc(): unsorted double linked list corrupted
2022-10-24 14:13:57 Process failed, exit code -6
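For reference, a negative exit code such as the -6 above usually means the process was killed by a signal rather than exiting on its own. This assumes the runner reports exit codes the way Python's `subprocess` does, which is not confirmed in this thread. On Linux, signal 6 is SIGABRT, which glibc raises when `malloc()` detects heap corruption. A minimal sketch for decoding it, where `describe_exit_code` is a hypothetical helper and not part of ClearML:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code using the subprocess convention.

    Assumption: a negative value -N means "terminated by signal N",
    as with subprocess.Popen.returncode on POSIX systems.
    """
    if code < 0:
        # signal.Signals maps the platform's signal number to its name.
        return f"killed by signal {signal.Signals(-code).name}"
    return f"exited with status {code}"


print(describe_exit_code(-6))
```

On Linux this reports SIGABRT, which fits the `malloc(): unsorted double linked list corrupted` message printed just before the process failed.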
Can you please add a larger chunk of the autoscaler log?
And there were still 2 instances running from the last pipeline run
I was launching a pipeline run, but I don't remember having set the autoscaler to use spot instances (I believe the GCP terminology for spot instance is "preemptible" and I set it to false)
Do you know what the state of the experiments was at the time?
Thanks for the info! This happened when you had 2 spot instances running something, correct?
And the config is:{ "gcp_project_id": "XXXX", "gcp_zone": "europe-west1-b", "gcp_credentials": "XXXX", "git_user": "XXXX", "git_pass": "XXXXX", "default_docker_image": "XXXX", "instance_queue_list": [ { "resource_name": "gcp4cpu", "machine_type": "c2-standard-4", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "regular_instance_rollback": false, "regular_instance_rollback_timeout": 10, "spot_instance_blackout_period": 0, "num_instances": 4, "queue_name": "Quad_VCPU_16GB", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20220902", "disk_size_gb": 100, "service_account_email": "default" } ], "name": "GCP 4VCPU 16GB Autoscaler v9", "max_idle_time_min": 15, "workers_prefix": "dynamic_clearml_cpu", "polling_interval_time_min": "1", "exclude_bashrc": false, "custom_script": "#!/bin/sh\n\nsudo apt-get update -y\n\nsudo apt-get install -y \\\n ca-certificates \\\n curl \\\n gnupg \\\n lsb-release\n\nsudo mkdir -p /etc/apt/keyrings\n\ncurl -fsSL
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg\n\necho \\\n \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg]
\\\n $(lsb_release -cs) stable\" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null\n\nsudo apt-get update -y\n\nsudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin\n\nexport CLEARML_API_ACCESS_KEY=XXXX\nexport CLEARML_API_SECRET_KEY=XXXX\n\necho \"AWS_KEY_ID=XXXX\" > ./xhr_env.list\necho \"AWS_KEY_SECRET=XXXX\" >> ./xhr_env.list", "extra_clearml_conf": "sdk.aws.s3 = {\n region: \"aws-global\"\n key: \"XXXXX\"\n secret: \"XXXX\"\n use_credentials_chain: false\n extra_args: {}\n credentials: []\n}\nagent.package_manager.pip_version: \"<21\"\nagent.venvs_cache.path: ~/.clearml/venvs-cache" }
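To double-check the spot question against a config like the one above, the `preemptible` flag of each `instance_queue_list` entry can be read directly. The sketch below uses a trimmed excerpt with values copied from the posted config; `preemptible_resources` is a hypothetical helper, not a ClearML function.

```python
import json

# Trimmed excerpt of the posted config; secrets and the startup script are omitted.
CONFIG_EXCERPT = """
{
  "instance_queue_list": [
    {
      "resource_name": "gcp4cpu",
      "machine_type": "c2-standard-4",
      "cpu_only": true,
      "preemptible": false,
      "num_instances": 4,
      "queue_name": "Quad_VCPU_16GB"
    }
  ],
  "max_idle_time_min": 15,
  "polling_interval_time_min": "1"
}
"""


def preemptible_resources(config):
    """Return the resource names whose entry sets "preemptible" to true."""
    return [
        entry["resource_name"]
        for entry in config.get("instance_queue_list", [])
        if entry.get("preemptible")
    ]


config = json.loads(CONFIG_EXCERPT)
print(preemptible_resources(config))
```

For this config the list comes back empty, which matches the "regular" label on both instances in the logs. The `monitor_spots` lines still appear in those logs even though no resource is preemptible.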
This is an instance that I launched around last week and that was running fine until now. The version is v1.6.0-335
Hi FierceHamster54, is this an old autoscaler instance? What is the version? You can see the version in the application by clicking 'More' in the text area at the top left