Error: Can not start new instance, Could not connect to the endpoint URL: " "
Yep, figured this out yesterday. I had been tagging G-type instances with an alarm as a fail-safe in case the AWS autoscaler failed. The alarm only stopped the instance and didn't terminate it (which deletes the drive). Thanks anyway CostlyOstrich36 and TimelyPenguin76 🙂
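For anyone setting up the same kind of fail-safe, a minimal boto3 sketch of the alarm with the built-in terminate action instead of stop (region, instance id, alarm name and thresholds are all placeholders, not my actual setup):
`
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')

cloudwatch.put_metric_alarm(
    AlarmName='gpu-instance-failsafe',            # placeholder name
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],  # placeholder id
    Statistic='Average',
    Period=300,
    EvaluationPeriods=12,                         # e.g. idle for an hour
    Threshold=1.0,
    ComparisonOperator='LessThanThreshold',
    # Built-in EC2 alarm action that terminates the instance (and so releases
    # the drive) rather than just stopping it.
    AlarmActions=['arn:aws:automate:eu-west-2:ec2:terminate'],
)
`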
Yes, on the apps page. Is it possible to trigger it programmatically?
remote execution is working now. Internal worker nodes had not spun up the agent correctly 😛
In short, we clone the repo, build the docker container, and run the agent in the container. The reason we do it this way, rather than provide a docker image to the clearml-agent, is twofold:
We actively develop our custom networks and architectures within a containerised env to make it easy for engineers to have a quick dev cycle for new models (the same repo is cloned and we build the docker container to work inside). We use the same repo to serve models on our backend (in a slightly different contain...
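Roughly, that flow looks like the sketch below. The repo URL, image tag and queue name are placeholders, and it assumes the built image already has clearml-agent installed:
`
import subprocess

# Clone the same repo the engineers develop in (URL is a placeholder)
subprocess.run(['git', 'clone', 'git@github.com:our-org/models.git', 'models'], check=True)

# Build the containerised training environment from the repo's Dockerfile
subprocess.run(['docker', 'build', '-t', 'models:train', 'models'], check=True)

# Run a clearml-agent inside the freshly built container so tasks execute
# in exactly the same environment the engineers develop in
subprocess.run([
    'docker', 'run', '--rm', '--gpus', 'all', 'models:train',
    'clearml-agent', 'daemon', '--queue', 'default', '--foreground',
], check=True)
`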
For reference (note: shell=True was dropped here because the command is passed as a list; with shell=True only the first element would reach the shell):
`
import subprocess

# Launch each test number as its own subprocess
for i in ['1', '2']:
    command = ['python', 'hyp_op.py', '--testnum', f'{i}']
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
`
` python upload_data_to_clearml_copy.py
Generating SHA2 hash for 1 files
100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 733.91it/s]
Hash generation completed
0%| | 0/1 [00:00<?, ?it/s]
Compressing local files, chunk 1 [remaining 1 files]
100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 538.77it/s]
File compression completed: t...
The latest commit to the repo pins 22.02-py3 ( https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/clearml_serving/engines/triton/Dockerfile#L2 ). I will have a look at versions now 🙂
Just for reference if anyone hits this issue: I had to update my CUDA drivers to 510 on the host OS.
` docker run --gpus=0 -it nvcr.io/nvidia/tritonserver:22.02-py3
=============================
== Triton Inference Server ==
NVIDIA Release 22.02 (build 32400308)
Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are gove...
This was the response from AWS:
"Thank you for for sharing the requested details with us. As we discussed, I'd like to share that our internal service team is currently unable to support any G type vCPU increase request for limit increase.
The issue is we are currently facing capacity scarcity to accommodate P and G instances. Our engineers are working towards fixing this issue. However, until then, we are unable to expand the capacity and process limit increase."
g4dn.xlarge (the best price for 16GB of GPU RAM). Not so surprising they would want a switch.
Nope, AWS aren't approving the increased vCPU request. I've explained the use case several times and they still haven't approved it.
I'll add a more detailed response once it's working
`
import os
import glob
from clearml import Dataset
DATASET_NAME = "Bug"
DATASET_PROJECT = "ProjectFolder"
TARGET_FOLDER = "clearml_bug"
S3_BUCKET = os.getenv('S3_BUCKET')
if not os.path.exists(TARGET_FOLDER):
    os.makedirs(TARGET_FOLDER)
with open(f'{TARGET_FOLDER}/data.txt', 'w') as f:
    f.writelines('Hello, ClearML')
target_files = glob.glob(TARGET_FOLDER + "/**/*", recursive=True)
# upload dataset
dataset = Dataset.create(dataset_name=DATASET_NAME, dataset_project=DATASET_PR...
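For context, the usual continuation of this pattern with the ClearML Dataset API looks roughly like the sketch below (not necessarily the exact rest of the script above; it reuses the constants defined there):
`
# Create the dataset, register the local files, push to S3 and finalize
dataset = Dataset.create(dataset_name=DATASET_NAME, dataset_project=DATASET_PROJECT)
dataset.add_files(path=TARGET_FOLDER)
dataset.upload(output_url=S3_BUCKET)   # e.g. "s3://my-bucket/path"
dataset.finalize()
`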
thank you guys 😄 😄
I'm using the "allegroai/clearml-serving-triton:latest" container. I was just debugging using the base image.
From SuccessfulKoala55's suggestion
Okay thanks for the update 🙂 the account manager got involved and the limit has been approved 🚀
AgitatedDove14 is anyone working on a GCP or Azure autoscaler at the moment?
Trying to retrieve logs now 🙂 Yes I mean the machines are not accessible. Trying to figure what's going on
I've got it... I just remembered I can call task_id from the cloned task and check the status of that 🙂
so I guess I just check whether the status has changed from running to completed
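Something roughly like this works (a sketch; the task id is a placeholder for whatever the clone step returned, and the polling interval is arbitrary):
`
import time
from clearml import Task

cloned_task_id = '<cloned task id>'      # placeholder: id returned by the clone step

# Look up the cloned task by id and poll until it leaves the running state
cloned = Task.get_task(task_id=cloned_task_id)
while True:
    cloned.reload()                      # refresh the task data from the server
    status = cloned.get_status()
    if status in ('completed', 'failed', 'stopped'):
        break
    time.sleep(30)
print('cloned task finished with status:', status)
`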
Hi yes all sorted ! 🙂
I made the request 2x in eu-west-2 on the AWS console but still no luck.
We normally do something like that - not sure why it's freezing for you without more info.
This was the error I was getting from uploads using the old SDK: has been rejected for invalid domain. heap-2443312637.js:2:108655 Referrer Policy: Ignoring the less restricted referrer policy "no-referrer-when-downgrade" for the cross-site request: