
I can raise this as an issue on the repo if that is useful?
lmk if I can expand on this more 🙂
`
2021-10-19 14:19:07
Spinning new instance type=aws4gpu
Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]
Spinning new instance type=aws4gpu
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]
S...
Still debugging.... That fixed the issue with the
nvcr.io/nvidia/tritonserver:22.02-py3
container which now returns
` =============================
== Triton Inference Server ==
NVIDIA Release 22.02 (build 32400308)
Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Co...
I make 2x in eu-west-2 on the AWS console but still no luck
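For context, that InvalidParameterValue usually means the value passed as the availability zone is actually a region name: eu-west-2 is the region, while RunInstances expects a zone such as eu-west-2a. A minimal boto3 sketch of the distinction (the AMI id and instance type here are placeholders):
`
import boto3

# the region goes on the client, a zone (region + letter suffix) goes on Placement
ec2 = boto3.client("ec2", region_name="eu-west-2")
ec2.run_instances(
    ImageId="ami-xxxxxxxx",                        # placeholder AMI
    InstanceType="g4dn.xlarge",                    # placeholder instance type
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "eu-west-2a"},  # a valid zone; "eu-west-2" alone is rejected
)
`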
For the ClearML UI:
2021-10-19 14:24:13
ClearML results page:
Spinning new instance type=aws4gpu
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2021-10-19 14:24:18
Error: Can not start new instance, Could not connect to the endpoint URL: ""
Spinning new instance type=aws4gpu
2021-10-19 14:24:28
Error: Can not start new instance, Could not connect to the endpoint URL: ""
Spinning new instance type=aws4gpu
2021-10-19 14:24:38
Error: Can no...
Spin up an instance using the AWS auto-scaler and use the init script to:
Get key-value pairs from AWS SSM and write them to a .env file
Clone the private git repo
Build the docker image locally and use the .env file during docker-compose
Enter the container and spin up clearml-agent
`
echo -e $(aws ssm --region=eu-west-2 get-parameter --name 'my-param' --with-decryption --query "Parameter.Value") | tr -d '"' > .env
set -a
source .env
set +a
git clone https://${PAT}@github.com/myrepo/toolbox.git
mv .env toolbox/
cd toolbox/
docker-compose up -d --build
docker exec -it $(docker-compose ps -q) clearml-agent daemon --detached --gpus 0 --queue default
`
The latest commit to the repo uses 22.02-py3
( https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/clearml_serving/engines/triton/Dockerfile#L2 ) I will have a look at versions now 🙂
Sure, I'll check this out later in the week and get back to you
Okay, I'm going to look into this further. We had around 70 volumes that were not deleted, but that could have been due to something else.
Hi SuccessfulKoala55 who's the best person on the team to speak with?
It doesn't help that the stacktrace isn't very verbose
I can run clearml.OutputModel(task, framework='pytorch')
to get the model from a previous task. But how can I get the PyTorch model ( torch.nn.Module ) from the output model object?
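A minimal sketch of one way to do that, assuming the model was registered on the previous task and saved with torch.save on the full module (if only a state_dict was saved, you would instantiate the architecture first and call load_state_dict):
`
import torch
from clearml import Task

task = Task.get_task(task_id="<previous-task-id>")   # placeholder id
output_model = task.models["output"][-1]             # last output model registered on the task
weights_path = output_model.get_local_copy()         # downloads the weights file locally
model = torch.load(weights_path)                     # torch.nn.Module if the full module was saved
`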
In short, we clone the repo, build the docker container, and run the agent in the container. The reason we do it this way, rather than provide a docker image to the clearml-agent, is twofold:
We actively develop our custom networks and architectures within a containerised env to make it easy for engineers to have a quick dev cycle for new models (the same repo is cloned and we build the docker container to work inside).
We use the same repo to serve models on our backend (in a slightly different contain...
Okay thanks for the update 🙂 the account manager got involved and the limit has been approved 🚀
I'd like to get what I'll call Run Time
via the task object.... I think I need to calculate it manually
i.e.
`
task = clearml.Task.get_task(id)
time = task.data.last_update - task.data.started
`
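For reference, a small sketch of that manual calculation, assuming both fields come back as datetimes so the difference is a timedelta:
`
from clearml import Task

task = Task.get_task(task_id="<task-id>")              # placeholder id
run_time = task.data.last_update - task.data.started   # datetime.timedelta, assuming both fields are set
print(run_time.total_seconds())
`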
Umm no luck
`
q = client.queues.get_all(name='default')[0]
from_date = math.floor(datetime.timestamp(datetime.now() - relativedelta(months=3)))
to_date = math.floor(datetime.timestamp(datetime.now()))
res = client.queues.get_queue_metrics(from_date=from_date, to_date=to_date, interval=1, queue_ids=[q.id])
`
Going for something like this:
` >>> queue = QueueMetrics(queue='queueid')
queue.avg_waiting_times `
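A rough sketch of what that wrapper could look like, reusing the APIClient calls above; the response field names (queues, dates, avg_waiting_times) are assumptions about the queues.get_queue_metrics payload and may need adjusting:
`
import math
from datetime import datetime
from dateutil.relativedelta import relativedelta
from clearml.backend_api.session.client import APIClient


class QueueMetrics:
    """Hypothetical convenience wrapper around queues.get_queue_metrics."""

    def __init__(self, queue, months_back=3):
        self.client = APIClient()
        self.queue_id = queue
        self.months_back = months_back

    @property
    def avg_waiting_times(self):
        from_date = math.floor(datetime.timestamp(datetime.now() - relativedelta(months=self.months_back)))
        to_date = math.floor(datetime.timestamp(datetime.now()))
        res = self.client.queues.get_queue_metrics(
            from_date=from_date, to_date=to_date, interval=1, queue_ids=[self.queue_id]
        )
        # assumed response shape: one entry per queue, each with a time series of avg waiting times
        return res.queues[0].avg_waiting_times


queue = QueueMetrics(queue='queueid')
print(queue.avg_waiting_times)
`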
I've got it... I just remembered I can call task_id
from the cloned task and check the status of that 🙂
This was the response from AWS:
"Thank you for for sharing the requested details with us. As we discussed, I'd like to share that our internal service team is currently unable to support any G type vCPU increase request for limit increase.
The issue is we are currently facing capacity scarcity to accommodate P and G instances. Our engineers are working towards fixing this issue. However, until then, we are unable to expand the capacity and process limit increase."
g4dn.xlarge (the best price for 16GB of GPU RAM). Not so surprising they would want a switch