For reference:
` import subprocess

for i in ['1', '2']:
    command = ['python', 'hyp_op.py', '--testnum', f'{i}']
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) `
Yep just about to do that. Just annoying to add arg parser etc
` echo -e $(aws ssm --region=eu-west-2 get-parameter --name 'my-param' --with-decryption --query "Parameter.Value") | tr -d '"' > .env
set -a
source .env
set +a
git clone https://${PAT}@github.com/myrepo/toolbox.git
mv .env toolbox/
cd toolbox/
docker-compose up -d --build
docker exec -it $(docker-compose ps -q) clearml-agent daemon --detached --gpus 0 --queue default `
so I don't think it's an access issue
When I run it in the UI I get the following response: Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]
When I run it programmatically it just stalls and I don't get any readout
I was having an issue with the availability zone. I was using 'eu-west-2' instead of 'eu-west-2c'
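For anyone hitting the same thing, the fix lives in the autoscaler's resource configuration. A minimal sketch below, with keys following the ClearML AWS autoscaler example; the instance type, AMI id and EBS settings are placeholders rather than our real values:
` resource_configurations = {
    "gpu_machine": {
        "instance_type": "g4dn.xlarge",
        "is_spot": False,
        # must include the zone letter, e.g. 'eu-west-2c'; plain 'eu-west-2' is rejected
        "availability_zone": "eu-west-2c",
        "ami_id": "ami-xxxxxxxx",
        "ebs_device_name": "/dev/sda1",
        "ebs_volume_size": 100,
        "ebs_volume_type": "gp3",
    }
} `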
Nope AWS aren't approving the increased vCPU request. I've explained the use case several times and they've not approved
g4dn.xlarge (the best price for 16GB of GPU RAM). Not so surprising they would want a switch
okay so this could be a python script that generates the clearml.conf in the working dir in the container?
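Something along these lines is what I had in mind, just as a sketch; the env var names and the minimal conf layout are my assumptions, not a tested config:
` import os
from pathlib import Path

# Minimal clearml.conf template; credentials are injected from the environment
CONF_TEMPLATE = """
api {{
    web_server: {web_server}
    api_server: {api_server}
    files_server: {files_server}
    credentials {{
        access_key: "{access_key}"
        secret_key: "{secret_key}"
    }}
}}
"""

def write_clearml_conf(path="clearml.conf"):
    Path(path).write_text(
        CONF_TEMPLATE.format(
            web_server=os.environ["CLEARML_WEB_HOST"],
            api_server=os.environ["CLEARML_API_HOST"],
            files_server=os.environ["CLEARML_FILES_HOST"],
            access_key=os.environ["CLEARML_API_ACCESS_KEY"],
            secret_key=os.environ["CLEARML_API_SECRET_KEY"],
        )
    )

if __name__ == "__main__":
    write_clearml_conf() `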
I'll add a more detailed response once it's working
the agent is for replicating what you run locally elsewhere, i.e. on a remote GPU machine
nope you'll just need to install clearml
Not sure if it's a power outage; services in London are working and Cambridge services are down 🤔 I'll keep you updated
We use albumentations with scripts that execute remotely and have no issues. Good question from CostlyOstrich36
In short we clone the repo, build the docker container, and run the agent in the container. The reason we do it this way, rather than provide a docker image to the clearml-agent, is twofold:
We actively develop our custom networks and architectures within a containerised env to make it easy for engineers to have a quick dev cycle for new models (the same repo is cloned and we build the docker container to work inside). We use the same repo to serve models on our backend (in a slightly different contain...
(deepmirror) ryan@ryan:~$ python -c "import clearml
print(clearml.__version__)"
1.1.4
Yes, it's the dependencies. At the moment I'm doing this as a workaround.
` autoscaler = AwsAutoScaler(hyper_params, configurations)
startup_bash_script = [
'...',
]
autoscaler.startup_bash_script = startup_bash_script `
I'd prefer to run it on the Web UI. Also, we seem to have problems when it's executed remotely.
Trying to retrieve logs now 🙂 Yes I mean the machines are not accessible. Trying to figure what's going on
lmk if I can expand on this more 🙂
I'm sure it used to be in task.artifacts
but that's returning an empty dict
` prev_task.artifacts
{} `
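For context, this is roughly the pattern I'd expect to work (the task id is a placeholder, and artifacts only show up if the original task actually uploaded some):
` from clearml import Task

prev_task = Task.get_task(task_id="<prev-task-id>")  # placeholder id

# .artifacts is a dict of name -> Artifact; it's empty unless the original
# task registered something, e.g. task.upload_artifact("stats", my_dict)
print(prev_task.artifacts)
if "stats" in prev_task.artifacts:
    stats = prev_task.artifacts["stats"].get() `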
This was the response from AWS:
"Thank you for for sharing the requested details with us. As we discussed, I'd like to share that our internal service team is currently unable to support any G type vCPU increase request for limit increase.
The issue is we are currently facing capacity scarcity to accommodate P and G instances. Our engineers are working towards fixing this issue. However, until then, we are unable to expand the capacity and process limit increase."
I've got it... I just remembered I can call task_id
from the cloned task and check the status of that 🙂
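i.e. something like this (sketch only; cloned_task_id is assumed to be whatever the clone call returned):
` from clearml import Task

cloned_task = Task.get_task(task_id=cloned_task_id)
print(cloned_task.get_status())  # e.g. "queued", "in_progress", "completed" `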
Still debugging.... That fixed the issue with the
nvcr.io/nvidia/tritonserver:22.02-py3
container which now returns
` =============================
== Triton Inference Server ==
NVIDIA Release 22.02 (build 32400308)
Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Co...
AgitatedDove14 is anyone working on a GCP or Azure autoscaler at the moment?