
$ curl -X 'POST' '
' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "url": "
" }'
{"digit":5}
The latest commit to the repo uses 22.02-py3 ( https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/clearml_serving/engines/triton/Dockerfile#L2 ). I will have a look at versions now 🙂
Okay just for clarity...
Originally, my NVIDIA drivers were on a version incompatible with the Triton server:
This container was built for NVIDIA Driver Release 510.39 or later, but version 470.103.01 was detected and compatibility mode is UNAVAILABLE.
To fix this I updated the drivers on my base OS, i.e.
sudo apt install nvidia-driver-510 -y
sudo reboot
Then it worked. The docker-compose logs from the clearml-serving-triton container did not make this clear (i.e. by r...
` python upload_data_to_clearml_copy.py
Generating SHA2 hash for 1 files
100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 733.91it/s]
Hash generation completed
0%| | 0/1 [00:00<?, ?it/s]
Compressing local files, chunk 1 [remaining 1 files]
100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 538.77it/s]
File compression completed: t...
Same with the new version:
(deepmirror) ryan@ryan:~/GitHub/deepmirror/ml-toolbox$ python -c "import clearml; print(clearml.__version__)"
1.6.1
Generating SHA2 hash for 1 files
100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2548.18it/s]
Hash generation completed
Uploading dataset changes (1 files compressed to 130 B) to BUCKET
File compression and upload completed: total size 130 B, 1 chunked stored (average size 130 B)
` client.queues.get_default()
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.9/site-packages/clearml/backend_api/session/client/client.py", line 378, in new_func
return Response(self.session.send(request_cls(*args, **kwargs)))
File "/opt/conda/lib/python3.9/site-packages/clearml/backend_api/session/client/client.py", line 122, in send
raise APIError(result)
clearml.backend_api.session.client.client.APIError: APIError: code 4...
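For context, a minimal sketch (not from the original thread) of wrapping the same APIClient call with error handling; the error code above was truncated, and per the follow-up the likely cause was the API key/secret permissions.
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
try:
    # Same call as in the traceback above; assumes valid credentials in clearml.conf
    default_queue = client.queues.get_default()
    # Assumption: the response exposes the queue's id and name fields
    print(default_queue.id, default_queue.name)
except Exception as err:  # an APIError here usually points at credentials/permissions
    print(f"queues.get_default failed: {err}")
```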
Thanks JitteryCoyote63 , I'll double check the permissions of key/secrets and if no luck I'll check with the team
Okay thanks for the update 🙂 the account manager got involved and the limit has been approved 🚀
This was the response from AWS:
"Thank you for for sharing the requested details with us. As we discussed, I'd like to share that our internal service team is currently unable to support any G type vCPU increase request for limit increase.
The issue is we are currently facing capacity scarcity to accommodate P and G instances. Our engineers are working towards fixing this issue. However, until then, we are unable to expand the capacity and process limit increase."
` # dataset_class.py
from PIL import Image
from torch.utils.data import Dataset as BaseDataset
class Dataset(BaseDataset):
    def __init__(
        self,
        images_fps,
        masks_fps,
        augmentation=None,
    ):
        self.augmentation = augmentation
        self.images_fps = images_fps
        self.masks_fps = masks_fps
        self.ids = len(images_fps)

    def __getitem__(self, i):
        # read data
        img = Image.open(self.images_fps[i])
        mask = Image...
echo -e $(aws ssm --region=eu-west-2 get-parameter --name 'my-param' --with-decryption --query "Parameter.Value") | tr -d '"' > .env
set -a
source .env
set +a
git clone https://${PAT}@github.com/myrepo/toolbox.git
mv .env toolbox/
cd toolbox/
docker-compose up -d --build
docker exec -it $(docker-compose ps -q) clearml-agent daemon --detached --gpus 0 --queue default
For the ClearML UI:
2021-10-19 14:24:13 ClearML results page:
Spinning new instance type=aws4gpu
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2021-10-19 14:24:18 Error: Can not start new instance, Could not connect to the endpoint URL: "
"
Spinning new instance type=aws4gpu
2021-10-19 14:24:28 Error: Can not start new instance, Could not connect to the endpoint URL: "
"
Spinning new instance type=aws4gpu
2021-10-19 14:24:38
Error: Can no...
Can you try to go into 'Settings' -> 'Configuration' and verify that you have 'Show Hidden Projects' enabled?
`
2021-10-19 14:19:07
Spinning new instance type=aws4gpu
Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]
Spinning new instance type=aws4gpu
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]
S...
I'll add a more detailed response once it's working
Yes, on the apps page. Is it possible to trigger it programmatically?
remote execution is working now. Internal worker nodes had not spun up the agent correctly 😛
I can run clearml.OutputModel(task, framework='pytorch') to get the model from a previous task, but how can I get the PyTorch model (torch.nn.Module) from the OutputModel object?
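One way to do this (a sketch, assuming the registered weights are a state_dict saved with torch.save, and with MyModel standing in for the matching torch.nn.Module subclass):
```python
import torch
from clearml import Task

prev_task = Task.get_task(task_id="<previous_task_id>")  # placeholder task ID
output_model = prev_task.models["output"][-1]            # last output model registered on the task
weights_path = output_model.get_local_copy()             # download the weights file locally

model = MyModel()                                        # hypothetical torch.nn.Module subclass matching the weights
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()
```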
so I guess if the status has changed from running to completed
In short, we clone the repo, build the docker container, and run the agent in the container. The reason we do it this way, rather than providing a docker image to the clearml-agent, is twofold:
1. We actively develop our custom networks and architectures within a containerised env to make it easy for engineers to have a quick dev cycle for new models (the same repo is cloned and we build the docker container to work inside).
2. We use the same repo to serve models on our backend (in a slightly different contain...
`
import os
import glob
from clearml import Dataset
DATASET_NAME = "Bug"
DATASET_PROJECT = "ProjectFolder"
TARGET_FOLDER = "clearml_bug"
S3_BUCKET = os.getenv('S3_BUCKET')
if not os.path.exists(TARGET_FOLDER):
    os.makedirs(TARGET_FOLDER)

with open(f'{TARGET_FOLDER}/data.txt', 'w') as f:
    f.writelines('Hello, ClearML')

target_files = glob.glob(TARGET_FOLDER + "/**/*", recursive=True)

# upload dataset
dataset = Dataset.create(dataset_name=DATASET_NAME, dataset_project=DATASET_PR...
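Not part of the original snippet: a hedged sketch of fetching the dataset back afterwards, assuming the truncated create/upload/finalize calls above completed.
```python
from clearml import Dataset

# Name and project values taken from the snippet above
dataset = Dataset.get(dataset_name="Bug", dataset_project="ProjectFolder")
local_path = dataset.get_local_copy()  # read-only local copy of the dataset files
print(local_path)
```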
I've got it... I just remembered I can call task_id on the cloned task and check the status of that 🙂
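Roughly what that looks like (a sketch; cloned_task_id is assumed to be the ID of the cloned task):
```python
from clearml import Task

cloned_task = Task.get_task(task_id=cloned_task_id)  # cloned_task_id assumed available from the clone step
status = cloned_task.get_status()                    # e.g. "queued", "in_progress", "completed"
if status == "completed":
    # the remote run has finished
    ...
```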
I was having an issue with the availability zone: I was using 'eu-west-2' instead of 'eu-west-2c'.
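For illustration only (not from the original setup), the same region-vs-zone distinction seen through boto3's RunInstances; the AMI and instance type are placeholders.
```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # region name
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",                 # placeholder AMI
    InstanceType="g4dn.xlarge",                      # placeholder instance type
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "eu-west-2c"},    # a specific zone, not just the region
)
```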
Okay, I'm going to look into this further. We had around 70 volumes that were not deleted, but that could have been due to something else.