I have managed to create a docker container from the Triton task and run it in interactive mode; however, I get a different set of errors, but I think these are related to the command line arguments I used to spin up the docker container, compared to the command used by the ClearML orchestration system.
My simplified docker command was: docker run -it --gpus all --ipc=host task_id_2cde61ae8b08463b90c3a0766fffbfe9
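(For comparison, the docker run line the agent itself issues, which shows up at the top of the task console log, adds a number of extra mounts and environment variables. Roughly the sort of thing sketched below, where the exact paths and variable names are my guess rather than what the agent actually emits.)

```bash
# rough guess at the extra flags the agent adds - check the task's console log for the real line
docker run -it --gpus all --ipc=host \
  -e CLEARML_API_HOST=https://your-clearml-api-server:8008 \
  -e CLEARML_API_ACCESS_KEY=... \
  -e CLEARML_API_SECRET_KEY=... \
  -v /path/to/clearml.conf:/root/clearml.conf \
  task_id_2cde61ae8b08463b90c3a0766fffbfe9
```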
However, looking at the Triton inference server object logging, I can see there...
It’s a PyTorch model trained with the Ignite framework, using one of the three well-known vision model packages: TIMM, PYTORCHCV or TORCHVISION.
I am a bit confused because I can see configuration sections for Azure storage in the clearml.conf files, but these are on the client PC and the clearml-agent compute nodes.
So do these parameters have to be set on the clients and compute nodes individually, or is it something that can be set on the server?
I was thinking that I could run it on the compute node in the environment that the agent is executed from, but actually it is the environment inside the docker container that the Triton server is executing in.
Could I use the clearml-agent build command and the Triton serving engine task ID to create a docker container that I could then use interactively to run these tests?
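Something along these lines is what I had in mind (the exact flags are from memory, so worth checking against clearml-agent build --help, and the target image name is just a placeholder):

```bash
# build a docker image from the existing serving task, then run it interactively
# flag names are from memory - verify with: clearml-agent build --help
clearml-agent build --id 2cde61ae8b08463b90c3a0766fffbfe9 --docker --target triton-serving-debug

docker run -it --gpus all --ipc=host triton-serving-debug /bin/bash
```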
Just another thought, could this be caused by using a non-default location for clearml.conf?
I have a clearml.conf in the default location which is configured for the training agents, and I created a separate one for the inference service and put it in a subfolder of my home dir. The agent on the default queue to be used for inference serving was executed using clearml-agent daemon --config-file /path/to/clearml.conf
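In other words, roughly this (the queue name is just whatever the serving queue is called):

```bash
# agent for the inference queue, running in docker mode with the non-default config file
clearml-agent daemon --config-file /path/to/clearml.conf --queue default --docker --detached
```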
AgitatedDove14 ,
Often questions are asked at the beginning of a data science project like "how long will that take?" or "what are the chances it will work to this accuracy?".
To the uninitiated, these would seem like relatively innocent and easy to answer questions. If a person has a project management background, with more clearly defined technical tasks like software development or mechanical engineering, then often work packages and uncertainties relating to outcomes are m...
This was the code:
```python
import os
import argparse
# ClearML modules
from clearml import Dataset

parser = argparse.ArgumentParser(description='CUB200 2011 ClearML data uploader - Ed Morris (c) 2021')
parser.add_argument(
    '--dataset-basedir',
    dest='dataset_basedir',
    type=str,
    help='The directory to the root of the dataset',
    default='/home/edmorris/projects/image_classification/caltech_birds/data/images')
parser.add_argument(
    '--clearml-project',
    dest='clearml_projec...
```
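The rest of the script got cut off above; the upload itself is just the standard Dataset flow, roughly this shape (reconstructed from memory, so the argument and dataset names are approximate rather than the exact original code):

```python
# rough sketch of the upload step (not the exact original code)
args = parser.parse_args()

dataset = Dataset.create(
    dataset_name='cub200_2011_train_dataset',  # name as it appears later in the download logs
    dataset_project=args.clearml_project       # e.g. 'Birds/Datasets'
)
dataset.add_files(path=args.dataset_basedir)   # add the local image folder
dataset.upload()                               # push files to the configured storage (Azure blob here)
dataset.finalize()
```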
This appears to confirm it as well.
https://github.com/pytorch/pytorch/issues/1158
Thanks AgitatedDove14 , you're very helpful.
I have changed the configuration file created by Certbot to listen on port 8080 instead of port 80, however, when I restart the NGINX service, I get errors relating to bindings.
```nginx
server {
    listen 8080 default_server;
    listen [::]:8080 ipv6only=on default_server;
```
Restarting the service results in the following errors:
```
● nginx.service - A high performance web server and a reverse proxy server
   Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: ...
```
So, AgitatedDove14 what I really like about the approach with ClearML is that you can genuinely bring the architecture into the development process early. That has a lot of desirable outcomes, including versioning and recording of experiments, dataset versioning etc. Also it would enforce a bit more structure in project development, if things are required to fit into a bit more of a defined box (or boxes). However, it also seems to be not too prescriptive, such that I would worry that a lot...
I should say that the company I work for, Malvern Panalytical, is developing an internal MLOps capability, and we are starting to develop a containerized deployment system for developing, training and deploying machine learning models. Right now we are at the early stages of development, and our current solution is based on using Azure MLOps, which I personally find very clunky.
So I have been tasked with investigating alternatives to replace the training and model deployment side of thing...
SuccessfulKoala55
SUCCESS!!!
This appears to be working.
Set up certificates using sudo certbot --nginx.
Then edit the default configuration file in /etc/nginx/sites-available
```nginx
server {
    listen 80;
    return 301 https://$host$request_uri;
}

server {
    listen 443;
    server_name your-domain-name;
    ssl_certificate /etc/letsencrypt/live/your-domain-name/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain-name/privkey.pem;
    ...
```
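The part that's cut off is essentially just the location block forwarding traffic to the ClearML web UI; something along these lines, assuming the webserver container is mapped to port 8080 on the host (adjust to your deployment):

```nginx
    # hypothetical continuation - proxy requests through to the ClearML web UI
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```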
Oh it's a load balancer, so it does that and more.
But I suppose the point holds though, it provides an end-point for external locations, and then handles the routing to the correct resources.
SuccessfulKoala55 I am not that familiar with AWS. Is that essentially a port forwarding service, where you have a secure end point that redirects to the actual server?
Just ran a model which pulled the dataset from the Azure Blob Storage and that seemed to look correct.
```
2021-06-04 13:34:21,708 - clearml.storage - INFO - Downloading: 13.00MB / 550.10MB @ 32.59MBs from
Birds%2FDatasets/cub200_2011_train_dataset.37a8f00931b04952a1500e3ada831022/artifacts/data/dataset.37a8f00931b04952a1500e3ada831022.zip
2021-06-04 13:34:21,754 - clearml.storage - INFO - Downloading: 21.00MB / 550.10MB @ 175.54MBs from
Birds%2FDatasets/cub200_2011_train_dataset...
```
I think I failed in explaining myself. I meant: instead of multiple CUDA versions installed on the same host/docker, wouldn't it make sense to just select a different out-of-the-box docker with the right CUDA, directly from the public nvidia dockerhub offering? (This is just another argument on the Task that you can adjust.) Wouldn't that be easier for users?
Absolutely aligned with you there AgitatedDove14. I understood you correctly.
My default is to work with native VM images, a...
Hmmmm, I thought it logged it with the terminal results when it was uploading weights, but perhaps that's only the live version and the saved version is pruned? Or my memory is wrong.... it is Friday after all!
Can't find anymore reference to it, sorry.
So I've been testing bits and pieces individually.
For example, I made a custom image for the VMSS nodes, which is based on Ubuntu and has multiple CUDA versions installed, as well as conda and docker pre-installed.
I've managed to test the setup script, so that it executes on a pristine node and results in a compute node being added to the relevant queue, but that's been executed manually by me, as I have the credentials to log on via SSH.
And I had to do things to get the clearml-server the ma...
This job did download pre-trained weights, so the only difference between them is the local dataset cache.
I think perhaps as standard, the docker group is already created.
The bit that isn't done is making your user part of that group.
I have rerun the serving example with my PyTorch job, but this time I have followed the MNIST Keras example.
I added a GPU compute resource to the default queue and then executed the service on the default queue.
This resulted in a Triton serving engine container spinning up on the compute resource; however, it failed due to the previous issue with port conflicts:
```
2021-06-08 16:28:49
task f2fbb3218e8243be9f6ab37badbb4856 pulled from 2c28e5db27e24f348e1ff06ba93e80c5 by worker ecm-clear...
```
Right, I am still a bit confused to be honest.
AgitatedDove14 Thanks for that.
I suppose the same would need to be done for any client PC running clearml from which you are submitting dataset upload jobs?
That is, the dataset is perhaps local to my laptop, or on a development VM that is not in the clearml system, but from there I want to submit a copy of a dataset, so I would need to configure the storage section in the same way as well?
I assume the account name and key refers to the storage account credentials that you can f...
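For reference, the section I mean looks roughly like this in clearml.conf (the account, key and container values are placeholders):

```
sdk {
    azure.storage {
        containers: [
            {
                account_name: "your-storage-account"
                account_key: "your-storage-account-key"
                container_name: "your-container"
            }
        ]
    }
}
```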
Ok I think I managed to create a docker image of the Triton instance server, just putting the kids to bed, will have a play afterwards.
In my case it's a Tesla P40, which has 24 GB VRAM.
You need to make sure the user is part of the docker group.
Follow these commands post install of Docker engine, and don't forget to restart the terminal session for the changes to take full effect.
```bash
sudo groupadd docker
sudo usermod -aG docker ${USER}
```
Don't install Docker engine with root, your sysadmin will have kittens!
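A quick way to check it worked (after logging out and back in, or running newgrp docker in the current shell) is to run a container without sudo:

```bash
# should pull and run without needing sudo if the group change took effect
docker run hello-world
```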
AgitatedDove14 in this remote session on the compute node, where I am manually importing the clearml SDK, what's the easiest way to confirm that the Azure credentials are being imported correctly?
I assume from our discussions yesterday on the dockers, that when the orchestration agent daemon is run with a given clearml.conf, I can see that the docker run command has various flags being used to pass certain files and environment variables from the host operating system of the co...
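(The sort of check I had in mind was simply pulling a file straight from blob storage with the SDK and seeing whether it authenticates; a hypothetical URL below, to be substituted with a real path from the storage account.)

```python
from clearml import StorageManager

# hypothetical azure:// URL - if the credentials in clearml.conf are being picked up,
# this should download the blob to the local cache rather than failing with an auth error
local_copy = StorageManager.get_local_copy(
    remote_url="azure://yourstorageaccount.blob.core.windows.net/your-container/some/file.zip"
)
print(local_copy)
```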
When I run the commands you suggested above on the compute node, but on the host system within the conda environment I installed to run the agent daemon from, I get the same issues as we appear to have seen when executing the Triton inference service.
```
(py38_clearml_serving_git_dev) edmorris@ecm-clearml-compute-gpu-002:~$ python
Python 3.8.10 (default, May 19 2021, 18:05:58)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
...
```
Thanks for the last tip, "easy mistaker to maker"