AgitatedDove14 Ok I can do that.
I was just thinking it through.
Would this be best if it were executed in the Triton execution environment?
I was thinking that I can run on the compute node in the environment that the agent is executed from, but actually it is the environment inside the docker container that the Triton server is executing in.
Could I use the clearml-agent build command and the Triton serving engine task ID to create a docker container that I could then use interactively to run these tests?
Ok I think I managed to create a docker image of the Triton instance server, just putting the kids to bed, will have a play afterwards.
Would this be best if it were executed in the Triton execution environment?
It seems the issue is unrelated to the Triton ...
Could I use the clearml-agent build command and the Triton serving engine task ID to create a docker container that I could then use interactively to run these tests?
Yep, that should do it 🙂
I would start simple, no need to get the docker itself, it seems like a clearml credentials issue?!
Just another thought, this couldn’t be caused by using a non-default location for clearml.conf?
I have a clearml.conf in the default location, which is configured for training agents, and I created a separate one for the inference service and put it in a subfolder of my home dir. The agent on the default queue to be used for inference serving was executed using clearml-agent daemon --config-file /path/to/clearml.conf
If you passed the correct path it should work (if it fails it would have failed right at the beginning).
BTW: I think it is clearml-agent --config-file <file here> daemon ...
My bad you are correct, it is as you say.
Right, I am still a bit confused to be honest.
When I run the commands you suggested above on the compute node, on the host system, within the conda environment I installed to run the agent daemon from, I get the same issues we appear to have seen when executing the Triton inference service.
` (py38_clearml_serving_git_dev) edmorris@ecm-clearml-compute-gpu-002:~$ python
Python 3.8.10 (default, May 19 2021, 18:05:58)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
from clearml import InputModel
model = InputModel(model_id="57ed24c1011346d292ecc9e797ccb47e")
model.url
'Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'
model.get_local_copy()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/model.py", line 424, in get_local_copy
return self.get_weights(raise_on_error=raise_on_error)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/model.py", line 285, in get_weights
return self._get_base_model().download_model_weights(raise_on_error=raise_on_error)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/backend_interface/model.py", line 403, in download_model_weights
local_download = StorageManager.get_local_copy(uri, extract_archive=False)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration forBirds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
model.get_weights()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/model.py", line 285, in get_weights
return self._get_base_model().download_model_weights(raise_on_error=raise_on_error)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/backend_interface/model.py", line 403, in download_model_weights
local_download = StorageManager.get_local_copy(uri, extract_archive=False)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration forBirds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt `
I have managed to create a docker container from the Triton task and run it in interactive mode. However, I get a different set of errors, which I think are related to the command line arguments I used to spin up the docker container, compared to the command used by the clearml orchestration system.
My simplified docker command was: docker run -it --gpus all --ipc=host task_id_2cde61ae8b08463b90c3a0766fffbfe9
However, looking at the Triton inference server object logging, I can see there are considerably more command line arguments for the docker container when it is launched by the agent orchestration. Some of these, I think, relate to the clearml.conf setup within the Triton execution environment.
This is the full list of arguments that are passed to the docker run command by the clearml-agent orchestration when the Triton inference server service is launched:
1623251452680 ecm-clearml-compute-gpu-002:0 INFO Executing: ['docker', 'run', '-t', '--gpus', 'all', '--ipc=host', '-e', 'CLEARML_WORKER_ID=ecm-clearml-compute-gpu-002:0', '-e', 'CLEARML_DOCKER_IMAGE=nvcr.io/nvidia/tritonserver:21.03-py3 --ipc=host', '-v', '/home/edmorris/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.tv_9cnv6.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.ggzbd0vn:/root/.ssh', '-v', '/home/edmorris/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/edmorris/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/edmorris/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/edmorris/.clearml/cache:/clearml_agent_cache', '-v', '/home/edmorris/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvcr.io/nvidia/tritonserver:21.03-py3', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update && apt-get install -y $CLEARML_APT_INSTALL) ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 2cde61ae8b08463b90c3a0766fffbfe9']
VivaciousPenguin66 I have the feeling it is the first space in the URI that breaks the credentials lookup.
Let's test it:
` from clearml import StorageManager
uri = 'Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'
# original
StorageManager.get_local_copy(uri)
# quoted
StorageManager.get_local_copy(uri.replace(' ', '%20')) `
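For reference, a minimal sketch of the two quoting variants using only the standard library. The URI below is a made-up placeholder (account, container and path are assumptions, not the real values from this thread), and whether the storage backend expects one form or the other is not something this snippet can tell you.

```python
from urllib.parse import quote, unquote

# Hypothetical blob URI with spaces and brackets in the path (placeholder values).
uri = "https://myaccount.blob.core.windows.net/mycontainer/My Project/TRAIN [a, b]/model.pt"

# Variant 1: replace spaces only, as in the test above.
spaces_only = uri.replace(" ", "%20")

# Variant 2: percent-encode every reserved character in the URI.
# ':' and '/' stay as URL delimiters; '%' is kept so that sequences which are
# already encoded are not encoded a second time.
fully_quoted = quote(uri, safe=":/%")

print(spaces_only)
print(fully_quoted)
```

Variant 2 also encodes the brackets and commas, which a plain replace of spaces leaves untouched.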
Mr AgitatedDove14 Good spot sir!
Sounds like a good candidate, I will test now and report back.
AgitatedDove14
Ok so I ran both variations and I got the same results.
` >>> from clearml import StorageManager
uri_a = 'Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'
uri_b = 'Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'.replace(' ' ,'%20')
StorageManager.get_local_copy(uri_a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration forBirds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
StorageManager.get_local_copy(uri_b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration for `
Looking at the _resolve_base_url() method of the StorageHelper class, I can see that it is using furl to handle the path splitting for getting at the Azure storage account and container names.
Replicating the commands, the first one to get the Storage Account seems to have worked ok:
f = furl.furl(uri)
account_name = f.host.partition(".")[0]
Replicating above manually seems to give the same answer for both and it looks correct to me:
` >>> import furl
f_a = furl.furl(uri_a)
f_a.host.partition('.')[0]
'clearmllibrary'
f_b = furl.furl(uri_b)
f_b.host.partition('.')[0]
'clearmllibrary' `
So moving onto the container name.
Original code has the following calls:
if not f.path.segments:
    raise ValueError(
        "URI {} is missing a container name (expected "
        "[https/azure]://<account-name>.../<container-name>)".format(
            uri
        )
    )
container = f.path.segments[0]
Repeating the same commands locally results in the following:
` >>> f_a.path.segments
['artefacts', 'Caltech Birds%2FTraining', 'TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253', 'models', 'cub200_resnet34_ignite_best_model_0.pt']
if not f_a.path.segments:
... print('An error will be raised')
... else:
... print('No error will be raised')
...
No error will be raised
f_a.path.segments[0]
'artefacts'
f_b.path.segments
['artefacts', 'Caltech Birds/Training', 'TRAIN%20[Network%3A%20resnet34,%20Library%3A%20torchvision]%20Ignite%20Train%20PyTorch%20CNN%20on%20CUB200.8611ada5be6f4bb6ba09cf730ecd2253', 'models', 'cub200_resnet34_ignite_best_model_0.pt']
if not f_b.path.segments:
... print('An error will be raised')
... else:
... print('No error will be raised')
...
No error will be raised
f_b.path.segments[0]
'artefacts' `
In both cases the Azure storage container name is correct as well.
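The account and container lookup replicated above can be sketched without furl, assuming the standard library's urlsplit as a stand-in. The function name split_azure_uri and the example URI are hypothetical, and this is not the code clearml itself runs.

```python
from urllib.parse import urlsplit, unquote


def split_azure_uri(uri):
    """Return (account_name, container_name, blob_path) for a blob URI."""
    parts = urlsplit(uri)
    if not parts.hostname:
        raise ValueError("URI {} has no host name".format(uri))
    # The storage account is the first label of the host name.
    account_name = parts.hostname.partition(".")[0]
    # The first non-empty path segment is the container, the rest is the blob.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError("URI {} is missing a container name".format(uri))
    container_name = unquote(segments[0])
    blob_path = unquote("/".join(segments[1:]))
    return account_name, container_name, blob_path


# Placeholder URI, not the one from this thread.
print(split_azure_uri(
    "https://myaccount.blob.core.windows.net/mycontainer/My%20Project/model.pt"
))
```

As with the furl version, both the quoted and unquoted forms of a path give the same account and container, so a lookup failure after this step points at the configuration rather than the URI.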
So can you verify it can download the model ?
(Also can you share the clearml.conf, without actual creds 😉 )
Thanks for the last tip, "easy mistaker to maker"
AgitatedDove14
So can you verify it can download the model ?
Unfortunately it's still falling over, but then I got the same result for the credentials using both URI strings, the original, and the modified version, so it points to something else going on.
I note that the StorageHelper.get() method has a call which modifies the URI prior to it being passed to the function which gets the storage account and container name. However, when I run this locally, it doesn't seem to do anything to the paths.
` >>> uri_a == StorageHelper._canonize_url(uri_a)
True
uri_b == StorageHelper._canonize_url(uri_b)
True
uri_a == StorageHelper._canonize_url(uri_a)+' '
False
uri_a == StorageHelper._canonize_url(uri_a)
True `
I've used the Azure Storage Explorer application to go and find the model weights file and confirm that it is the correct path, by copying the path of the file into a new URI and doing the same test (this has path character substitutions in it already), and I got the same result:
>>> uri_c = '
`
t'
helper = StorageHelper.get(uri_c)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration for `
AgitatedDove14 in this remote session on the compute node, where I am manually importing the clearml sdk, what's the easiest way to confirm that the Azure credentials are being imported correctly?
From our discussions yesterday on the dockers, I can see that when the orchestration agent daemon is run with a given clearml.conf, the docker run command uses various flags to pass certain files and environment variables from the host operating system of the compute node to the docker container environment used for execution of the task. Therefore, is it safe to assume that if the clearml sdk works in the manual session on the compute node host environment, that will translate perfectly into the docker execution environment?
Crawls out from under the table and takes a deep breath
AgitatedDove14 you remember we talked about it being a bug or a stupid.....
Well, it's a stupid by me.... somehow I managed to propagate irregularities in the clearml.conf file such that it successfully loaded, but the expected nested structure was not there.
When the get_local_copy() method requested the model, it correctly got the azure credentials, however when the StorageHelper class tries to get the azure credentials from the configuration file, it returns an empty dictionary structure, because of the misshapen conf file. The line in question in the StorageHelper class is [ https://github.com/allegroai/clearml/blob/332ceab3eadef4997e897d171957975a247a6dc1/clearml/storage/helper.py#L195 ]:
_azure_configurations = AzureContainerConfigurations.from_config(config.get('azure.storage', {}))
Thus, when the get_config_by_uri method returns this empty structure in the _resolve_base_url method, it is then reported as the Azure credentials not being found. This is because the conf file had been corrupted in shape, and thus when it tried to find the azure.storage fields, it didn't find them.
Perhaps what is needed is some checking of the clearml.conf file to make sure that at least the basic fields are present and complain if not?
So, I have rerun clearml-agent init and transposed over my custom configurations and then run the StorageManager.get_local_copy(uri) method using the original URI, and bingo!:
2021-06-10 13:23:42,110 - clearml.storage - INFO - Downloading: 5.00MB / 81.72MB @ 20.34MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,346 - clearml.storage - INFO - Downloading: 13.00MB / 81.72MB @ 33.93MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,379 - clearml.storage - INFO - Downloading: 21.00MB / 81.72MB @ 243.96MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,537 - clearml.storage - INFO - Downloading: 29.00MB / 81.72MB @ 50.48MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,582 - clearml.storage - INFO - Downloading: 37.00MB / 81.72MB @ 176.97MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,801 - clearml.storage - INFO - Downloading: 45.00MB / 81.72MB @ 36.65MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,848 - clearml.storage - INFO - Downloading: 53.00MB / 81.72MB @ 168.57MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,893 - clearml.storage - INFO - Downloading: 61.00MB / 81.72MB @ 179.29MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:43,082 - clearml.storage - INFO - Downloading: 69.72MB / 81.72MB @ 46.13MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:43,412 - clearml.storage - INFO - Downloading: 77.72MB / 81.72MB @ 24.21MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:43,450 - clearml.storage - INFO - Downloaded 81.72 MB successfully from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt , saved to /home/edmorris/.clearml/cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
'/home/edmorris/.clearml/cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt'
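On the idea of checking clearml.conf for its basic fields, here is a rough sketch of what such a check could look like. It works on a plain nested dict rather than parsing the real file, and the helper name missing_config_keys as well as the required key paths are assumptions for illustration, not a verified clearml schema.

```python
def missing_config_keys(config, required_paths):
    """Return the dotted paths from required_paths that are absent in a nested dict."""
    missing = []
    for dotted in required_paths:
        node = config
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing


# Assumed layout; these key paths are illustrative only.
REQUIRED = ["api.credentials", "sdk.azure.storage.containers"]

good = {"api": {"credentials": {}}, "sdk": {"azure": {"storage": {"containers": []}}}}
# Same sections, but 'azure' ended up outside 'sdk': the file still loads,
# yet a nested lookup under 'sdk' finds nothing.
misshapen = {"api": {"credentials": {}}, "azure": {"storage": {"containers": []}}, "sdk": {}}

print(missing_config_keys(good, REQUIRED))       # []
print(missing_config_keys(misshapen, REQUIRED))  # ['sdk.azure.storage.containers']
```

A check like this would turn the silent empty-dictionary fallback into an explicit complaint at load time.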
AgitatedDove14
Ok, after the huge configuration file detour, we are now back to fixing genuine issues here.
To recap, in order to get the Triton container to run and to be able to connect to Azure Blob Storage, the following changes were made to the launch_engine method of the ServingService class:
For the task creation call:
The docker string was changed to remove the port specifications [to avoid the port conflicts error]. The addition of the packages argument was required, as the docker container environment needed the azure-storage-blob python package in order to connect to Azure Blob Storage to download the model weights.
` def launch_engine(self, queue_name, queue_id=None, verbose=True):
    # type: (Optional[str], Optional[str], bool) -> None
    """
    Launch serving engine on a specific queue
    :param queue_name: Queue name to launch the engine service running the inference on.
    :param queue_id: specify queue id (unique and stable) instead of queue_name
    :param verbose: If True print progress to console
    """
    # todo: add more engines
    if self._engine_type == 'triton':
        # create the serving engine Task
        engine_task = Task.create(
            project_name=self._task.get_project_name(),
            task_name="triton serving engine",
            task_type=Task.TaskTypes.inference,
            repo=" ` ` ", # for testing purposes, get the forked copy of the clearml-serving package.
            branch="main",
            #commit="b6355a1db8da307750e37e9cb37a5fc23876c8dd", # need to grab the latest commit automatically
            script="clearml_serving/triton_helper.py",
            working_directory=".",
            docker="nvcr.io/nvidia/tritonserver:21.03-py3 --ipc=host ", # removed -p 8000:8000 -p 8001:8001 -p 8002:8002
            argparse_args=[('serving_id', self._task.id), ],
            add_task_init_call=False,
            packages=['azure-storage-blob==2.1.0'], # added as suspected Azure SDK was needed for Blob Store access.
        )
        if verbose:
            print('Launching engine {} on queue {}'.format(self._engine_type, queue_id or queue_name))
        engine_task.enqueue(task=engine_task, queue_name=queue_name, queue_id=queue_id) `
Once this was solved, the Triton server was still reporting that it could not find the model.
Investigation of the triton_model_service_update_step method found that the expected name of the locally cached model was simply model.pt. However, the process of copying from the local cached directory to the directory for serving resulted in the original filename being used, so the model was not found. The target path below shows the final location of the model ready for the Triton server, which is clearly not model.pt.
[INFO] Target Path:: /models/cub200_resnet34/1/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
[INFO] Local Path:: /clearml_agent_cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
As a temporary fix, I added the following at the bottom of the method to rename the weights file to the expected name:
new_target_path = target_path.parent / 'model.pt'
shutil.move(target_path.as_posix(), new_target_path.as_posix())
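A self-contained version of that temporary rename, runnable outside the serving code. The helper name rename_to_expected and the shortened cache file name are made up for the demo; only the model.pt target name comes from the discussion above.

```python
import shutil
import tempfile
from pathlib import Path


def rename_to_expected(target_path, expected_name="model.pt"):
    """Move a copied weights file so it carries the file name the server looks for."""
    target_path = Path(target_path)
    new_target_path = target_path.parent / expected_name
    if target_path != new_target_path:
        shutil.move(target_path.as_posix(), new_target_path.as_posix())
    return new_target_path


# Demo in a throwaway directory that mimics /models/<model_name>/<version>/.
with tempfile.TemporaryDirectory() as tmp:
    version_dir = Path(tmp) / "cub200_resnet34" / "1"
    version_dir.mkdir(parents=True)
    cached = version_dir / "e38f.cub200_resnet34_ignite_best_model_0.pt"
    cached.write_bytes(b"weights")
    renamed = rename_to_expected(cached)
    print(renamed.name, renamed.exists(), cached.exists())
```

Copying straight to the final name would avoid the extra move, but that means changing the copy step itself rather than patching the end of the method.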
This has resulted in the model finally being found, but now we have different errors!!!
😞 😞 😞
After finally getting the model to be recognized by the Triton server, it now fails with the attached error messages.
Any ideas AgitatedDove14 ?
Yes, this is Triton failing to load the actual model file.
Does this file look familiar to you?
file not found: archive/constants.pkl
It’s an ignite framework trained PyTorch model using one of the three well known vision model packages: TIMM, PYTORCHCV or TORCHVISION.
I don’t have a scooby doo what that pickle file is.
Google to the rescue:
https://github.com/pytorch/pytorch/issues/47917
Fixes and identified issues can be found in these github comments.
Closing the discussion here.