AgitatedDove14 Ok I can do that.
I was just thinking it through.
Would this be best if it were executed in the Triton execution environment?

  				
Posted 3 years ago by VivaciousPenguin66

I was thinking that I could run this on the compute node, in the environment that the agent is executed from, but what actually matters is the environment inside the docker container that the Triton server is executing in.

Could I use the clearml-agent build command and the Triton serving engine task ID to create a docker container that I could then use interactively to run these tests?

  				
Posted 3 years ago by VivaciousPenguin66

Ok I think I managed to create a docker image of the Triton instance server, just putting the kids to bed, will have a play afterwards.

  				
Posted 3 years ago by VivaciousPenguin66

Would this be best if it were executed in the Triton execution environment?

It seems the issue is unrelated to the Triton ...

Could I use the clearml-agent build command and the Triton serving engine task ID to create a docker container that I could then use interactively to run these tests?

Yep, that should do it 🙂
I would start simple; no need to get into the docker itself, it seems like a clearml credentials issue?!

  				
Posted 3 years ago by AgitatedDove14

Just another thought, this couldn’t be caused by using a non default location for clearml.conf ?

I have a clearml.conf in the default location which is configured for training agents, and I created a separate one for the inference service and put it in a sub folder of my home dir. The agent on the default queue to be used for inference serving was executed using clearml-agent daemon --config-file /path/to/clearml.conf

  				
Posted 3 years ago by VivaciousPenguin66

If you passed the correct path it should work (if it fails it would have failed right at the beginning).
BTW: I think it is clearml-agent --config-file <file here> daemon ...

  				
Posted 3 years ago by AgitatedDove14

My bad you are correct, it is as you say.

  				
Posted 3 years ago by VivaciousPenguin66

Right, I am still a bit confused to be honest.

  				
Posted 3 years ago by VivaciousPenguin66

When I run the commands you suggested above on the compute node, on the host system within the conda environment I installed to run the agent daemon from, I get the same issues we appear to have seen when executing the Triton inference service.

` (py38_clearml_serving_git_dev) edmorris@ecm-clearml-compute-gpu-002:~$ python
Python 3.8.10 (default, May 19 2021, 18:05:58)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from clearml import InputModel
>>> model = InputModel(model_id="57ed24c1011346d292ecc9e797ccb47e")
>>> model.url
' Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'
>>> model.get_local_copy()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/model.py", line 424, in get_local_copy
return self.get_weights(raise_on_error=raise_on_error)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/model.py", line 285, in get_weights
return self._get_base_model().download_model_weights(raise_on_error=raise_on_error)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/backend_interface/model.py", line 403, in download_model_weights
local_download = StorageManager.get_local_copy(uri, extract_archive=False)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration for Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
>>> model.get_weights()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/model.py", line 285, in get_weights
return self._get_base_model().download_model_weights(raise_on_error=raise_on_error)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/backend_interface/model.py", line 403, in download_model_weights
local_download = StorageManager.get_local_copy(uri, extract_archive=False)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration for Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt `

  				
Posted 3 years ago by VivaciousPenguin66

I have managed to create a docker container from the Triton task and run it in interactive mode. However, I get a different set of errors, which I think are related to the command line arguments I used to spin up the docker container, compared to the command used by the clearml orchestration system.

My simplified docker command was: docker run -it --gpus all --ipc=host task_id_2cde61ae8b08463b90c3a0766fffbfe9

However, looking at the Triton inference server object logging, I can see there are considerably more command line arguments for the docker container when it is launched by the agent orchestration. Some of these, I think, relate to the clearml.conf setup within the Triton execution environment.

This is the full list of arguments that are passed to the docker run command by the clearml-agent orchestration when the Triton inference server service is launched:
1623251452680 ecm-clearml-compute-gpu-002:0 INFO Executing: ['docker', 'run', '-t', '--gpus', 'all', '--ipc=host', '-e', 'CLEARML_WORKER_ID=ecm-clearml-compute-gpu-002:0', '-e', 'CLEARML_DOCKER_IMAGE=nvcr.io/nvidia/tritonserver:21.03-py3 --ipc=host', '-v', '/home/edmorris/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.tv_9cnv6.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.ggzbd0vn:/root/.ssh', '-v', '/home/edmorris/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/edmorris/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/edmorris/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/edmorris/.clearml/cache:/clearml_agent_cache', '-v', '/home/edmorris/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvcr.io/nvidia/tritonserver:21.03-py3', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update && apt-get install -y $CLEARML_APT_INSTALL) ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 2cde61ae8b08463b90c3a0766fffbfe9']
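
To compare a hand-written docker run against the agent's launch, it can help to pull the mounts and environment variables out of the logged argument list. A minimal sketch, assuming the list has the same shape as the one logged above (`-e NAME=value` and `-v host:container` pairs); the helper name `split_docker_args` and the shortened example list are hypothetical and not part of clearml-agent:

```python
from typing import Dict, List, Tuple


def split_docker_args(argv: List[str]) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Collect `-e NAME=value` and `-v host:container` pairs from a docker argv list."""
    env: Dict[str, str] = {}
    mounts: List[Tuple[str, str]] = []
    i = 0
    while i < len(argv) - 1:
        flag, value = argv[i], argv[i + 1]
        if flag == "-e":
            # Split on the first "=" only; the value itself may contain more.
            name, _, val = value.partition("=")
            env[name] = val
            i += 2
        elif flag == "-v":
            # Split on the first ":" to separate host path from container path.
            host, _, container = value.partition(":")
            mounts.append((host, container))
            i += 2
        else:
            i += 1
    return env, mounts


# Shortened copy of the logged list (only a few of the original entries).
logged = [
    "docker", "run", "-t", "--gpus", "all", "--ipc=host",
    "-e", "CLEARML_WORKER_ID=ecm-clearml-compute-gpu-002:0",
    "-v", "/tmp/.clearml_agent.tv_9cnv6.cfg:/root/clearml.conf",
    "-v", "/home/edmorris/.clearml/cache:/clearml_agent_cache",
    "--rm", "nvcr.io/nvidia/tritonserver:21.03-py3",
]
env, mounts = split_docker_args(logged)
for host, container in mounts:
    print(f"{host} -> {container}")
```

In the logged command, a temporary .cfg file is mounted onto /root/clearml.conf, which appears to be how the agent's configuration reaches the container; an interactive run without that mount would likely not see the same settings.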

  				
Posted 3 years ago by VivaciousPenguin66

VivaciousPenguin66 I have the feeling it is the first space in the URI that breaks the credentials lookup.
Let's test it:
` from clearml import StorageManager
uri = ' Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'

# original
StorageManager.get_local_copy(uri)

# quoted
StorageManager.get_local_copy(uri.replace(' ', '%20')) `
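
The two variants being compared can be previewed without touching storage. A minimal sketch using only the standard library; the path fragment is taken from the URI above, and `quote_spaces` is a hypothetical helper name:

```python
from urllib.parse import unquote


def quote_spaces(uri: str) -> str:
    """Percent-encode spaces only, leaving existing escapes such as %3A untouched."""
    return uri.replace(" ", "%20")


fragment = "TRAIN [Network%3A resnet34, Library%3A torchvision]"
quoted = quote_spaces(fragment)
print(quoted)           # the "quoted" variant passed to get_local_copy
print(unquote(quoted))  # fully decoded form, for comparison
```

Running `urllib.parse.quote` over the whole fragment instead would also re-encode the `%` of the existing escapes, which is why this sketch only replaces spaces.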

  				
Posted 3 years ago by AgitatedDove14

Mr AgitatedDove14 Good spot sir!
Sounds like a good candidate, I will test now and report back.

  				
Posted 3 years ago by VivaciousPenguin66

AgitatedDove14

Ok so I ran both variations and I got the same results.

` >>> from clearml import StorageManager
>>> uri_a = ' Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'
>>> uri_b = ' Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt'.replace(' ', '%20')
>>> StorageManager.get_local_copy(uri_a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration for Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
>>> StorageManager.get_local_copy(uri_b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/manager.py", line 46, in get_local_copy
cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/cache.py", line 33, in get_local_copy
helper = StorageHelper.get(remote_url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration for `

  				
Posted 3 years ago by VivaciousPenguin66

Looking at the _resolve_base_url() method of the StorageHelper class I can see that it is using furl to handle the path splitting for getting at the Azure storage account and container names.

Replicating the commands, the first one to get the Storage Account seems to have worked ok:

` f = furl.furl(uri)
account_name = f.host.partition(".")[0] `

Replicating the above manually seems to give the same answer for both, and it looks correct to me:

` >>> import furl
>>> f_a = furl.furl(uri_a)
>>> f_a.host.partition('.')[0]
'clearmllibrary'
>>> f_b = furl.furl(uri_b)
>>> f_b.host.partition('.')[0]
'clearmllibrary' `

  				
Posted 3 years ago by VivaciousPenguin66

So moving onto the container name.
Original code has the following calls:

` if not f.path.segments:
    raise ValueError(
        "URI {} is missing a container name (expected "
        "[https/azure]://<account-name>.../<container-name>)".format(uri)
    )
container = f.path.segments[0] `
Repeating the same commands locally results in the following:

` >>> f_a.path.segments
['artefacts', 'Caltech Birds%2FTraining', 'TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253', 'models', 'cub200_resnet34_ignite_best_model_0.pt']
>>> if not f_a.path.segments:
...     print('An error will be raised')
... else:
...     print('No error will be raised')
...
No error will be raised
>>> f_a.path.segments[0]
'artefacts'
>>> f_b.path.segments
['artefacts', 'Caltech Birds/Training', 'TRAIN%20[Network%3A%20resnet34,%20Library%3A%20torchvision]%20Ignite%20Train%20PyTorch%20CNN%20on%20CUB200.8611ada5be6f4bb6ba09cf730ecd2253', 'models', 'cub200_resnet34_ignite_best_model_0.pt']
>>> if not f_b.path.segments:
...     print('An error will be raised')
... else:
...     print('No error will be raised')
...
No error will be raised
>>> f_b.path.segments[0]
'artefacts' `

In both cases the Azure storage container name is correct as well.
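
The two lookups checked above (storage account from the host, container from the first path segment) can be sketched with the standard library as a stand-in for furl. The URL below is a made-up example in the `https://<account-name>.../<container-name>` shape from the error message, and `split_azure_uri` is a hypothetical helper, not a clearml function:

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_azure_uri(uri: str) -> Tuple[str, str]:
    """Return (account_name, container_name) for an Azure-style blob URI."""
    parts = urlsplit(uri.strip())
    if not parts.hostname:
        raise ValueError("URI {} is missing a host".format(uri))
    # The account name is the first label of the host name.
    account_name = parts.hostname.partition(".")[0]
    # The container is the first non-empty path segment.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError("URI {} is missing a container name".format(uri))
    return account_name, segments[0]


# Hypothetical URL; the account and blob names are placeholders.
example = "https://exampleaccount.blob.core.windows.net/artefacts/some folder/model.pt"
print(split_azure_uri(example))
```

Spaces further down the path do not affect either value, which matches the observation that both URI variants resolve to the same account and container.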

  				
Posted 3 years ago by VivaciousPenguin66

So can you verify it can download the model ?

  				
Posted 3 years ago by AgitatedDove14

(Also can you share the clearml.conf, without actual creds 😉 )

  				
Posted 3 years ago by AgitatedDove14

Thanks for the last tip, "easy mistaker to maker"

  				
Posted 3 years ago by VivaciousPenguin66

AgitatedDove14

So can you verify it can download the model ?

Unfortunately it's still falling over, but then I got the same result for the credentials using both URI strings, the original, and the modified version, so it points to something else going on.

I note that the StorageHelper.get() method has a call which modifies the URI prior to it being passed to the function which gets the storage account and container name. However, when I run this locally, it doesn't seem to do anything to the paths.

` >>> uri_a == StorageHelper._canonize_url(uri_a)
True
>>> uri_b == StorageHelper._canonize_url(uri_b)
True
>>> uri_a == StorageHelper._canonize_url(uri_a)+' '
False
>>> uri_a == StorageHelper._canonize_url(uri_a)
True `
I've used the Azure Storage Explorer application to find the model weights file and confirm that the path is correct. Copying the file's path from there into a new URI (which already has the path character substitutions in it) and repeating the same test gave the same result:

>>> uri_c = ' `
t'

>>> helper = StorageHelper.get(uri_c)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 227, in get
base_url = cls._resolve_base_url(url)
File "/home/edmorris/.conda/envs/py38_clearml_serving_git_dev/lib/python3.8/site-packages/clearml/storage/helper.py", line 831, in _resolve_base_url
raise StorageError("Can't find azure configuration for {}".format(base_url))
clearml.storage.helper.StorageError: Can't find azure configuration for `

  				
Posted 3 years ago by VivaciousPenguin66

AgitatedDove14 in this remote session on the compute node, where I am manually importing the clearml sdk, what's the easiest way to confirm that the Azure credentials are being imported correctly?

From our discussions yesterday on the dockers, I can see that when the orchestration agent daemon is run with a given clearml.conf, the docker run command uses various flags to pass certain files and environment variables from the host operating system of the compute node to the docker container environment used for execution of the task. Therefore, is it safe to assume that if the clearml sdk works in the manual session on the compute node host environment, that will translate perfectly into the docker execution environment?

  				
Posted 3 years ago by VivaciousPenguin66

Crawls out from under the table and takes a deep breath

AgitatedDove14 you remember we talked about it being a bug or a stupid.....

Well, it's a stupid by me.... somehow I managed to propagate irregularities in the clearml.conf file such that it successfully loaded, but the expected nested structure was not there.

When the get_local_copy() method requested the model, it correctly got the azure credentials, however when the StorageHelper class tries to get the azure credentials from the configuration file, it returns an empty dictionary structure, because of the misshapen conf file. The line in question in the StorageHelper class is [ https://github.com/allegroai/clearml/blob/332ceab3eadef4997e897d171957975a247a6dc1/clearml/storage/helper.py#L195 ]:

_azure_configurations = AzureContainerConfigurations.from_config(config.get('azure.storage', {}))

Thus, when the get_config_by_uri method returns this empty structure in the _resolve_base_url method, it is reported as the Azure credentials not being found. This is because the conf file had been corrupted in shape, so when it tried to find the azure.storage fields, it didn't find them.

Perhaps what is needed is some checking of the clearml.conf file to make sure that at least the basic fields are present and complain if not?

So, I have rerun clearml-agent init and transposed over my custom configurations and then run the StorageManager.get_local_copy(uri) method using the original URI, and bingo!:

2021-06-10 13:23:42,110 - clearml.storage - INFO - Downloading: 5.00MB / 81.72MB @ 20.34MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,346 - clearml.storage - INFO - Downloading: 13.00MB / 81.72MB @ 33.93MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,379 - clearml.storage - INFO - Downloading: 21.00MB / 81.72MB @ 243.96MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,537 - clearml.storage - INFO - Downloading: 29.00MB / 81.72MB @ 50.48MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,582 - clearml.storage - INFO - Downloading: 37.00MB / 81.72MB @ 176.97MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,801 - clearml.storage - INFO - Downloading: 45.00MB / 81.72MB @ 36.65MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,848 - clearml.storage - INFO - Downloading: 53.00MB / 81.72MB @ 168.57MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:42,893 - clearml.storage - INFO - Downloading: 61.00MB / 81.72MB @ 179.29MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:43,082 - clearml.storage - INFO - Downloading: 69.72MB / 81.72MB @ 46.13MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:43,412 - clearml.storage - INFO - Downloading: 77.72MB / 81.72MB @ 24.21MBs from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 13:23:43,450 - clearml.storage - INFO - Downloaded 81.72 MB successfully from Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt , saved to /home/edmorris/.clearml/cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
'/home/edmorris/.clearml/cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt'
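
The kind of check suggested above (complain when expected fields are absent) could look roughly like this. It is only a sketch over a plain nested dict standing in for the parsed clearml.conf; the required key path `sdk.azure.storage.containers` and the helper names are assumptions for illustration, not clearml's own validation:

```python
from typing import Any, Iterable, List, Mapping

_MISSING = object()


def get_by_path(config: Mapping[str, Any], dotted_key: str) -> Any:
    """Walk a nested mapping along a dotted key; return _MISSING if any level is absent."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return _MISSING
        node = node[part]
    return node


def missing_keys(config: Mapping[str, Any], required: Iterable[str]) -> List[str]:
    """List every required dotted key that cannot be resolved in the config."""
    return [key for key in required if get_by_path(config, key) is _MISSING]


# Assumed location of the Azure settings; adjust to the real file layout.
REQUIRED = ("sdk.azure.storage.containers",)

good = {"sdk": {"azure": {"storage": {"containers": [
    {"account_name": "example", "account_key": "***", "container_name": "artefacts"},
]}}}}
# Misshapen file: it still loads, but the azure block sits at the wrong nesting level.
bad = {"azure": {"storage": {"containers": []}}, "sdk": {}}

print(missing_keys(good, REQUIRED))
print(missing_keys(bad, REQUIRED))
```

A file that parses but has drifted in structure is exactly the case a plain load does not catch, so reporting the unresolved keys by name would point at the problem directly.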

  				
Posted 3 years ago by VivaciousPenguin66

AgitatedDove14

Ok, after the huge configuration file detour, we are now back to fixing genuine issues here.

To recap, in order to get the Triton container to run and to be able to connect to Azure Blob Storage, the following changes were made to the launch_engine method of the ServingService class:

For the task creation call:

The docker string was changed to remove the port specifications [to avoid the port conflicts error]. The packages argument was added because the docker container environment required the azure-storage-blob python package in order to connect to Azure Blob Storage and download the model weights.
` def launch_engine(self, queue_name, queue_id=None, verbose=True):
    # type: (Optional[str], Optional[str], bool) -> None
    """
    Launch serving engine on a specific queue

    :param queue_name: Queue name to launch the engine service running the inference on.
    :param queue_id: specify queue id (unique and stable) instead of queue_name
    :param verbose: If True print progress to console
    """

    # todo: add more engines
    if self._engine_type == 'triton':
        # create the serving engine Task
        engine_task = Task.create(
            project_name=self._task.get_project_name(),
            task_name="triton serving engine",
            task_type=Task.TaskTypes.inference,
            repo=" ` ` ",  # for testing purposes, get the forked copy of the clearml-serving package.
            branch="main",
            # commit="b6355a1db8da307750e37e9cb37a5fc23876c8dd",  # need to grab the latest commit automatically
            script="clearml_serving/triton_helper.py",
            working_directory=".",
            docker="nvcr.io/nvidia/tritonserver:21.03-py3 --ipc=host ",  # removed -p 8000:8000 -p 8001:8001 -p 8002:8002
            argparse_args=[('serving_id', self._task.id), ],
            add_task_init_call=False,
            packages=['azure-storage-blob==2.1.0'],  # added as suspected Azure SDK was needed for Blob Store access.
        )
        if verbose:
            print('Launching engine {} on queue {}'.format(self._engine_type, queue_id or queue_name))
        engine_task.enqueue(task=engine_task, queue_name=queue_name, queue_id=queue_id) `

Once this was solved, the Triton server was still reporting not being able to find the model.
Investigation of the triton_model_service_update_step method found that the expected name of the locally cached model was simply model.pt. However, the process of copying from the local cache directory to the directory for serving resulted in the original filename being used, so the model was not found. The target path below shows the final location of the model ready for the Triton server, which is clearly not model.pt.

[INFO] Target Path:: /models/cub200_resnet34/1/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
[INFO] Local Path:: /clearml_agent_cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
As a temporary fix, I added the following at the bottom of the method to rename the weights file to the expected name:

new_target_path = Path(os.path.join(target_path.parent), 'model.pt')
shutil.move(target_path.as_posix(), new_target_path.as_posix())
This has resulted in the model finally being found, but now we have different errors!!!
😞 😞 😞
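
The temporary rename described above can be tried in isolation. A minimal sketch, assuming the serving directory layout `<model name>/<version>/` seen in the target path; `rename_to_expected` is a hypothetical helper and the file written here is a dummy, not real weights:

```python
import shutil
import tempfile
from pathlib import Path


def rename_to_expected(target_path: Path, expected_name: str = "model.pt") -> Path:
    """Move a copied weights file to the file name the server is expected to look for."""
    new_target_path = target_path.parent / expected_name
    shutil.move(target_path.as_posix(), new_target_path.as_posix())
    return new_target_path


with tempfile.TemporaryDirectory() as tmp:
    version_dir = Path(tmp) / "cub200_resnet34" / "1"
    version_dir.mkdir(parents=True)
    copied = version_dir / "e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt"
    copied.write_bytes(b"dummy weights")

    renamed = rename_to_expected(copied)
    print(renamed.name, renamed.exists(), copied.exists())
```

Renaming after the copy keeps the cached file untouched; only the copy inside the serving directory is moved.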

  				
Posted 3 years ago by VivaciousPenguin66

After finally getting the model to be recognized by the Triton server, it now fails with the attached error messages.
Any ideas AgitatedDove14 ?

  				
Posted 3 years ago by VivaciousPenguin66

Yes this is Triton failing to load the actual model file

  				
Posted 3 years ago by AgitatedDove14

Is it vanilla pytorch ?

  				
Posted 3 years ago by AgitatedDove14

Does this file look familiar to you?
file not found: archive/constants.pkl

  				
Posted 3 years ago by AgitatedDove14

It’s an ignite framework trained PyTorch model using one of the three well known vision model packages: TIMM, PYTORCHCV or TORCHVISION.

  				
Posted 3 years ago by VivaciousPenguin66

I don’t have a scooby doo what that pickle file is.

  				
Posted 3 years ago by VivaciousPenguin66

Google to the rescue:
https://github.com/pytorch/pytorch/issues/47917
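
The `archive/constants.pkl` message is what one would expect when a loader looks for a TorchScript archive and is handed a different kind of `.pt` file. A rough way to tell the two apart without importing torch, assuming the file is a zip archive and that a TorchScript export contains a `constants.pkl` entry (a heuristic, not an official check; `looks_like_torchscript` is a hypothetical helper):

```python
import os
import tempfile
import zipfile


def looks_like_torchscript(path: str) -> bool:
    """Heuristic: True if the zip archive at `path` has a constants.pkl entry."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return any(name.endswith("/constants.pkl") for name in archive.namelist())


# Demo on a dummy archive that mimics a plain checkpoint layout (no constants.pkl).
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.pt")
    with zipfile.ZipFile(plain, "w") as z:
        z.writestr("archive/data.pkl", b"")
    print(looks_like_torchscript(plain))
```

If the check comes back negative for the served file, exporting the model with `torch.jit.trace` or `torch.jit.script` and saving it with `torch.jit.save` is the usual route; the linked issue has the details.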

  				
Posted 3 years ago by AgitatedDove14

Fixes and identified issues can be found in these github comments.
Closing the discussion here.

https://github.com/allegroai/clearml-serving/issues/3

https://github.com/allegroai/clearml-serving/issues/5

  				
Posted 3 years ago by VivaciousPenguin66
