AgitatedDove14
Ok, after the huge configuration-file detour, we are now back to fixing genuine issues here.
To recap, in order to get the Triton container to run and be able to connect to Azure Blob Storage, the following changes were made to the `launch_engine` method of the `ServingService` class.
For the task creation call: the `docker` string was changed to remove the port specifications [to avoid the port conflicts error], and the `packages` argument was added, as the docker container environment needs the `azure-storage-blob` Python package in order to connect to Azure Blob Storage and download the model weights.
```python
def launch_engine(self, queue_name, queue_id=None, verbose=True):
    # type: (Optional[str], Optional[str], bool) -> None
    """
    Launch serving engine on a specific queue
    :param queue_name: Queue name to launch the engine service running the inference on.
    :param queue_id: specify queue id (unique and stable) instead of queue_name
    :param verbose: If True print progress to console
    """
    # todo: add more engines
    if self._engine_type == 'triton':
        # create the serving engine Task
        engine_task = Task.create(
            project_name=self._task.get_project_name(),
            task_name="triton serving engine",
            task_type=Task.TaskTypes.inference,
            repo="...",  # for testing purposes, get the forked copy of the clearml-serving package.
            branch="main",
            # commit="b6355a1db8da307750e37e9cb37a5fc23876c8dd",  # need to grab the latest commit automatically
            script="clearml_serving/triton_helper.py",
            working_directory=".",
            docker="nvcr.io/nvidia/tritonserver:21.03-py3 --ipc=host ",  # removed -p 8000:8000 -p 8001:8001 -p 8002:8002
            argparse_args=[('serving_id', self._task.id), ],
            add_task_init_call=False,
            packages=['azure-storage-blob==2.1.0'],  # added as suspected Azure SDK was needed for Blob Store access.
        )
        if verbose:
            print('Launching engine {} on queue {}'.format(self._engine_type, queue_id or queue_name))
        engine_task.enqueue(task=engine_task, queue_name=queue_name, queue_id=queue_id)
```
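Just to document why the `azure-storage-blob==2.1.0` pin matters: a minimal connectivity check with that 2.x SDK looks roughly like the sketch below. The environment variable names and the container name are placeholders I made up for illustration, not the real values.
```python
import os
from azure.storage.blob import BlockBlobService  # azure-storage-blob 2.x API

# Minimal check that the container environment can reach Blob Storage:
# list the blobs in the container holding the model weights.
service = BlockBlobService(
    account_name=os.environ["AZURE_STORAGE_ACCOUNT"],
    account_key=os.environ["AZURE_STORAGE_KEY"],
)
for blob in service.list_blobs("model-weights"):
    print(blob.name)
```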
Once this was solved, the Triton server was still reporting that it could not find the model.
Investigating the `triton_model_service_update_step` method showed that the expected filename of the locally cached model is simply `model.pt`; however, the copy from the local cache directory into the serving directory kept the original filename, so the model was not found. The target path below shows the final location of the model ready for the Triton server, which is clearly not `model.pt`:
```
[INFO] Target Path:: /models/cub200_resnet34/1/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
[INFO] Local Path:: /clearml_agent_cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
```
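For context, Triton's default model repository layout expects each version directory to contain a file named `model.<ext>` (`model.pt` for the PyTorch backend), i.e. `/models/<model_name>/<version>/model.pt`. A sketch of copying the cached weights straight to that name, rather than renaming afterwards, could look like this (the helper and its arguments are mine for illustration, not the actual `triton_model_service_update_step` code):
```python
from pathlib import Path
import shutil

def copy_weights_for_triton(local_path, model_repo="/models",
                            model_name="cub200_resnet34", version="1"):
    # Triton's default layout: <repo>/<model_name>/<version>/model.<ext>
    # Copy the cached weights under the expected name instead of keeping
    # the cache-hashed original filename.
    local_path = Path(local_path)
    target_dir = Path(model_repo) / model_name / version
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / ("model" + local_path.suffix)  # e.g. model.pt
    shutil.copy2(local_path.as_posix(), target_path.as_posix())
    return target_path
```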
As a temporary fix, I added the following at the bottom of the method to rename the weights file to the expected name:
```python
new_target_path = Path(target_path.parent, 'model.pt')
shutil.move(target_path.as_posix(), new_target_path.as_posix())
```
This has resulted in the model finally being found, but now we have different errors!!!
😞 😞 😞
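Longer term, rather than renaming after the copy, it would probably be cleaner either to write the file out as `model.pt` in the copy step itself (as in the sketch above) or, if I am reading the Triton model config docs right, to set `default_model_filename` in the model's `config.pbtxt` so Triton accepts the original filename; either option would keep the update step idempotent if it runs more than once.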