After finally getting the model to be recognized by the Triton server, it now fails with the attached error messages.
Any ideas AgitatedDove14 ?
So I've been testing bits and pieces individually.
For example, I made a custom image for the VMSS nodes, which is based on Ubuntu and has multiple CUDA versions installed, as well as conda and docker pre-installed.
I've managed to test the setup script, so that it executes on a pristine node and results in a compute node being added to the relevant queue, but so far I have run it manually, as I have the credentials to log on via SSH.
And I had to do things get the clearml-server the ma...
Thanks CostlyOstrich36 , you can also get access to the keys in the Azure Storage Explorer.
Looking at the Properties section gives the secure keys.
Good question, SuccessfulKoala55
My thoughts are orbiting around environment orchestration and having a bit more control over how an environment is created. I understand that the easiest form of the configuration is to implement it on the clearml-agent side and run a daemon with the configuration as required, whether that be using venvs or docker containers. Of course, this limits the deployment type to the queue that the daemon is listening to.
I was considering if that by exposing the...
Fixes and identified issues can be found in these github comments.
Closing the discussion here.
I have also tried training a variety of network architectures from a number of libraries (Torchvision, pytorchcv, TIMM), as well as a simple VGG implementation from scratch, and come across the same issues.
The following was reported by the agent during the setup phase of the compute environment on the remote compute resource:
Log file is attached.
AgitatedDove14 I would love to help the project.
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything. However, once settled in the new place, and I have a bit more time, I would very much welcome contributing.
AgitatedDove14
So can you verify it can download the model ?
Unfortunately it's still falling over, but then I got the same result for the credentials using both URI strings, the original, and the modified version, so it points to something else going on.
I note that the StorageHelper.get() method has a call which modifies the URI prior to it being passed to the function which gets the storage account and container name. However, when I run this locally, it doesn't seem to do a...
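To make the URI question concrete, here is a minimal sketch of how a storage account and container name could be pulled out of an Azure URI. It assumes the `azure://<account>.blob.core.windows.net/<container>/<blob>` form; `parse_azure_uri` and `AzureBlobLocation` are hypothetical names for illustration and are not the code that StorageHelper actually runs.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class AzureBlobLocation(NamedTuple):
    """Hypothetical container for the three parts of an Azure blob URI."""
    account: str
    container: str
    blob_path: str


def parse_azure_uri(uri: str) -> AzureBlobLocation:
    """Split azure://<account>.blob.core.windows.net/<container>/<blob>."""
    parsed = urlparse(uri)
    if parsed.scheme != "azure":
        raise ValueError(f"expected an azure:// URI, got {uri!r}")
    # The storage account is the first label of the host name.
    account = parsed.netloc.split(".", 1)[0]
    # The first path segment is the container; the remainder is the blob path.
    container, _, blob_path = parsed.path.lstrip("/").partition("/")
    if not account or not container:
        raise ValueError(f"could not find account and container in {uri!r}")
    # The blob path is returned untouched, so percent-encoding such as %2F survives.
    return AzureBlobLocation(account, container, blob_path)
```

Running the original and the modified URI string through a helper like this is a quick way to check whether both resolve to the same account and container before suspecting the credentials.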
Just ran a model which pulled the dataset from the Azure Blob Storage and that seemed to look correct.
2021-06-04 13:34:21,708 - clearml.storage - INFO - Downloading: 13.00MB / 550.10MB @ 32.59MBs from Birds%2FDatasets/cub200_2011_train_dataset.37a8f00931b04952a1500e3ada831022/artifacts/data/dataset.37a8f00931b04952a1500e3ada831022.zip
2021-06-04 13:34:21,754 - clearml.storage - INFO - Downloading: 21.00MB / 550.10MB @ 175.54MBs from Birds%2FDatasets/cub200_2011_train_dataset...
As AnxiousSeal95 says, clearml server will version a dataset for you and push it to a unified storage place, as well as make it diffable.
I’ve written a workshop on how to train image classifiers for the problem of bird species identification and recently I’ve adapted it to work with clearml.
There is an example workbook on how to upload a dataset to clearml server, in this case a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add...
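For anyone who wants the gist without opening the notebook, uploading a directory of images looks roughly like the sketch below. It is not copied from the notebook: `list_images` and `upload_image_dataset` are hypothetical helper names, the set of image suffixes is an assumption, and `output_url` is optional (left as `None`, the upload goes to whatever default storage the client is configured with).

```python
from pathlib import Path
from typing import List, Optional

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(image_dir: str) -> List[Path]:
    """Return every image file under image_dir (searched recursively), sorted."""
    root = Path(image_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def upload_image_dataset(image_dir: str, name: str, project: str,
                         output_url: Optional[str] = None) -> str:
    """Create a ClearML dataset from a directory of images and upload it."""
    if not list_images(image_dir):
        raise ValueError(f"no images found under {image_dir!r}")
    # Imported lazily so list_images() can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=image_dir)      # register the local files
    dataset.upload(output_url=output_url)  # push them to storage
    dataset.finalize()                     # close this dataset version
    return dataset.id
```

Calling `upload_image_dataset` needs a working clearml.conf on the machine doing the upload, since that is where the server and storage credentials are read from.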
AgitatedDove14
Ok, after a huge configuration file detour, we are now back to fixing genuine issues here.
To recap, in order to get the Triton container to run and to be able to connect to Azure Blob Storage, the following changes were made to the launch_engine method of the ServingService class:
For the task creation call:
The docker string was changed to remove the port specifications [to avoid the port conflicts error]. The addition of the packages argument was required, as the doc...
FYI, I am training the model again, this time in a project which is not nested, just to rule out any funnies with regards to issues with nested projects.
It’s a PyTorch model trained with the ignite framework, using one of the three well-known vision model packages: TIMM, PYTORCHCV or TORCHVISION.
AgitatedDove14 Thanks for that.
I suppose the same would need to be done for any client PC running clearml such that you are submitting dataset upload jobs?
That is, if the dataset is local to my laptop, or on a development VM that is not in the clearml system, and from there I want to submit a copy of the dataset, would I need to configure the storage section in the same way as well?
I assume the account name and key refers to the storage account credentials that you can f...
I am a bit confused because I can see configuration sections for Azure storage in the clearml.conf files, but these are on the client PC and the clearml-agent compute nodes.
So do these parameters have to be set on the clients and compute nodes individually, or is it something that can be set on the server?
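For reference, the section in question looks roughly like the fragment below. This is a sketch based on the commented-out template in the default clearml.conf; the account name, key and container name are placeholders, not real values. My understanding, which should be treated as an assumption to verify, is that the SDK reads this section on whichever machine performs the upload or download, so it would need to be present on each client and each agent node.

```
sdk {
    azure.storage {
        containers: [
            {
                # Placeholder values: use the storage account credentials
                account_name: "<storage-account-name>"
                account_key: "<storage-account-key>"
                # Optional: restrict these credentials to one container
                container_name: "<container-name>"
            }
        ]
    }
}
```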