SuccessfulKoala55 However, this was the first time an experiment with this dataset was executed on this compute node. I have been doing a lot of trial and error with this setup to get the models training, and on my first compute node I had the data downloaded locally quite early on, so I haven't seen the script have to download a local dataset cache, as it was already done.
` Starting Task Execution:
usage: train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE]
[--opts ...]
PyTorch Image Classification Trainer - Ed Morris (c) 2021
optional arguments:
-h, --help show this help message and exit
--config FILE Path and name of configuration file for training. Should be a
.yaml file.
--opts ... Modify config options using the command-line 'KEY VALUE'
p...
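For reference, a parser that produces this help output would look roughly like the sketch below; the real get_parser in cub_tools.args may differ in the details.
` import argparse

def get_parser():
    # Sketch of an argparse parser matching the help text above;
    # cub_tools.args.get_parser is assumed to look something like this.
    parser = argparse.ArgumentParser(
        description='PyTorch Image Classification Trainer - Ed Morris (c) 2021')
    parser.add_argument(
        '--config', metavar='FILE',
        help='Path and name of configuration file for training. Should be a .yaml file.')
    parser.add_argument(
        '--opts', nargs=argparse.REMAINDER, default=None,
        help="Modify config options using the command-line 'KEY VALUE' pairs.")
    return parser `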
AgitatedDove14 Brilliant!
I will try this, thank you sir!
The following code is the training script that was used to set up the experiment. This code has been executed on the server in a separate conda environment and verified to run fine (minus the clearml code).
` from __future__ import print_function, division
import os, pathlib
# ClearML experiment
from clearml import Task, StorageManager, Dataset
# Local modules
from cub_tools.trainer import Ignite_Trainer
from cub_tools.args import get_parser
from cub_tools.config import get_cfg_defaults
#...
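For context, after those imports the script wires things together roughly as follows. This is a sketch only: the project and dataset names are placeholders, and the Ignite_Trainer call is an assumed interface rather than the exact code.
` # Hypothetical continuation of the training script (names are illustrative).
args = get_parser().parse_args()

# Build the configuration from the defaults, the YAML file and any CLI overrides.
cfg = get_cfg_defaults()
if args.config:
    cfg.merge_from_file(args.config)
if args.opts:
    cfg.merge_from_list(args.opts)

# Register the run with the ClearML server so code, config and metrics are tracked.
task = Task.init(project_name='Caltech Birds', task_name='Train PyTorch Ignite network')

# Pull a locally cached copy of the dataset version registered on the ClearML server.
data_dir = Dataset.get(dataset_name='caltech_birds', dataset_project='Caltech Birds').get_local_copy()

# Assumed Ignite_Trainer interface from cub_tools; the real signature may differ.
trainer = Ignite_Trainer(config=cfg, data_dir=data_dir)
trainer.run() `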
Ah ok, so it's the query string you use with the SAS box. Great.
I have created a GitHub issue on the clearml-serving repo.
Thanks CostlyOstrich36, you can also get access to the keys in the Azure Storage Explorer.
Looking at the Properties section gives the secure keys.
I checked the apiserver.log file in /opt/clearml/logs and this appears to be the related error when I try to publish an experiment:
` [2021-06-07 13:43:40,239] [9] [ERROR] [clearml.service_repo] ValidationError (Task:8a4a13bad8334d8bb53d7edb61671ba9) (setup_shell_script.StringField only accepts string values: ['container'])
Traceback (most recent call last):
File "/opt/clearml/apiserver/bll/task/task_operations.py", line 325, in publish_task
raise ex
File "/opt/clearml/a...
Hi SuccessfulKoala55
Thanks for the input.
I was actually about to grab the new docker-compose.yml and pull the new images.
Weirdly it was working before, so what's changed?
I don't believe I've updated the agents or the clearml sdk on the experiment submission vm either.
I will definitely update the server now, and report back.
SuccessfulKoala55
Good news!
It looks like pulling the new clearml-server version has solved the problem.
I can happily publish models.
Interestingly, I was able to publish models before using this server, so I must have inadvertently updated something that has caused a conflict.
FYI, I am training the model again, this time in a project which is not nested, just to rule out any funnies with nested projects.
Like AnxiousSeal95 says, clearml server will version a dataset for you and push it to a unified storage place, as well as make it diff-able.
I’ve written a workshop on how to train image classifiers for the problem of bird species identification, and recently I’ve adapted it to work with clearml.
There is an example notebook on how to upload a dataset to the clearml server, in this case a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add...
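For anyone following along, the core of that notebook boils down to something like the snippet below (the dataset name, project name and local path are placeholders, not the exact values from the notebook):
` from clearml import Dataset

# Register a directory of images as a new dataset version on the ClearML server.
dataset = Dataset.create(
    dataset_name='caltech_birds',
    dataset_project='Caltech Birds',
)
dataset.add_files(path='data/images')   # recursively adds the image files
dataset.upload()                        # push the files to the configured storage target
dataset.finalize()                      # close this version so experiments can reference it `
A training job can then pull a cached local copy of that exact version with Dataset.get(...).get_local_copy(), as in the training script above.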
You need to make sure the user is part of the docker group.
Follow these commands after installing the Docker engine, and don't forget to restart the terminal session for the changes to take full effect.
` sudo groupadd docker
sudo usermod -aG docker ${USER} `
Don't install the Docker engine as root, your sysadmin will have kittens!
I think perhaps as standard, the group docker is already created.
The bit that isn't done is making your user part of that group.
SuccessfulKoala55
I can see the issue you are referring to regarding the execution of the Triton docker image, however as far as I am aware this was not something I explicitly specified. The ServingService.launch_service() method of the ServingService class from the clearml-serving package would appear to have both specified:
` def launch_engine(self, queue_name, queue_id=None, verbose=True):
# type: (Optional[str], Optional[str], bool) -> None
"""
...
I have rerun the serving example with my PyTorch job, but this time I have followed the MNIST Keras example.
I appended a GPU compute resource to the default queue and then executed the service on the default queue.
This resulted in a Triton serving engine container spinning up on the compute resource, however it failed due to the previous issue with port conflicts:
` 2021-06-08 16:28:49
task f2fbb3218e8243be9f6ab37badbb4856 pulled from 2c28e5db27e24f348e1ff06ba93e80c5 by worker ecm-clear...
This might be a silly question, but in order to get inference working, I am assuming that no specific inference script has to be written to handle the model?
This is what the clearml-serving package takes care of, correct?
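If so, then I'd only expect the client side to do the preprocessing and post a tensor to Triton's standard HTTP inference endpoint, something along these lines. This is a sketch: the URL, model name and tensor names are assumptions and will depend on what clearml-serving actually configures for the deployed model.
` import numpy as np
import requests

# Hypothetical endpoint and model name; the real values depend on where the
# Triton engine is running and how clearml-serving registered the model.
url = 'http://localhost:8000/v2/models/cub_classifier/infer'

# Stand-in for a preprocessed image batch (1 x 3 x 224 x 224, float32).
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

payload = {
    'inputs': [{
        'name': 'input__0',              # assumed PyTorch-style input tensor name
        'shape': list(image.shape),
        'datatype': 'FP32',
        'data': image.flatten().tolist(),
    }]
}

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json()['outputs'][0]['data'][:5])   # first few raw logits `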
SuccessfulKoala55 I may have made some progress with this bug, but have stumbled onto another issue in getting the Triton service up and running.
See comments in the GitHub issue.
I dip in and out of Docker, and that one gets me almost every time!
I love the new design of the site.
When is clearml-deploy coming to the open source release?
Or is this a commercial-only part?
When I run the commands you suggested above on the compute node, but on the host system within the conda environment I installed to run the agent daemon from, I get the same issues we appear to have seen when executing the Triton inference service.
` (py38_clearml_serving_git_dev) edmorris@ecm-clearml-compute-gpu-002:~$ python
Python 3.8.10 (default, May 19 2021, 18:05:58)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
...
SuccessfulKoala55 A second queued job, which executed on the same node but this time didn't need to cache the dataset locally as that had been done by the previous experiment, hasn't had this issue.
That all being said, apart from the console reporting looking messy, it doesn't appear to have impacted the training, or indeed the metric collection, of the first experiment where it occurred.
AgitatedDove14
So can you verify it can download the model?
Unfortunately it's still falling over, but then I got the same result for the credentials using both URI strings, the original and the modified version, so it points to something else going on.
I note that the StorageHelper.get() method has a call which modifies the URI prior to it being passed to the function which gets the storage account and container name. However, when I run this locally, it doesn't seem to do a...
My bad you are correct, it is as you say.
Understood.
SuccessfulKoala55 I point you to my disclaimer above......😬
SuccessfulKoala55
SUCCESS!!!
This appears to be working.
Set up the certificates using sudo certbot --nginx.
Then edit the default configuration file in /etc/nginx/sites-available:
` server {
    listen 80;
    return 301 https://$host$request_uri;
}
server {
    listen 443;
    server_name your-domain-name;
    ssl_certificate /etc/letsencrypt/live/your-domain-name/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain-name/privkey.pem;
...
WearyLeopard29 no I wasn’t able to do that although I didn’t explicitly try.
I was wondering if this was as high a security risk as the web portal?
Access is controlled by keys, whereas the web portal is not.
I admit I’m a data scientist, so any proper IT security person would probably end up a shivering wreck in the corner of the room if they saw some of my common security practices. I do try to be secure, but I am not sure how good I am at it.
So moving on to the container name.
Original code has the following calls:
` if not f.path.segments:
    raise ValueError(
        "URI {} is missing a container name (expected "
        "[https/azure]://<account-name>.../<container-name>)".format(uri)
    )
container = f.path.segments[0] `
Repeating the same commands locally results in the following:
` >>> f_a.path.segments
['artefacts', 'Caltech Birds%2FTraining', 'TRAIN...
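To make the comparison concrete, this is roughly what the parsing looks like when done by hand with furl on a placeholder artefact URI (the account, container and blob names below are made up, not my real paths):
` from furl import furl

# Placeholder Azure artefact URI; the real one comes from the experiment's output URL.
uri = 'azure://myaccount.blob.core.windows.net/artefacts/some_experiment/model_best.pth'

f = furl(uri)
print(f.host)             # storage account host, e.g. 'myaccount.blob.core.windows.net'
print(f.path.segments)    # e.g. ['artefacts', 'some_experiment', 'model_best.pth']

# StorageHelper treats the first path segment as the container name, which is
# why an empty segments list triggers the "missing a container name" ValueError.
container = f.path.segments[0] if f.path.segments else None
print(container)          # 'artefacts' `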