The following code is the training script that was used to set up the experiment. This code has been executed on the server in a separate conda environment and verified to run fine (minus the ClearML code).
```python
from __future__ import print_function, division
import os, pathlib

# ClearML experiment
from clearml import Task, StorageManager, Dataset

# Local modules
from cub_tools.trainer import Ignite_Trainer
from cub_tools.args import get_parser
from cub_tools.config import get_cfg_defaults
# ...
```
The error after the first iteration is as follows:
```
[INFO] Executing model training...
1621437621593 ecm-clearml-compute-gpu-001:0 DEBUG Epoch: 0001 TrAcc: 0.296 ValAcc: 0.005 TrPrec: 0.393 ValPrec: 0.000 TrRec: 0.296 ValRec: 0.005 TrF1: 0.262 ValF1: 0.000 TrTopK: 0.613 ValTopK: 0.026 TrLoss: 3.506 ValLoss: 5.299
Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger...
```
I should say that the company I work for, Malvern Panalytical, is developing an internal MLOps capability, and we are starting to build a containerized deployment system for developing, training and deploying machine learning models. Right now we are at the early stages of development, and our current solution is based on Azure MLOps, which I personally find very clunky.
So I have been tasked with investigating alternatives to replace the training and model deployment side of thing...
This was the code:
```python
import os
import argparse

# ClearML modules
from clearml import Dataset

parser = argparse.ArgumentParser(description='CUB200 2011 ClearML data uploader - Ed Morris (c) 2021')
parser.add_argument(
    '--dataset-basedir',
    dest='dataset_basedir',
    type=str,
    help='The directory to the root of the dataset',
    default='/home/edmorris/projects/image_classification/caltech_birds/data/images')
parser.add_argument(
    '--clearml-project',
    dest='clearml_projec...
```
Fixes and identified issues can be found in these GitHub comments.
Closing the discussion here.
Pffff security.
Data scientist be like....... 😀
Network infrastructure person be like ...... 😱
I also think clearing out the venv directory before executing the change of package manager really helped.
Is there a helper function option at all that means you can flush the clearml-agent working space automatically, or by command?
SuccessfulKoala55 I may have made some progress with this bug, but have stumbled onto another issue in getting the Triton service up and running.
See comments in the github issue.
Or even better dataset_v1.2.34_alpha_with_that_thingy_change_-2_copy_copy.zip
So, AgitatedDove14 what I really like about the approach with ClearML is that you can genuinely bring the architecture into the development process early. That has a lot of desirable outcomes, including versioning and recording of experiments, dataset versioning etc. Also it would enforce a bit more structure in project development, if things are required to fit into a bit more of a defined box (or boxes). However, it also seems to be not too prescriptive, such that I would worry that a lot...
You need to make sure the user is part of the docker group.
Follow these commands post install of Docker engine, and don't forget to restart the terminal session for the changes to take full effect.

```shell
sudo groupadd docker
sudo usermod -aG docker ${USER}
```

Don't install Docker engine as root, your sysadmin will have kittens!
So moving onto the container name.
Original code has the following calls:

```python
if not f.path.segments:
    raise ValueError(
        "URI {} is missing a container name (expected "
        "[https/azure]://<account-name>.../<container-name>)".format(uri)
    )
container = f.path.segments[0]
```
Repeating the same commands locally results in the following:
```
>>> f_a.path.segments
['artefacts', 'Caltech Birds%2FTraining', 'TRAIN...
```
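For illustration, the same "first path segment is the container" logic can be reproduced with just the standard library (the URI below is made up, not one from the actual storage account):

```python
from urllib.parse import urlparse

# Hypothetical Azure blob-storage URI; account and container names are illustrative.
uri = "https://myaccount.blob.core.windows.net/my-container/models/model.pt"

# The first non-empty path segment is taken as the container name,
# mirroring the f.path.segments[0] logic above.
segments = [s for s in urlparse(uri).path.split("/") if s]
container = segments[0]
print(container)  # my-container
```

So if the URI is built incorrectly upstream, the wrong segment silently becomes the "container" and everything downstream breaks.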
AnxiousSeal95 , I would also warmly second what EnviousStarfish54 says regarding end to end use cases of real case studies, with a dataset that is more realistic than say MNIST or the like, so it is easier to see how to structure things.
I understand one of the drivers has been flexibility combined with robustness when you need it; however, as a reference point from the people who made it, examples of how you, the creators, would structure things would help our thinking about how we might use it....
This potentially might be a silly question, but in order to get the inference working, I am assuming that no specific inference script has to be written for handling the model?
This is what the clearml-serving package takes care of, correct?
Oh it's a load balancer, so it does that and more.
But I suppose the point holds though, it provides an end-point for external locations, and then handles the routing to the correct resources.
What I really like about ClearML is the potential for capturing development at an early stage, as it requires only minimal adjustment of code for it to be, at the very least, captured as an experiment, even if it is run locally on one's machine.
What we would like ideally, is a system where development, training, and deployment are almost one and the same thing, to reduce the lead time from development code to production models. Removing as many translation layers as you can between the developmen...
Issue #337 opened in the clearml repository.
After finally getting the model to be recognized by the Triton server, it now fails with the attached error messages.
Any ideas AgitatedDove14 ?
Just another thought: couldn't this be caused by using a non-default location for clearml.conf?
I have a clearml.conf in the default location which is configured for training agents, and I created a separate one for the inference service and put it in a subfolder of my home dir. The agent on the default queue to be used for inference serving was executed using `clearml-agent daemon --config-file /path/to/clearml.conf`
This one got me the first time I set it up as well.
The default settings are taken from the environment variables, the docker-compose file and the code itself.
You only need to add configuration items that you want to change from the defaults.
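As a concrete illustration (the server URLs are the public ClearML defaults, the credentials are placeholders, and this is a sketch rather than a tested config), a minimal clearml.conf can carry just the api block and leave everything else at its defaults:

```
api {
    web_server: https://app.clear.ml
    api_server: https://api.clear.ml
    files_server: https://files.clear.ml
    credentials {
        access_key: "YOUR_ACCESS_KEY"
        secret_key: "YOUR_SECRET_KEY"
    }
}
```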
It’s an Ignite-framework-trained PyTorch model using one of the three well-known vision model packages: TIMM, PyTorchCV or Torchvision.
Hmmmm, I thought it logged it with the terminal results when it was uploading weights, but perhaps that's only the live version and the saved version is pruned? Or my memory is wrong.... it is Friday after all!
Can't find any more references to it, sorry.
Like AnxiousSeal95 says, clearml server will version a dataset for you and push it to a unified storage place, as well as make it differenceable.
I’ve written a workshop on how to train image classifiers for the problem of bird species identification and recently I’ve adapted it to work with clearml.
There is an example workbook on how to upload a dataset to clearml server, in this a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add...
WearyLeopard29 no I wasn’t able to do that although I didn’t explicitly try.
I was wondering if this was as high a security risk as the web portal?
Access is controlled by keys, whereas the web portal is not.
I admit I’m a data scientist, so any proper IT security person would probably end up a shivering wreck in the corner of the room if they saw some of my common security practises. I do try to be secure, but I am not sure how good I am at it.
AgitatedDove14 I think the major issue is working out how to get the setup of the node dynamically passed to the VMSS, so when it creates a node it does the following:
1. Provisions the correct environment for the clearml-agent.
2. Installs the clearml-agent and sets up the clearml.conf file with the access credentials for the server and file storage.
3. Executes the clearml-agent on the correct queue, ready for accepting jobs.
In Azure VMSS, there is a method called "Cust...
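The three steps above could be sketched as a bootstrap-script generator along these lines (a minimal sketch: the pip package name and agent flags are the standard ones, but the queue name, conf contents and credential values are placeholders, and this has not been tested against a real VMSS):

```python
def render_bootstrap(queue: str, access_key: str, secret_key: str) -> str:
    """Render a hypothetical VMSS custom-script that installs a ClearML agent,
    writes a minimal clearml.conf with the server credentials, and starts the
    agent daemon on the given queue. Values are illustrative placeholders."""
    conf = "\n".join([
        "api {",
        "    api_server: https://api.clear.ml",
        f'    credentials {{ access_key: "{access_key}", secret_key: "{secret_key}" }}',
        "}",
    ])
    return "\n".join([
        "#!/bin/bash",
        # Step 1+2: provision the environment and install the agent.
        "python3 -m pip install clearml-agent",
        # Step 2 (cont.): write the clearml.conf with the access credentials.
        f"cat > ~/clearml.conf <<'EOF'\n{conf}\nEOF",
        # Step 3: start the agent daemon on the correct queue.
        f"clearml-agent daemon --queue {queue} --detached",
    ])

script = render_bootstrap("default", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
```

The rendered script is what would then be handed to the VMSS custom-script mechanism so that each freshly created node bootstraps itself.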
If I did that, I am pretty sure that's the last thing I'd ever do...... 🤣
AgitatedDove14 ,
Often, questions are asked at the beginning of a data science project such as "how long will that take?" or "what are the chances it will work to this accuracy?".
To the uninitiated, these would seem like relatively innocent and easy to answer questions. If a person has a project management background, with more clearly defined technical tasks like software development or mechanical engineering, then often work packages and uncertainties relating to outcomes are m...
Crawls out from under the table and takes a deep breath
AgitatedDove14 you remember we talked about it being a bug or a stupid.....
Well, it's a stupid by me.... somehow I managed to propagate irregularities in the clearml.conf file such that it successfully loaded, but the expected nested structure was not there.
When the get_local_copy() method requested the model, it correctly got the azure credentials; however, when the StorageHelper class tries to get the azure cr...
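For anyone hitting the same thing, the nested azure section that the SDK expects looks roughly like this (a sketch from memory; the account values are placeholders, so check the shipped clearml.conf template for the authoritative layout):

```
sdk {
    azure.storage {
        containers: [
            {
                account_name: "myaccount"
                account_key: "mykey"
                container_name: "mycontainer"
            }
        ]
    }
}
```

The failure mode above was exactly this block being flattened, so the file parsed fine but the lookup for the nested keys came back empty.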