Hi GentleSwallow91 ! Thanks for the warm words! 🙏 😍
Basically:
Redis is used for various temporary server state caches and for worker state management.
Elastic is used for metrics storage and indexing (logs, scalars, plots, debug images, etc.).
Mongo is used for all other metadata and artifact references.
Actual artifact data, debug images and model weights are stored in object storage, starting with the built-in ClearML fileserver and including S3, GCS, Azure storage and similar solutions.
Also, if you want to share any frustration and feedback on setting up, we're always looking to improve and provide a better experience 🙂
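For quick reference, the storage split described in this message can be written down as a small lookup table. This is only a summary sketch: the dictionary keys and the `describe` helper are illustrative names, not ClearML identifiers or configuration keys.

```python
# Summary of which backend holds which kind of data, as described above.
# Keys and the helper name are illustrative, not ClearML identifiers.
STORAGE_ROLES = {
    "redis": "temporary server state caches and worker state management",
    "elastic": "metrics storage and indexing (logs, scalars, plots, debug images)",
    "mongo": "all other metadata and artifact references",
    "object storage": "artifact data, debug images and model weights",
}


def describe(backend: str) -> str:
    """Return the role of a backend, or a note that it is not covered here."""
    return STORAGE_ROLES.get(backend.lower(), "not covered in this summary")


if __name__ == "__main__":
    for name, role in STORAGE_ROLES.items():
        print(f"{name}: {role}")
```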
Thanks Jake SuccessfulKoala55 !
I used to have problems with ClearML agents and multi-GPU training with agents, but I have put that on hold.
Now my problem is with ClearML serving.
I have managed to run a demo https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_tutorial
But I had problems with this command:
`clearml-serving --id c605bf64db3740989afdd9bee87e6353 model add --engine sklearn --endpoint "test_model_sklearn" --preprocess "examples/sklearn/preprocess.py" --name "initial model training" --project "serving in action"`
This command never completed successfully because it could not find a model. I am specifying the project and task name that were created while training a model with `python3 examples/sklearn/train_model.py`.
I managed to get further by specifying the model-id parameter.
Otherwise I would get the following:
```
$ clearml-serving --id c605bf64db3740989afdd9bee87e6353 model add --engine sklearn --endpoint "test_model_sklearn" --preprocess "examples/sklearn/preprocess.py" --name "initial model training" --project "serving in action"
clearml-serving - CLI for launching ClearML serving engine
Serving service Task c605bf64db3740989afdd9bee87e6353, Adding Model endpoint '/test_model_sklearn/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Error: Could not fine any Model to serve {'project_name': 'serving in action', 'model_name': 'initial model training', 'tags': None, 'only_published': False, 'include_archived': False}
```
A minor thing: the error message should be changed to "Could not find any Model".
I think the problem was that the model name is not the experiment name alone: model name = experiment name + "sklearn model" (see the screenshot).
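One way to check which model names are actually registered in a project is to list them from Python before calling `model add`. The following is a minimal sketch, assuming the ClearML Python SDK is installed and configured against the same server. `Model.query_models` is the SDK call; the helper names `candidate_names` and `list_project_model_names` are hypothetical, and so is the model name used in the usage comment.

```python
from typing import Iterable, List


def candidate_names(all_names: Iterable[str], experiment_name: str) -> List[str]:
    """Return the model names that contain the experiment name (case-insensitive)."""
    needle = experiment_name.lower()
    return [name for name in all_names if needle in name.lower()]


def list_project_model_names(project_name: str) -> List[str]:
    """List the names of all models registered under a ClearML project."""
    # Imported lazily so candidate_names() works without ClearML installed.
    from clearml import Model

    return [model.name for model in Model.query_models(project_name=project_name)]


# Usage (requires a configured ClearML server):
#   names = list_project_model_names("serving in action")
#   print(candidate_names(names, "initial model training"))
# If the registered name turns out to be longer than the experiment name
# (hypothetically something like "initial model training - sklearn-model"),
# that full name is what a name-based lookup would have to match.
```

The filtering is kept separate from the server query so the matching logic can be checked offline.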
All in all, the test serving worked out and I have a bunch of tasks in the DevOps project.
The tutorial lacks an explanation of what is happening under the hood, i.e.:
What does the serving service controller do?
What do all the serving containers do?
Container clearml-serving-alertmanager
Container clearml-serving-inference
Container clearml-serving-statistics
I have a bunch of tasks running in the DevOps project, some with identical names. Is that normal?
Hi GentleSwallow91 let me try and answer your questions 😄
The serving service controller is basically the main Task that controls the serving functionality itself. AFAIK:
clearml-serving-alertmanager - a container that runs the Prometheus Alertmanager ( https://prometheus.io/docs/alerting/latest/alertmanager/ )
clearml-serving-inference - the container that runs the inference code
clearml-serving-statistics - I believe it runs software that reports statistics to Prometheus, both generic and user-defined ones
Are you sure these are not old runs that were never terminated? Or, once you terminate your clearml-serving, does it close all of them?