Issue #337 opened in the clearml repository.
SuccessfulKoala55
I can see the issue you're referring to regarding the execution of the Triton docker image; however, as far as I am aware, this was not something I explicitly specified. The ServingService.launch_service() method of the ServingService class from the clearml-serving package would appear to have both specified:
```
def launch_engine(self, queue_name, queue_id=None, verbose=True):
    # type: (Optional[str], Optional[str], bool) -> None
    """
    ...
```
I have created a GitHub issue on the clearml-serving repo.
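For context, this is roughly how I am launching the service. The import path and constructor arguments here are illustrative only; the launch_engine signature is the one from the snippet above:

```python
# Sketch only: the import path and constructor arguments may differ
# between clearml-serving versions.
from clearml_serving.serving_service import ServingService

# Hypothetical setup; substitute the ID of your serving task.
service = ServingService(task_id="<serving_task_id>")

# Matches the signature shown above:
# launch_engine(queue_name, queue_id=None, verbose=True) -> None
service.launch_engine(queue_name="default")
```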
```
Starting Task Execution:

usage: train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE]
                                                     [--opts ...]

PyTorch Image Classification Trainer - Ed Morris (c) 2021

optional arguments:
  -h, --help     show this help message and exit
  --config FILE  Path and name of configuration file for training. Should be a
                 .yaml file.
  --opts ...     Modify config options using the command-line 'KEY VALUE'
                 p...
```
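For anyone curious, the argparse pattern behind that help text looks roughly like this (a sketch; only the option names and description are taken from the output above, the rest is illustrative):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="PyTorch Image Classification Trainer - Ed Morris (c) 2021"
    )
    parser.add_argument(
        "--config", metavar="FILE",
        help="Path and name of configuration file for training. "
             "Should be a .yaml file.",
    )
    # argparse.REMAINDER swallows everything after --opts, giving the
    # 'KEY VALUE' override pairs described in the help text.
    parser.add_argument("--opts", nargs=argparse.REMAINDER, default=[])
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.config, args.opts)
```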
This job did download pre-trained weights, so the only difference between them is the local dataset cache.
If my memory serves me correctly, I think it happened on weights saving as well, let me just check an experiment log and see.
The following was reported by the agent during the setup phase of the compute environment on the remote compute resource:
Log file is attached.
I don’t have a scooby doo what that pickle file is.
Like AnxiousSeal95 says, clearml server will version a dataset for you and push it to a unified storage place, as well as make it diffable.
I’ve written a workshop on how to train image classifiers for the problem of bird species identification, and recently I’ve adapted it to work with clearml.
There is an example workbook on how to upload a dataset to clearml server, in this case a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add...
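The gist of that notebook, as a minimal sketch using the clearml Dataset API (the project and dataset names here are placeholders):

```python
from clearml import Dataset

# Register a new dataset version on the clearml server.
ds = Dataset.create(
    dataset_name="caltech_birds",           # placeholder name
    dataset_project="image_classification"  # placeholder project
)

# Add a local directory of images, then push and close the version.
ds.add_files(path="data/images")
ds.upload()
ds.finalize()
```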
SuccessfulKoala55 New issue on securing server ports opened on clearml-server repo.
Ohhhhhhhhhhhhhhhhhhhh......that makes sense,
AgitatedDove14 Brilliant!
I will try this, thank you sir!
Right, I am still a bit confused to be honest.
We all remember the days of dataset_v1.2.34_alpha_with_that_thingy_change_-2.zip
EnviousStarfish54 we are in the beginning phases of exploring potential solutions for MLOps, so I have only been playing with the tools, including the dataset side of things. However, I think that an integral part of capturing a model in its entirety is being able to make sure that you know what went into making it. So I see being able to version and diff datasets as just as important as versioning the code, or the environment in which it is run.
When I run the commands you suggested above on the compute node, but on the host system within the conda environment I installed the agent daemon from, I get the same issues as we appear to have seen when executing the Triton inference service.
```
(py38_clearml_serving_git_dev) edmorris@ecm-clearml-compute-gpu-002:~$ python
Python 3.8.10 (default, May 19 2021, 18:05:58)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
...
```
Hi AgitatedDove14 ,
Thanks for your points.
I have updated https://github.com/allegroai/clearml-agent/issues/66 relating to this issue.
For completeness, you were right about it being an issue with PyTorch: there is a breaking issue with PyTorch 1.8.1 and CUDA 11.1 when installed via pip.
It's recommended that PyTorch be installed with Conda to circumvent this issue.
This appears to confirm it as well.
https://github.com/pytorch/pytorch/issues/1158
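For anyone hitting the same thing, the conda route is along these lines (the exact version pins and channels should be checked against the PyTorch install matrix):

```
conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=11.1 -c pytorch -c conda-forge
```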
Thanks AgitatedDove14 , you're very helpful.
Does "--ipc=host" make it a dynamic allocation then?
In my case it's a Tesla P40, which has 24 GB VRAM.
I believe the default shared memory allocation for a docker container is 64 MB, which is obviously not enough for training deep learning image classification networks, but I am unsure of the best solution to the problem.
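A quick way to see the difference between the two approaches (the image and command are just placeholders):

```
# Default /dev/shm inside a container is 64 MB:
docker run --rm ubuntu df -h /dev/shm

# Option 1: raise the shared memory segment explicitly:
docker run --rm --shm-size=8g ubuntu df -h /dev/shm

# Option 2: share the host's IPC namespace (what --ipc=host does), so /dev/shm
# is the host's and is no longer capped at 64 MB:
docker run --rm --ipc=host ubuntu df -h /dev/shm
```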
If I did that, I am pretty sure that's the last thing I'd ever do...... 🤣
I'll just take a screenshot from my company's daily standup of data scientists and software developers..... that'll be enough!
Hi SuccessfulKoala55
Thanks for the input.
I was actually about to grab the new docker-compose.yml and pull the new images.
Weirdly, it was working before, so what's changed? I don't believe I've updated the agents or the clearml SDK on the experiment submission VM either.
I will definitely update the server now, and report back.
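For reference, the upgrade steps I'm following are roughly these (paths assume the standard /opt/clearml install; double-check against the clearml-server upgrade docs):

```
docker-compose -f /opt/clearml/docker-compose.yml down
curl -o /opt/clearml/docker-compose.yml https://raw.githubusercontent.com/allegroai/clearml-server/master/docker/docker-compose.yml
docker-compose -f /opt/clearml/docker-compose.yml pull
docker-compose -f /opt/clearml/docker-compose.yml up -d
```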
SuccessfulKoala55
Good news!
It looks like pulling the new clearml-server version has solved the problem.
I can happily publish models.
Interestingly, I was able to publish models before using this server, so I must have inadvertently updated something that has caused a conflict.
I checked the apiserver.log file in /opt/clearml/logs and this appears to be the related error when I try to publish an experiment:
```
[2021-06-07 13:43:40,239] [9] [ERROR] [clearml.service_repo] ValidationError (Task:8a4a13bad8334d8bb53d7edb61671ba9) (setup_shell_script.StringField only accepts string values: ['container'])
Traceback (most recent call last):
  File "/opt/clearml/apiserver/bll/task/task_operations.py", line 325, in publish_task
    raise ex
  File "/opt/clearml/a...
```
AgitatedDove14 Ok I can do that.
I was just thinking it through.
Would this be best if it were executed in the Triton execution environment?