` Starting Task Execution:
usage: train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE]
                                                     [--opts ...]

PyTorch Image Classification Trainer - Ed Morris (c) 2021

optional arguments:
  -h, --help     show this help message and exit
  --config FILE  Path and name of configuration file for training. Should be a
                 .yaml file.
  --opts ...     Modify config options using the command-line 'KEY VALUE'
                 p...
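For context, a minimal sketch of an argparse setup that would produce help output like the above; the script name and option texts come from the output itself, while the completion of the truncated --opts help and the exact parser wiring are assumptions on my part:

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of the CLI shown in the help output above.
    parser = argparse.ArgumentParser(
        description="PyTorch Image Classification Trainer - Ed Morris (c) 2021"
    )
    parser.add_argument(
        "--config", metavar="FILE",
        help="Path and name of configuration file for training. Should be a .yaml file.",
    )
    # '--opts ...' swallows the rest of the command line as KEY VALUE pairs (assumed ending)
    parser.add_argument(
        "--opts", nargs=argparse.REMAINDER, default=[],
        help="Modify config options using the command-line 'KEY VALUE' pairs.",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.config, args.opts)
```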
SuccessfulKoala55
I can see the issue you are referring to regarding the execution of the Triton docker image, however as far as I am aware, this was not something I explicitly specified. The ServingService.launch_service() method of the ServingService class in the clearml-serving package would appear to have both specified:
` def launch_engine(self, queue_name, queue_id=None, verbose=True):
        # type: (Optional[str], Optional[str], bool) -> None
        """
...
This appears to confirm it as well.
https://github.com/pytorch/pytorch/issues/1158
Thanks AgitatedDove14 , you're very helpful.
I believe the default shared memory allocation for a Docker container is 64 MB, which is obviously not enough for training deep learning image classification networks, but I am unsure of the best solution to fix the problem.
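For reference, the quickest way I know to confirm and raise that limit when launching a container by hand is the --shm-size flag (the 8g value here is just an arbitrary example); the same flag would need to reach whatever actually launches the training container:

```
# Show the default 64 MB /dev/shm, then the same container with a larger allocation
docker run --rm ubuntu:20.04 df -h /dev/shm
docker run --rm --shm-size=8g ubuntu:20.04 df -h /dev/shm
```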
Good question, SuccessfulKoala55
My thoughts are orbiting around environment orchestration and having a bit more control over how an environment is created. I understand that the easiest form of the configuration is to implement it on the clearml-agent side and run a daemon with the configuration as required, whether that be using venvs or docker containers. Of course, this limits the deployment type to the queue that the daemon is listening to.
I was considering if that by exposing the...
I have rerun the serving example with my PyTorch job, but this time I have followed the MNIST Keras example.
I appended a GPU compute resource to the default queue and then executed the service on the default queue.
This resulted in a Triton serving engine container spinning up on the compute resource, however it failed due to the previous issue with port conflicts:
` 2021-06-08 16:28:49
task f2fbb3218e8243be9f6ab37badbb4856 pulled from 2c28e5db27e24f348e1ff06ba93e80c5 by worker ecm-clear...
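A quick way to see what is already bound on the node, assuming the engine is trying to grab the Triton defaults (8000 HTTP, 8001 gRPC, 8002 metrics):

```
# List any listeners already occupying the Triton default ports
ss -ltnp | grep -E ':(8000|8001|8002)([[:space:]]|$)' || echo "ports are free"
```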
SuccessfulKoala55 I may have made some progress with this bug, but have stumbled onto another issue in getting the Triton service up and running.
See comments in the github issue.
AgitatedDove14 I would love to help the project.
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything. However, once settled in the new place, and I have a bit more time, I would very much welcome contributing.
AgitatedDove14 Thanks for that.
I suppose the same would need to be done for any client PC running clearml from which you are submitting dataset upload jobs?
That is, if the dataset is local to my laptop, or on a development VM that is not in the clearml system, and from there I want to submit a copy of the dataset, then I would need to configure the storage section in the same way as well?
I assume the account name and key refer to the storage account credentials that you can f...
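For what it's worth, this is the shape of the storage section I'm referring to in clearml.conf on the submitting machine; the placeholder values (and my reading that the key is the storage account access key) are assumptions:

```
sdk {
    azure.storage {
        containers: [
            {
                account_name: "<storage-account-name>"
                account_key: "<storage-account-access-key>"
                container_name: "<blob-container-name>"
            }
        ]
    }
}
```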
I'll just take a screenshot from my company's daily standup of data scientists and software developers..... that'll be enough!
In my case it's a Tesla P40, which has 24 GB VRAM.
If I did that, I am pretty sure that's the last thing I'd ever do...... 🤣
I checked the apiserver.log file in /opt/clearml/logs and this appears to be the related error when I try to publish an experiment:
` [2021-06-07 13:43:40,239] [9] [ERROR] [clearml.service_repo] ValidationError (Task:8a4a13bad8334d8bb53d7edb61671ba9) (setup_shell_script.StringField only accepts string values: ['container'])
Traceback (most recent call last):
  File "/opt/clearml/apiserver/bll/task/task_operations.py", line 325, in publish_task
    raise ex
  File "/opt/clearml/a...
SuccessfulKoala55
Good news!
It looks like pulling the new clearml-server version has solved the problem.
I can happily publish models.
Interestingly, I was able to publish models before using this server, so I must have inadvertently updated something that has caused a conflict.
Pffff security.
Data scientist be like....... 😀
Network infrastructure person be like ...... 😱
This might be a silly question, but in order to get the inference working, I am assuming that no specific inference script has to be written for handling the model?
This is what the clearml-serving package takes care of, correct?
SuccessfulKoala55 A second queued job, which executed on the same node but this time didn't need to cache the dataset locally (that had already been done by the previous experiment), hasn't had this issue.
That all being said, apart from the console reporting looking messy, it doesn't appear to have impacted the training, or indeed the metric collection of the first experiment where it occurred.
This job did download pre-trained weights, so the only difference between them is the local dataset cache.
Hi SuccessfulKoala55
Thanks for the input.
I was actually about to grab the new docker-compose.yml and pull the new images.
Weirdly it was working before, so what's changed?
I don't believe I've updated the agents or the clearml SDK on the experiment submission VM either.
I will definitely update the server now, and report back.
FYI, I am training the model again, this time in a project which is not nested, just to rule out any funnies with regards to issues with nested projects.
Oh, so this applies to VRAM, not RAM?
Yup, I can confirm that's the case.
I have just literally installed the latest commit via the master branch and it works.
AgitatedDove14
Ok, after a huge configuration-file detour, we are now back to fixing genuine issues here.
To recap, in order to get the Triton container to run and to be able to connect to Azure Blob Storage, the following changes were made to the launch_engine method of the ServingService class:
For the task creation call:
The docker string was changed to remove the port specifications [to avoid the port conflict error]. The addition of the packages argument was required, as the doc...
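For illustration only (this is not the actual clearml-serving source), a sketch of roughly what such a task creation call could look like after the change; the project/task names, image tag and package list are all assumptions:

```python
from clearml import Task

# Hypothetical sketch of the engine task creation after the changes described above:
# no "-p 8000:8000"-style port mappings in the docker string, and the extra
# python packages the engine needs passed explicitly.
engine_task = Task.create(
    project_name="serving examples",                 # assumed
    task_name="triton serving engine",               # assumed
    docker="nvcr.io/nvidia/tritonserver:21.03-py3",  # image only, no port arguments
    packages=["clearml", "azure-storage-blob"],      # assumed package list
)
```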
Just another thought, couldn't this be caused by using a non-default location for clearml.conf?
I have a clearml.conf in the default location which is configured for the training agents, and I created a separate one for the inference service and put it in a subfolder of my home dir. The agent on the default queue to be used for inference serving was executed using clearml-agent daemon --config-file /path/to/clearml.conf
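Concretely, the two kinds of agents were started along these lines (queue names and paths here are just placeholders):

```
# Training agents, picking up the clearml.conf in the default location
clearml-agent daemon --queue gpu-training --docker

# Inference-serving agent on the default queue, pointed at the separate config
clearml-agent daemon --queue default --docker --config-file ~/serving/clearml.conf
```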
Ok I think I managed to create a docker image of the Triton instance server, just putting the kids to bed, will have a play afterwards.
After finally getting the model to be recognized by the Triton server, it now fails with the attached error messages.
Any ideas AgitatedDove14 ?