Hi guys, I am trying to upload and serve a pre-existing third-party PyTorch model inside my ClearML cluster. However, after following the sequence of operations suggested by the official docs (and later even GPT o3), I am getting errors that I cannot solve.
OK, SuccessfulKoala55, I was partially able to identify one of the incorrect parts of my serving setup:
- Inference for PyTorch models requires a .env file and the clearml-serving-triton-gpu Docker container configured and running.
- Configuring the .env file requires providing the clearml-serving service ID, which was created by clearml-serving create.
- I have multiple services created via that command, since there is no command to remove them, only to create additional ones.
- I found the serving service (and its ID) that is automatically bound to run models, and it behaves differently: no messages about failing to find models.
- BUT INSTEAD it fails with Kafka, which for some reason is running by default and waiting for brokers, clients, etc. Nothing like that is discussed in the docs or the clearml-serving tutorial, so now I am even more confused, to be honest. I didn't create any endpoints or connections to Kafka and related services, and I didn't modify the contents of the clearml-serving-triton docker-compose files at all, only the .env file.
- Also, when I did this and restarted the triton-serving Docker container, the running inference tasks multiplied for some reason. Now I have many duplicates that cannot be stopped from the web UI, and there seems to be no way to remove them via the same web UI either. They also appear misconfigured: either they have no endpoint or model attached, or they have a model, but the wrong one from three months ago. I listed above the only commands I used to create the serving services and add models to serving.
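For context, the setup steps above roughly correspond to the following sequence, reconstructed from my recollection of the docs (service name, IDs, and file names below are placeholders, not my actual values):

```shell
# Create a serving service -- this prints the new service ID
clearml-serving create --name "pytorch-serving"

# Register the model with that service (IDs are placeholders)
clearml-serving --id <service-id> model add --engine triton \
    --endpoint "my_model" --model-id <model-id>

# Put the service ID into the .env file used by the triton-gpu compose file
echo 'CLEARML_SERVING_TASK_ID=<service-id>' >> example.env

# Bring up the GPU serving stack
docker compose --env-file example.env -f docker-compose-triton-gpu.yml up -d
```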
Screenshots are attached, as well as the logs.
Serving task log (the one with the globe icon in the UI):
INFO Executing: ['docker', 'run', '-t', '-e', 'CLEARML_WORKER_ID=lab03:gpuall', '-e', 'CLEARML_DOCKER_IMAGE=', '-v', '/tmp/.clearml_agent.djxlonux.cfg:/root/clearml.conf', '-v', '/root/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/root/.clearml/pip-cache:/root/.cache/pip', '-v', '/root/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/root/.clearml/cache:/clearml_agent_cache', '-v', '/root/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', '', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent==0.17.1 ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 30fb54845e2345358a4701c117cb43b0']
1732496915556 lab03:gpuall DEBUG docker: invalid reference format.
See 'docker run --help'.
What did I do wrong, please, and why did restarting the clearml-serving-triton docker compose produce even more service tasks? :D
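As a possible workaround for the duplicate tasks, I am considering grouping them by name and deleting all but the newest of each via the SDK. This is only a sketch under my assumptions (that Task.get_tasks and task.delete() behave as documented, and that the project name is the default "DevOps" one); whether this is the intended cleanup path is exactly what I am unsure about:

```python
from collections import defaultdict

def pick_duplicates(tasks):
    """Group tasks by name and return every task except the most
    recently created one in each group -- candidates for deletion."""
    groups = defaultdict(list)
    for t in tasks:
        groups[t["name"]].append(t)
    extras = []
    for group in groups.values():
        group.sort(key=lambda t: t["created"])  # oldest first
        extras.extend(group[:-1])               # keep only the newest
    return extras

# With the ClearML SDK this would look roughly like (untested, needs a server;
# "DevOps" is my guess at the default project for serving instances):
#   from clearml import Task
#   tasks = Task.get_tasks(project_name="DevOps")
#   for d in pick_duplicates(
#           [{"name": t.name, "created": t.data.created, "obj": t} for t in tasks]):
#       d["obj"].delete()

if __name__ == "__main__":
    sample = [
        {"name": "clearml-serving-inference", "created": 1},
        {"name": "clearml-serving-inference", "created": 2},
        {"name": "clearml-serving-triton", "created": 3},
    ]
    print([t["created"] for t in pick_duplicates(sample)])  # -> [1]
```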