[Please note: all of the below was executed on the command line of the compute node.]


I have rerun the serving example with my PyTorch job, but this time following the MNIST Keras example.
I attached a GPU compute resource to the default queue and then executed the service on the default queue.
This resulted in a Triton serving engine container spinning up on the compute resource; however, it failed due to the previously reported port conflict issue:

2021-06-08 16:28:49 task f2fbb3218e8243be9f6ab37badbb4856 pulled from 2c28e5db27e24f348e1ff06ba93e80c5 by worker ecm-clearml-compute-gpu-002:0
2021-06-08 16:28:49 Running Task f2fbb3218e8243be9f6ab37badbb4856 inside docker: nvcr.io/nvidia/tritonserver:21.03-py3 arguments: ['--ipc=host', '-p', '8000:8000', '-p', '8001:8001', '-p', '8002:8002']
2021-06-08 16:28:50 Executing: ['docker', 'run', '-t', '--gpus', 'all', '--ipc=host', '-p', '8000:8000', '-p', '8001:8001', '-p', '8002:8002', '-e', 'CLEARML_WORKER_ID=ecm-clearml-compute-gpu-002:0', '-e', 'CLEARML_DOCKER_IMAGE=nvcr.io/nvidia/tritonserver:21.03-py3 --ipc=host -p 8000:8000 -p 8001:8001 -p 8002:8002', '-v', '/tmp/.clearml_agent.ft8vulpe.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.j9b8arhf:/root/.ssh', '-v', '/home/edmorris/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/edmorris/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/edmorris/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/edmorris/.clearml/cache:/clearml_agent_cache', '-v', '/home/edmorris/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvcr.io/nvidia/tritonserver:21.03-py3', 'bash', '-c', 'apt-get update ; apt-get install -y git ; . /opt/conda/etc/profile.d/conda.sh ; conda activate base ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id f2fbb3218e8243be9f6ab37badbb4856']
2021-06-08 16:28:55 docker: Error response from daemon: driver failed programming external connectivity on endpoint wonderful_galileo (0c2feca5684f2f71b11fa1e8da4550d42b23c456e52ba0069d0aae64cd75f55b): Error starting userland proxy: listen tcp4 0.0.0.0:8001: bind: address already in use.
2021-06-08 16:28:55 Process failed, exit code 125
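
For context, a quick way to see what is already bound to port 8001 on the compute node (this is only a sketch, assuming the conflict comes from another process or container on the same host rather than something else):

# show which process is listening on 8001 (requires iproute2)
sudo ss -ltnp | grep ':8001'

# or check whether another container already publishes that port
docker ps --filter "publish=8001"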

  
  