I Originally Posted In

Answered

I originally posted in but have reposted here upon request! 😸

Hey guys!

We are trialling the ClearML session as a workflow for our team to train their models. If successful, we think the tool could be a major part of our workflow!

We have a custom docker container running, in which we have our own docker entrypoint where we authenticate to Tailscale so our sessions get their own private IP. As this is the case, we have our own SSH config that we'd like to use.

However, whenever we spin up a session, https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L583 always gets run and overwrites our configs.
As exec "$@" is the last line in our entrypoint (to get the ClearML setup to finish gracefully), I cannot find a way to edit configs after this file is run in the docker image.

Does anyone have any suggestions on how to modify the interactive_session_task.py initialization in the flow or, more simply, how to edit configs after the exec "$@" line that calls the final ClearML setup?

FYI, this is what the exec command runs. I assume the interactive_session_task is called by the final clearml_agent execute?

bash -c echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ;
chown -R root /root/.cache/pip ;
export DEBIAN_FRONTEND=noninteractive ;
export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ;
[ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ;
declare LOCAL_PYTHON ;
for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ;
[ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ;
[ -z "$CLEARML_APT_INSTALL" ] || (apt-get update && apt-get install -y $CLEARML_APT_INSTALL) ;
[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ;
$LOCAL_PYTHON -m pip install -U "pip<20.2" ;
$LOCAL_PYTHON -m pip install -U clearml-agent ;
cp /root/clearml.conf /root/default_clearml.conf ;
NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --full-monitoring --id some_clearml_id
-- Some more info --

The main problem is ldconfig being put in /etc/profile.
As we use a custom CUDA image, we do not want this running on user login, and we get ugly error messages about missing symlinks.

Also, we have a custom motd that doesn't show due to the custom ClearML SSH installation in the setup steps. Ideally, we would also like SSH to use port 22 rather than 10022.

The only way I see to fix these issues is to modify the interactive_session, or find a way to run another bash entrypoint after the session_task setup.
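
One way to run something after the session setup, sketched here under assumptions: start a background watcher from the entrypoint before exec "$@" takes over. The function name strip_when_added and the temporary stand-in for /etc/profile are hypothetical, and the sketch assumes the setup appends a line that is exactly ldconfig. It is not part of clearml-session.

```shell
# strip_when_added FILE (hypothetical helper): poll until FILE contains a
# line that is exactly "ldconfig", then delete that line in place.
strip_when_added() {
    until grep -qx 'ldconfig' "$1" 2>/dev/null; do
        sleep 1
    done
    # -i edits in place (GNU sed); the anchors leave all other lines alone.
    sed -i '/^ldconfig$/d' "$1"
}

# Demonstration on a temporary stand-in for /etc/profile.
demo_profile="$(mktemp)"
echo 'export EDITOR=vi' > "$demo_profile"

strip_when_added "$demo_profile" &   # start the watcher in the background
watcher_pid=$!

echo 'ldconfig' >> "$demo_profile"   # simulate the session setup step
wait "$watcher_pid"                  # watcher exits once the line is removed

cat "$demo_profile"
```

In a real entrypoint the demonstration lines would be replaced by strip_when_added /etc/profile & followed by exec "$@"; a background job started this way keeps running after exec replaces the shell.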

The session orchestration seems great, and I am excited to show off the whole workflow for the team when I've got it integrated nicely!

Thanks a lot,
Saif 😊

  				
Posted 2 years ago by LackadaisicalOtter14


Answers 10

Hey, thanks for informing me of this.
However, this doesn't help: I need to remove ldconfig from /etc/profile which is put there by the interactive_session_task 😕

  				
Posted 2 years ago by LackadaisicalOtter14

"As we use a custom CUDA image, we do not want this running on user login, and we get ugly error messages about missing symlinks."

You can customize the startup bash script (running inside any container) here:
https://github.com/allegroai/clearml-agent/blob/bf07b7f76d3236c1118b81730c6d9718705a795a/docs/clearml.conf#L145
LackadaisicalOtter14, would that help?

  				
Posted 2 years ago by AgitatedDove14

That would be great! Might have to use 2>/dev/null in some of my bash scripts 😊

One other question regarding connecting. We have set up sshd inside the docker image we are using. I see that when we try to connect over port 22, it forwards to the host machine. I believe this is due to mounting ports on the host, which is possible as the spun-up container has the capabilities:
'--cap-add=net_admin', '--cap-add=sys_module'

Is there a way to disable this behaviour, and let the container run isolated from the host?
We use wireguard to tunnel into the container to port 22 on the same image when not instantiated with ClearML.
Thank you!

  				
Posted 2 years ago by LackadaisicalOtter14

Hey,
Sorry for the delay in replying.
The line causing problems is line 484 in the interactive_session_task:
'echo "ldconfig" >> /etc/profile && '
Because a custom cuda/torch version is being used, when a user logs in they are greeted with:
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-cfg.so.470.63.01 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-compiler.so.470.63.01 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-ml.so.470.63.01 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-opencl.so.470.63.01 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libcuda.so.470.63.01 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.63.01 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-allocator.so.470.63.01 is empty, not checked.
/sbin/ldconfig.real: Can't create temporary cache file /etc/ld.so.cache~: Permission denied
Is it possible to remove this line to stop it from being executed?

  				
Posted 2 years ago by LackadaisicalOtter14

"That would be great! Might have to use 2>/dev/null in some of my bash scripts"

Feel free to test and PR :)

"One other question regarding connecting. We have set up sshd inside the docker image we are using."

Actually, the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port).
It runs an additional sshd inside the docker, setting its port.
The clearml-session will then ssh directly into the container sshd (port 10022). Makes sense?

"Is there a way to disable this behaviour, and let the container run isolated from the host?"

What do you mean by that?

  				
Posted 2 years ago by AgitatedDove14

"I have made a PR request."

Thank you!!! 🎉 We will merge shortly 🙂

  				
Posted 2 years ago by AgitatedDove14

Hi LackadaisicalOtter14

"Is it possible to remove this line to stop it from being executed"

Everything is possible 🙂 I think the main question is why it is there (which, to the best of my understanding, is to account for any CUDA drivers and installed packages, meaning anything that is installed at runtime).
I think we can suppress the error, wdyt?
'echo "ldconfig" 2>/dev/null >> /etc/profile && '
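
A small sketch of where the redirect sits, since it changes what ends up in the file: with 2>/dev/null outside the quotes it applies to echo, and the profile still receives a bare ldconfig; inside the quotes it becomes part of the line written to the profile. The sketch assumes the goal is to silence ldconfig's own messages at login, and it uses a temporary file as a hypothetical stand-in for /etc/profile.

```shell
# Hypothetical stand-in for /etc/profile so the sketch runs unprivileged.
profile="$(mktemp)"

# Redirect outside the quotes: it applies to echo itself, so the file
# receives a bare "ldconfig" and login output is unchanged.
echo "ldconfig" 2>/dev/null >> "$profile"

# Redirect inside the quotes: the file receives "ldconfig 2>/dev/null",
# so ldconfig's stderr is discarded when the profile is sourced at login.
echo "ldconfig 2>/dev/null" >> "$profile"

cat "$profile"
```

Only the second form carries the redirect into the profile itself.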

  				
Posted 2 years ago by AgitatedDove14

"ldconfig from /etc/profile which is put there by the interactive_session_task"

LackadaisicalOtter14, are you sure? Maybe this is done as part of the installation the interactive session runs?
Could that be the issue?
apt-get update && apt-get install -y openssh-server

  				
Posted 2 years ago by AgitatedDove14

Hi LackadaisicalOtter14

"However, whenever we spin up a session, https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L583 always gets run and overwrites our configs"

What do you mean by that?
Which configs are being overwritten? (Generally speaking, it just adds the OS environment it needs for the setup process.)

  				
Posted 2 years ago by AgitatedDove14

Hey,

Do not worry about the SSH problem I mentioned, I understand now, thank you!

Regarding the ldconfig warning suppression, I tested it and it works as expected!
I have made a PR request.

Thanks for your help AgitatedDove14 😊

  				
Posted 2 years ago by LackadaisicalOtter14
