Discovered An Issue With Clearml-Session Where We Have The Agents Running Within A Tailscale Network. When The Clearml Session Is Local On The Same Physical Network, Connections Work Fine. But When We Are On The Virtual Network, They Don't Work

Answered

Discovered an issue with clearml-session where we have the agents running within a Tailscale network.

When the clearml-session client is on the same physical network as the agents, connections work fine. But when we connect over the virtual (Tailscale) network, they don't work.

  				
Posted 
	one year ago

PerplexedRaccoon19

Answers 14

I actually ran into the exact same problem. The agents aren't hosted on AWS though, just an in-house server.

  				
Posted 
	one year ago

EnthusiasticCow4

Hi @PerplexedRaccoon19, can you please elaborate on the issue?

  				
Posted 
	one year ago

CostlyOstrich36

Want to work on it together?

  				
Posted 
	one year ago

PerplexedRaccoon19

Sure. I'm in Europe but we can also test things async.

  				
Posted 
	one year ago

EnthusiasticCow4

@RipeOstrich93, can you make sure that the Additional ClearML Configuration for the autoscaler app includes agent.extra_docker_arguments: ["--ipc=host", ]?

  				
Posted 
	one year ago

SuccessfulKoala55

In the end I forked the clearml-session library, removed the mechanisms for accessing the interactive terminal, and added ipc=host.

There's one identifiable issue with clearml-session + Tailscale, though: while it launches the daemon properly, it registers the wrong IP address on the task (sometimes the external IP address, even when --external is not passed). As long as we know which machine the session was launched on, we can replace that IP address with its Tailscale equivalent and still connect. When ipc=host is active, we can query the network interfaces and, if a Tailscale interface exists (typically tailscale0), read its IP address and register that with the task. This could be exposed as a CLI argument, something like clearml-session --docker .... --tailscale
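The interface lookup described above could look roughly like the sketch below. It assumes a Linux host where the `ip` command is available and the interface is visible from wherever the code runs; the function names are hypothetical and are not part of clearml-session, and how the resulting address gets registered on the task is not shown.

```python
import re
import subprocess
from typing import Optional

# Matches the IPv4 address in one line of `ip -o -4 addr show` output, e.g.
# "5: tailscale0    inet 100.101.102.103/32 scope global tailscale0 ..."
_INET_RE = re.compile(r"\binet\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+")


def parse_ipv4(ip_addr_output: str) -> Optional[str]:
    """Return the first IPv4 address found in `ip -o -4 addr show` output."""
    match = _INET_RE.search(ip_addr_output)
    return match.group(1) if match else None


def interface_ipv4(interface: str = "tailscale0") -> Optional[str]:
    """Return the IPv4 address of a Linux network interface, or None if unavailable."""
    try:
        result = subprocess.run(
            ["ip", "-o", "-4", "addr", "show", "dev", interface],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # The `ip` tool is not installed on this machine.
        return None
    if result.returncode != 0:
        # The interface does not exist or cannot be queried.
        return None
    return parse_ipv4(result.stdout)


def address_to_register(detected_ip: str, interface: str = "tailscale0") -> str:
    """Prefer the Tailscale address when the interface exists (hypothetical helper)."""
    return interface_ipv4(interface) or detected_ip
```

If no tailscale0 interface is found, address_to_register simply falls back to whatever address was detected originally, so behavior outside a Tailscale network would be unchanged.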

I'm happy to work on a PR if you are interested

  				
Posted 
	one year ago

PerplexedRaccoon19

And where is the agent running?

  				
Posted 
	one year ago

CostlyOstrich36

@SuccessfulKoala55 Could you elaborate? I believe both IPs are visible to the container.

This is making things slightly complicated, because now I have to introduce a jump host for people who are on the same Tailscale network but not on the same physical network.

  				
Posted 
	one year ago

PerplexedRaccoon19

Would love to know if there's a fix; it's currently blocking my use case (Jupyter notebooks hosted on EC2).

  				
Posted 
	one year ago

RipeOstrich93

@PerplexedRaccoon19 clearml-session connects to the session using the IP that the session task publishes on the task; that IP is reported by the code running remotely as part of the session task. So this is basically an issue of which IP is visible from within the container where the session code is running.

  				
Posted 
	one year ago

SuccessfulKoala55

This is the issue

Setting up connection to remote session
Starting SSH tunnel to root@192.168.1.185, port 10022
SSH tunneling failed, retrying in 3 seconds
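
A quick way to confirm that the failure is a routing problem rather than an SSH problem is to test plain TCP reachability from the client machine. This is only a diagnostic sketch using the Python standard library; the helper name is hypothetical, and the host and port come from the log above.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return False
```

Calling can_reach("192.168.1.185", 10022) from a machine that is only on the Tailscale network would be expected to return False, while the same call against the agent machine's Tailscale address should succeed if the SSH port is actually exposed.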

  				
Posted 
	one year ago

PerplexedRaccoon19

So the 192.x.x.x address is on the physical network, not on the Tailscale network.
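
To tell at a glance which kind of address a task has registered, a small classifier can help. The sketch below assumes Tailscale's default addressing, where nodes get IPv4 addresses from the 100.64.0.0/10 range, and treats the RFC 1918 private ranges as "physical LAN"; the function name and labels are hypothetical.

```python
import ipaddress

# Tailscale assigns node IPv4 addresses from this range by default.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")

# RFC 1918 private ranges, typically used on physical LANs.
LAN_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def classify(address: str) -> str:
    """Label an IP address as 'tailscale', 'lan', or 'other'."""
    ip = ipaddress.ip_address(address)
    if ip in TAILSCALE_RANGE:
        return "tailscale"
    if any(ip in network for network in LAN_RANGES):
        return "lan"
    return "other"
```

With this, classify("192.168.1.185") gives "lan", which matches the address in the failing SSH tunnel log.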

  				
Posted 
	one year ago

PerplexedRaccoon19

I think I'm running into the same issue? I'm using the webapp and the AWS Autoscaler app. Everything starts up properly (as seen in the instance/experiment logs), but SSH fails, seemingly timing out. I also tried clearml-session --public-ip True, with the same result (though the IP it attempts to connect to does change).

  				
Posted 
	one year ago

RipeOstrich93

@PerplexedRaccoon19 can you verify that the container uses the same docker argument as specified in the previous message?

  				
Posted 
	one year ago

SuccessfulKoala55
