Sure. I'm in Europe but we can also test things async.
I actually ran into the exact same problem. The agents aren't hosted on AWS though, just a in-house server.
In the end I forked the clearml-session library and removed mechanisms to access the interactive terminal. I added ipc=host.
There's one identifiable issue with clearml-session+tailscale though - while it does launch the daemon properly, it registers the wrong ip address to the task (sometimes the external ip address even when --external is not passed). At the end of the day, if we know which machine it was launched on, we're able to replace that ip address with a tailscale equivalent and still connect. When ipc=host is active, we're able to query the network interfaces, and if there is a tailscale (typically tailscale0
) network interface, we can query it to get the ip address of that and register it with the task. This could possibly be exposed as an arg in the cli as something like clearml-session --docker .... --tailscale
I'm happy to work on a PR if you are interested
@<1535069219354316800:profile|PerplexedRaccoon19> can you verify the container uses the same docker arg as specified in the previous message?
@<1636537836679204864:profile|RipeOstrich93> , can you make sure that the Additional ClearML Configuration for the autoscaler app includes agent.extra_docker_arguments: ["--ipc=host", ]
?
@<1523701087100473344:profile|SuccessfulKoala55> Could you elaborate? I believe both the ips are visible to the container.
This is making things slightly complicated because now I have to introduce a jumphost for people who aren’t on the same physical network and are on the same tail scale network
Would love to know if there's a fix, it's currently blocking my use (jupyter notebooks hosted on EC2)
I think I'm running into the same issue? Using the webapp and the AWS Autoscaler app. Everything gets started up properly (can be seen in instance/experiment logs) but SSH fails, seemingly timing out. Tried with coreml-session --public-ip True
also, same issue (though the IP that it's attempting to connect to evidently changes)
@<1535069219354316800:profile|PerplexedRaccoon19> the clearml-session uses the ip published on the task by the code running as part of the session task to connect to the session - this is basically an issue of the IP visible from within the container where the session code is running remotely
so the 192.xxxx network is the physical network, and not on the tailscale network
This is the issue
Setting up connection to remote session
Starting SSH tunnel to root@192.168.1.185, port 10022
SSH tunneling failed, retrying in 3 seconds
Hi @<1535069219354316800:profile|PerplexedRaccoon19> can you please elaborate on the issue?