Hi JitteryCoyote63 , can I assume you can ssh into the machine directly?
ssh my-instance
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:O2++ST5lAGVoredT1hqlAyTowgNwlnNRJrwE8cbMLo0.
Please contact your system administrator.
Add correct host key in /Users/H4dr1en/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/H4dr1en/.ssh/known_hosts:81
Host key for 10.105.1.77 has changed and you have requested strict checking.
Host key verification failed.
But after that you're connected to the machine and can work on it?
This message appears when I try to ssh directly into the instance.
If I don't start clearml-session, I can easily connect to the agent, so clearml-session is doing something that messes up the ssh config and prevents me from connecting to the agent over ssh afterwards.
In other words, I can no longer ssh to the agent after starting clearml-session on it.
CostlyOstrich36 How is clearml-session setting the ssh config?
JitteryCoyote63 this is the standard procedure for removing a stale host key from known_hosts:
https://superuser.com/a/30089
Specifically, you can try: ssh-keygen -R 10.105.1.77
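For reference, here is a small sketch of that cleanup as a reusable helper. It only relies on standard OpenSSH ssh-keygen options (-F to search, -R to remove, -f to pick the file); the function name forget_host and the optional second argument are my own illustration, not something clearml-session provides.

```shell
#!/bin/sh
# Sketch: drop a stale host key so the next ssh connection can record the new one.
# forget_host is a hypothetical helper name; -F, -R and -f are standard ssh-keygen options.
forget_host() {
    host="$1"
    known_hosts="${2:-$HOME/.ssh/known_hosts}"

    # -F searches the file for the host and exits non-zero when nothing matches.
    if ssh-keygen -F "$host" -f "$known_hosts" >/dev/null 2>&1; then
        # -R removes every key recorded for the host; ssh-keygen keeps a backup as <file>.old.
        ssh-keygen -R "$host" -f "$known_hosts" >/dev/null 2>&1 &&
            echo "removed $host from $known_hosts"
    else
        echo "no entry for $host in $known_hosts"
    fi
}

# Usage with the address from the warning above:
#   forget_host 10.105.1.77
```

After the entry is gone, the next ssh to that address should prompt you to accept the new host key, so it is worth checking the fingerprint before answering yes.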
AgitatedDove14 I see https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21= that a key pair is hardcoded in the repo. Is it being used to ssh to the instance?
It is used by the SSH client so that it "knows" the SSH server (does that make sense?)
AgitatedDove14 Yes, with the command you shared I can now ssh manually to the agent again, but clearml-agent will still raise the same error
which one?
sorry, the clearml-session. The error is the one I shared at the beginning of this thread
JitteryCoyote63 are you running the agent in docker mode ?
This is the reason you are getting an error 🙂
Basically, the session asks the agent to set up a new SSH server with credentials on the remote machine. This is not an issue inside a container, since that is an isolated environment, but when running in venv mode the user running the agent is not root, so it cannot spin up or configure an SSH server.
Does that make sense?
I understand, but then why is docker mode an option of the CLI if we always have to use it for this to work?
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
This is a good point! I'll make sure we stress it (BTW: it will work with elevated credentials, but probably not recommended)
(BTW: it will work with elevated credentials, but probably not recommended)
What does that mean? I'm not sure I understand.
Does the agent install the nvidia-container toolkit, so that the GPUs of the instance can be accessed from inside the docker container running JupyterLab?
No, that is a prerequisite of the docker service installed on the host machine (where the agent is running)
Basically follow: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
https://docs.docker.com/compose/gpu-support/
AgitatedDove14 in the docs (https://clear.ml/docs/latest/docs/apps/clearml_session/#running-in-docker) there is a --docker option. That's what confuses me, since the agent should always run in docker mode
Yes, the agent's mode is global, i.e. all tasks run either inside docker or in a venv. In theory you can have two agents on the same machine, one in venv mode and one in docker mode, each listening to a different queue.
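To make the two-agent idea concrete, here is a rough sketch. The queue names venv_queue and docker_queue and the helper agent_cmd are made up for illustration; the flags (daemon, --queue, --docker, --detached) are regular clearml-agent options. The script only prints the commands, so nothing is started until you run them yourself.

```shell
#!/bin/sh
# Sketch: one machine, two agents, one per execution mode.
# Queue names are hypothetical placeholders; create matching queues on your server first.
VENV_QUEUE="venv_queue"
DOCKER_QUEUE="docker_queue"

# agent_cmd QUEUE MODE prints the clearml-agent command for that queue.
# MODE is "docker" or anything else for venv mode (hypothetical helper).
agent_cmd() {
    if [ "$2" = "docker" ]; then
        # --docker with no image falls back to the default image from the agent configuration.
        echo "clearml-agent daemon --queue $1 --docker --detached"
    else
        echo "clearml-agent daemon --queue $1 --detached"
    fi
}

# Dry run: show both commands instead of starting the daemons.
agent_cmd "$VENV_QUEUE" venv
agent_cmd "$DOCKER_QUEUE" docker
```

With a setup like this, a session would be sent to the docker-mode agent by selecting its queue, e.g. clearml-session --queue docker_queue, while regular venv tasks keep going to the other queue.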