Hi AgitatedDove14, I am trying to run clearml-session on my laptop, but it stays at "Waiting for environment setup to complete [usually about 20-30 seconds]" for several minutes. How can I debug and resolve it?
I do not see any error in https://app.community.clear.ml/projects/368fb3c4fcdd419e8b597ed100c29d69/experiments/bf78f1c303c74062986384cd74f0e542/info-output/log?columns=selected&columns=type&columns=name&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=active_duration&order=last_update , and I can access Jupyter, but I do not see any access information for the VSCode server or the SSH server. Not sure what the issue is.
AgitatedDove14 Yes I have an agent running. Otherwise, it would keep running at "Waiting for remote machine allocation . [Status]"
I do not know how to check the TCP connection.
BTW, I just tried the clearml-session command again, and now it stops with the error "docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].". Does this mean I cannot use a remote machine without a GPU?
Okay, I will try it, thank you very much.
I actually read that documentation, but more specifically I need an example of how to do it, if possible. As I mentioned, I tried to run clearml-session but it takes forever and nothing happens.
AgitatedDove14 Hi, for the remote machine, I'm switching to Ubuntu server + docker + NVIDIA GPU instead of Windows. I run clearml-agent with docker on the Ubuntu server.
Now everything looks fine on the server after I started the clearml-session on my laptop, which means SSH/VSCode/Jupyter servers are created and I got the URLs.
However, on my laptop it shows this error:
Remote machine is ready
Setting up connection to remote session
Starting SSH tunnel
ssh: connect to host 172.17.0.2 port 10022: No route to host
Any idea how to resolve it? I don't know why I'm getting 172.17.0.2 while the IP of my remote machine is 10.19.20.15. Is 172.17.0.2 an internal IP that is only accessible from inside the running docker? If so, how do I expose the IP to my laptop?
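On the 172.17.0.2 question: Docker's default bridge network normally uses the 172.17.0.0/16 subnet, so an address in that range is usually a container-internal address that only the docker host itself can route to. Below is a quick sketch using Python's standard ipaddress module to classify the two addresses from this thread. The subnet constant is an assumption about a default Docker install (it can be reconfigured), and the function name is made up for this example.

```python
import ipaddress

# Assumption: default Docker bridge subnet; a custom daemon config may differ.
DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def looks_like_docker_bridge_ip(address: str) -> bool:
    """Return True if the address falls inside the default docker0 subnet."""
    return ipaddress.ip_address(address) in DOCKER_DEFAULT_BRIDGE


# The two addresses mentioned in this thread.
for addr in ("172.17.0.2", "10.19.20.15"):
    print(addr, looks_like_docker_bridge_ip(addr))
```

If the first address reports True and the second False, the session most likely advertised the container's internal address rather than the host's address.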
Hi JumpyDragonfly13
- Is "10.19.20.15" accessible from your machine (i.e. can you ping it)?
- Can you manually SSH to 10.19.20.15 on port 10022?
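For anyone unsure how to check the TCP connection, here is a minimal sketch to run from the laptop, assuming Python 3 is available there. The helper name probe_tcp is made up for this example and is not part of clearml-session; 10022 is the default port mentioned in this thread. It separates "refused" (the host answers but nothing listens on that port) from "unreachable" (no route, timeout, and similar), which correspond to the two ssh errors reported here.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and return 'open', 'refused' or 'unreachable'.

    probe_tcp is a hypothetical helper, not part of clearml-session.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # something accepted the connection
    except ConnectionRefusedError:
        return "refused"  # host answered, but nothing listens on this port
    except OSError:
        return "unreachable"  # no route to host, timeout, DNS failure, ...


# Example call from the laptop (address and port taken from this thread):
# print(probe_tcp("10.19.20.15", 10022))
```

A result of "refused" suggests the machine is reachable but the SSH server for the session is not exposed on that port, while "unreachable" points at routing or firewall issues between the two machines.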
JumpyDragonfly13 try this one (notice no --docker flag). Basically, run the agent in virtual environment mode:
clearml-agent daemon --queue interactive --create-queue
Then from the "laptop" try to get a remote session with:
clearml-session
Hi JumpyDragonfly13, just making sure, do you have an agent running on a remote machine?
Can you open a direct TCP connection to the remote machine (the default port it will use is 10022)?
Hi AgitatedTurtle16
You can find documentation here:
https://github.com/allegroai/clearml-session
Basically it uses clearml-agent to launch a session on one of the machines in the cluster.
In the remote session itself it installs jupyterlab + vscode-server, then it connects to the remote session (running on the agent's machine) automatically over ssh and creates tunnels to these services.
AgitatedDove14 Hi, thanks for the response.
I tried to change the IP address as indicated above, but now clearml-session is showing the error:
ssh: connect to host 10.19.20.15 port 10022: Connection refused
Info to help you reproduce, FYI:
clearml-session: version 0.3.2
Ubuntu: version 20.04.2 LTS
docker specified for the interactive session: nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
command-line used to spin the agent: clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
AgitatedTurtle16 from the screenshot, it seems the Task is stuck in the queue, which means there is no agent running to actually run the interactive session.
Basic setup:
A machine running clearml-agent (this is the "remote machine")
A machine running clearml-session (let's call it laptop 🙂 )
You need to first start the agent on the "remote machine" (basically call clearml-agent daemon --docker --queue default). Once the agent is running on the remote machine, run clearml-session from your laptop, select the default queue (the one the remote machine is listening to), and wait until you get the http links.
AgitatedDove14 Yes thanks, it seems relevant. So, how to run without docker? We'd like to try it without docker first.
Hi JumpyDragonfly13
"I don't know why I'm getting 172.17.0.2"
I think it (the remote jupyter Task) fails to get the correct IP address of the server.
You can manually correct it by going to the DevOps project and looking for the running Task there, then under Configuration/Properties change external_address to the actual IP, 10.19.20.15.
Once that is done, re-run clearml-session; it will offer to connect to the running session, and it should work.
BTW:
I'd like to see if we can fix this issue, and it would help to reproduce the server setup:
clearml-session version?
Ubuntu version?
What's the docker you specified for the interactive session?
What's the command-line you are using to spin the agent?
Hi JumpyDragonfly13
Let's assume we have two machines, one we call remote, one we call laptop (at least for this discussion)
On the Remote machine we need to run (notice we must have docker preinstalled on the remote machine; it can also work without docker, let me know if this is the case for you):
clearml-agent daemon --queue interactive --create-queue --docker
On the Laptop we run:
clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
What clearml-session will do is create a "Task" and enqueue it on the "interactive" queue.
Then the Agent (on the remote machine) will take the "Task", spin the docker, create JupyterLab & VSCode-server inside the docker, and return links for us to connect to the Remote machine (notice the links are http://localhost because they are automatically tunneled over the SSH connection that clearml-session created for us in the background).
Make sense?
Hi AgitatedDove14
I tried the commands you suggested. The first command works fine, but the second command failed with the following message:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
Our remote machine is Windows 10 running docker (with WSL2), which does not seem to support NVIDIA GPUs yet. Is that the reason the second command failed?
We'd like to try the case without docker. Please advise how to do it, thanks!
"Our remote machine is Windows 10"
JumpyDragonfly13 seems like the Windows 10 + docker is the issue (that would explain the OCI error)
Is this relevant ?
https://github.com/microsoft/WSL/issues/5100