And do you have any network proxy or load balancer or firewall between the client running clearml-session and the server?
The script I am running is hello.py, with the code "from clearml import Task
task = Task.init(project_name="mlops", task_name="Say Hellow")
task.execute_remotely(queue_name="P2000")
print("Hello")". Console output: "clearml-session --jupyter-lab true --queue P2000 --base-task-id=515159dab92d4baabcb6b3647263a144
clearml-session - CLI for launching JupyterLab / VSCode on a remote machine
Verifying credentials
Use previous queue (resource) 'P2000' [Y]/n? Y
Interactive session config:
{
"base_task_id": "515159dab92d4baabcb6b3647263a144",
"git_credentials": false,
"jupyter_lab": true,
"keepalive": false,
"password": "************",
"queue": "P2000",
"remote_ssh_port": "22",
"username": "mlopsadmin",
"vscode_server": true
}
Launch interactive session [Y]/n? Y
Removing stale interactive sessions
Cloning base session 515159dab92d4baabcb6b3647263a144
Configuring new session
New session created [id=0e9cd1cdbba44fad87e7742a7e25af8f]
Waiting for remote machine allocation [id=0e9cd1cdbba44fad87e7742a7e25af8f]
.Status [queued]
..Status [in_progress] - queued pulled by agent
Remote machine allocated
Setting remote environment [Task id=0e9cd1cdbba44fad87e7742a7e25af8f]
Setup process details: None
Waiting for environment setup to complete [usually about 20-30 seconds, see last log line/s below]
task 0e9cd1cdbba44fad87e7742a7e25af8f pulled from a3039785e5d54587a36a4af3e310bf73 by worker
WORKER:gpu0
- urllib3==2.0.2
Environment setup completed successfully
Starting Task Execution:
ClearML results page: None
HelloProcess completed successfully
ERROR: Remote setup failed (status=completed) see details: None". The task logs are attached.
This means the clearml-session client cannot reach the ClearML server - did you configure the clearml.conf file where you're running the clearml-session CLI?
@<1523701087100473344:profile|SuccessfulKoala55> The agent is running outside Kubernetes, on a standalone VM running Ubuntu 22.04.
I see, the issue is that you're using a --base-task-id, but the base task you're using is your own custom task, which does not have the interactive session settings in it. This is an advanced feature. If you want to use a base task, I'd recommend first starting without it, then examining the task created by the interactive session to figure out what exactly you need.
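One way to do the "examine the task created by the interactive session" step is to compare its parameters against the custom base task. The following is only a rough sketch: it assumes the relevant settings show up in the flattened parameters returned by the SDK's `Task.get_parameters()`, which may not cover everything the session needs. The helper names and the parameter names in the offline demonstration are made up for illustration.

```python
from typing import Dict


def settings_missing_from_base(
    session_params: Dict[str, str], base_params: Dict[str, str]
) -> Dict[str, str]:
    """Return parameters the session task has that the base task lacks."""
    return {
        name: value
        for name, value in session_params.items()
        if name not in base_params
    }


def fetch_parameters(task_id: str) -> Dict[str, str]:
    """Fetch a task's flattened parameters ("Section/name" -> value)."""
    # Imported lazily so the comparison helper works without the SDK installed.
    from clearml import Task

    task = Task.get_task(task_id=task_id)
    return task.get_parameters() or {}


# Offline demonstration with made-up parameter names (hypothetical, not real
# clearml-session settings).
session_example = {"Section/example_setting": "1", "Section/shared": "a"}
base_example = {"Section/shared": "a"}
missing_example = settings_missing_from_base(session_example, base_example)
print(missing_example)
```

With real tasks, you would pass `fetch_parameters(<session task id>)` and `fetch_parameters(<base task id>)` to `settings_missing_from_base` on a machine that has a working clearml.conf.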
@<1523701087100473344:profile|SuccessfulKoala55> When I add the extra index URL, it gives a certificate error, and I am not sure where to configure all of these settings in the agent settings.
OK, so the server is hosted in k8s, where is the agent running?
And what are the details? (The task log)
Looks like the Elasticsearch on your server has some issue, possibly with storage. Can you share the Elasticsearch logs?
@<1523701087100473344:profile|SuccessfulKoala55> Thanks, I will try it and let you know. I have one more question. I have installed the latest version of ClearML server, and now I see an issue with urllib3 v2, which will be fixed next week in the new release. How can I install an older version with the Helm chart that is stable and working?
@<1523701087100473344:profile|SuccessfulKoala55> It was blocked on the load balancer, and after allowing the traffic it is working. Thanks a lot!!
Again, this is a network issue, it might have something to do with the different requests sent by the CLI when you use it this way (larger requests, with payloads, etc.) - are you using some proxy or is the server hosted on a cloud provider?
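To check whether something on the path treats larger requests differently, one option is to POST payloads of increasing size to the API server and see at which size the connection starts being reset. The sketch below is only a diagnostic idea, not how clearml-session itself talks to the server. The URL you pass in is your own API server address, and the function names are made up for illustration. Any HTTP status, even an error status, means the request got through; a reset or timeout suggests it was dropped on the way.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, Iterable


def build_probe_request(url: str, payload_bytes: int) -> urllib.request.Request:
    """Build a JSON POST request whose body carries `payload_bytes` of padding."""
    body = json.dumps({"padding": "x" * payload_bytes}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def probe(
    url: str,
    sizes: Iterable[int] = (0, 1_000, 100_000),
    timeout: float = 10.0,
) -> Dict[int, str]:
    """Send one POST per payload size and record what happened to each."""
    results: Dict[int, str] = {}
    for size in sizes:
        request = build_probe_request(url, size)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[size] = f"HTTP {response.status}"
        except urllib.error.HTTPError as exc:
            # The server answered, so the network path let the request through.
            results[size] = f"HTTP {exc.code}"
        except OSError as exc:
            # Covers connection resets, refused connections, and timeouts.
            results[size] = f"connection problem: {exc}"
    return results
```

Calling `probe("https://api.clearml.example.internal")` (a placeholder address) from the machine running the CLI would return one entry per payload size.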
@<1523701087100473344:profile|SuccessfulKoala55> It’s an on-prem server and a remote agent. Both the remote agent and my machine are on the same network, and I can SSH to the agent from my machine. Do any ports other than SSH need to be open for JupyterLab to work, either from my computer to the agent or from the agent to the ClearML server?
Did you set up SSL termination on the server? How do you access the web UI?
An easier approach might be to inject it into the docker container (using the init bash script), or to prepare it in the image in advance. Something like this in /etc/pip.conf:
[global]
extra-index-url =
cert = /path/to/my/bundle.pem
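If it helps, the pip.conf above could also be generated instead of hand-written, for example while building the image. This is only a minimal sketch using the standard library: the index URL in the example is a placeholder (the original message leaves it blank), and `render_pip_conf` is a made-up helper name.

```python
import configparser
import io


def render_pip_conf(extra_index_url: str, cert_path: str) -> str:
    """Return pip.conf text with a [global] section for the given values."""
    # Interpolation is disabled so a literal '%' in a URL is written unchanged.
    parser = configparser.ConfigParser(interpolation=None)
    parser["global"] = {
        "extra-index-url": extra_index_url,
        "cert": cert_path,
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Placeholder values, not real endpoints.
pip_conf_text = render_pip_conf(
    "https://pypi.example.internal/simple",
    "/path/to/my/bundle.pem",
)
print(pip_conf_text)
```

The resulting text would then be written to /etc/pip.conf inside the container or image.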
@<1523701087100473344:profile|SuccessfulKoala55> Thanks a lot, it worked!!! However, I am getting an error when I open the ClearML web application: Fetch tag failed "Error 0 : You can't write against a read only replica." Do you know if this is a known issue and whether a fix is available for it?
@<1523701087100473344:profile|SuccessfulKoala55> How can I install the latest one? Do you have a link I can refer to?
@<1523701087100473344:profile|SuccessfulKoala55> Yes, we have a load balancer which provides the IP for the ClearML Server, and it works for all operations such as normal task creation and remote training; only clearml-session is not working.
@<1523701087100473344:profile|SuccessfulKoala55> As I mentioned earlier, if I do not specify --base-task-id, the error is as below. @Jake, I am running the command clearml-session --jupyter-lab but getting the error below: "Launch interactive session [Y]/n? Y
Removing stale interactive sessions
Creating new session
Retrying (Retry(total=237, connect=240, read=237, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit
Retrying (Retry(total=236, connect=240, read=236, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit
Retrying (Retry(total=235, connect=240, read=235, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit
Retrying (Retry(total=234, connect=240, read=234, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit"
You'll have to set up the apt repository in the container - see here for example: None
@<1523701087100473344:profile|SuccessfulKoala55> It’s hosted on Kubernetes, behind the ingress controller. I used the Helm chart provided on the ClearML page, with ingress set to true. I can access the web UI from a browser, and currently it is on HTTP only.
Hi @<1562973095227035648:profile|ThoughtfulOctopus83>, if the agent can reach the ClearML Server, it should work. If you have a proxy for pip packages and Ubuntu updates, you'll need to configure an extra index URL for pip (using the agent.package_manager.extra_index_url setting, see here). If you're using the agent in docker mode, then it will try to install Ubuntu packages in the spawned docker container, so you will need to either use a docker image that is already set up with the proxy, or make sure the proxy is set up in the init bash script (under agent.docker_preprocess_bash_script).
@<1523701087100473344:profile|SuccessfulKoala55> When I use docker, I see it go out for NVIDIA, Ubuntu, and pip packages. I can fix pip via the above, but what about NVIDIA and Ubuntu?
The urllib3 issue is not related to the server, it's an SDK issue that was resolved in an RC and also by an official release last night
@<1523701087100473344:profile|SuccessfulKoala55> Yes, I am able to create a ClearML task and run training from the same machine. The error only appears when I start clearml-session. Do I need any special configuration in the clearml.conf file for clearml-session to work? Just to add: when I run the command below, it works and executes the task, but it does not give any interactive Jupyter or VS Code URL.
clearml-session --jupyter-lab true --queue P2000 --base-task-id=515159dab92d4baabcb6b3647263a144 runs the task and at the end gives the error: ERROR: Remote setup failed (status=completed) see details:
clearml-session - CLI for launching JupyterLab / VSCode on a remote machine
Verifying credentials
Use previous queue (resource) 'P2000' [Y]/n? Y
Interactive session config:
{
"base_task_id": "515159dab92d4baabcb6b3647263a144",
"git_credentials": false,
"jupyter_lab": true,
"keepalive": false,
"password": "*********",
"queue": "P2000",
"remote_ssh_port": "22",
"username": "mlopsadmin",
"vscode_server": true
}
Launch interactive session [Y]/n? Y
Removing stale interactive sessions
Cloning base session 515159dab92d4baabcb6b3647263a144
Configuring new session
New session created [id=f17d0e89a3ad43bf93e455a23109ccce]
Waiting for remote machine allocation [id=f17d0e89a3ad43bf93e455a23109ccce]
.Status [queued]
Remote machine allocated
Setting remote environment [Task id=f17d0e89a3ad43bf93e455a23109ccce]
Setup process details: None
Waiting for environment setup to complete [usually about 20-30 seconds, see last log line/s below]
task f17d0e89a3ad43bf93e455a23109ccce pulled from a3039785e5d54587a36a4af3e310bf73 by worker
Worker:gpu0
- urllib3==2.0.2
Environment setup completed successfully
Starting Task Execution:
ClearML results page: None
HelloProcess completed successfully
ERROR: Remote setup failed (status=completed) see details: None
@<1523701087100473344:profile|SuccessfulKoala55> Thanks a lot!!! It's fixed after I redeployed the container. Could you please help me fix clearml-session? I am running the command clearml-session --jupyter-lab but getting the error below: "Launch interactive session [Y]/n? Y
Removing stale interactive sessions
Creating new session
Retrying (Retry(total=237, connect=240, read=237, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit
Retrying (Retry(total=236, connect=240, read=236, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit
Retrying (Retry(total=235, connect=240, read=235, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit
Retrying (Retry(total=234, connect=240, read=234, redirect=240, status=240)) after connection broken by
'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /v2.23/tasks.edit"