My understanding is that on remote execution Task.init is supposed to be a no-op, right?
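For context, this is roughly how I'd expect that to behave - a minimal sketch, assuming Task.running_locally() is the right check (my understanding is it returns False when an agent is executing the script):

from clearml import Task

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")

# Guard anything that should only happen when the task is first
# created locally; on remote execution this branch is skipped
if Task.running_locally():
    task.set_repo(
        repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
    )
    task.set_packages("requirements.txt")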
Can this be reproduced using a simple script that we can also run?
Not really, unfortunately - happy to share my code, but I've managed to reproduce this with different codebases.
As a summary of what I've tried:
- Agent on the H100 machine, Server on Kube - Fail
- Agent on laptop, Server on Kube - Fail
- Agent on laptop, Server on Docker Desktop - Pass
So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube rather than an issue with the agents or code…
Hi @<1523701087100473344:profile|SuccessfulKoala55> thanks for the reply! The output above is from grep -i network /var/log/syslog on the machine running the agent. That's good to hear that ClearML is pretty resilient to network outages 🙂. Do you have any suggestions on how we can start tracking down the cause of this?
This is the only clue that was logged to the console in ClearML Server: 2024-11-21 06:57:13 Process terminated by user. The first errors on the agent logs appea...
Also, is there a way to disable this by default?
The reason I ask is that I want to send many jobs to a queue via the CLI, so I don't really want to be messing around with Task.init().
I've even tried renaming my files to *.pth and *.data to stop this behaviour.
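If it helps show what I mean, this is the direction I was guessing at - a sketch, assuming auto_connect_frameworks in Task.init accepts a per-framework dict of toggles:

from clearml import Task

# Sketch: disable automatic PyTorch model logging up front,
# instead of renaming checkpoint files to dodge the auto-binding
task = Task.init(
    project_name="ClearML Testing",
    task_name="FMNIST",
    auto_connect_frameworks={"pytorch": False},
)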
Hi @<1523701205467926528:profile|AgitatedDove14> , here's my code with some more prints:
from clearml import Task
print("Before Task.init")
task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")
print("After task")
print("Before import")
from pathlib import Path...
Looking at the logs in the Kube pods now for anything that looks unusual...
For reference, the clearml agent is running in its own user profile in Ubuntu 24.04 (so that it doesn't run as root as per previous discussions)
Yes, the agent is running in venv mode afaik. As for why it's running as root - I'll ask our engineer …
Thank you! Although it's still really weird how it was failing silently - would it be worth changing the logging level for that error somewhere?
Hmm, no change after adding that unfortunately (confirmed that the change had been added with clearml-agent config) 🙁
I managed to set up my (Windows) laptop as a worker and reproduce the issue. Would that suggest an issue with ClearML server?
Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc. work, but if there are any obvious things we should check, let me know and I can ask our DevOps engineer
It seems like the worker lost network connectivity, and then aborted the jobs 🙁
2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.
2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)
2024-11-21T06:57:15.810747+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_...
Here's what the agent was logging:
anjum.sayed@M209886 $ clearml-agent --debug daemon --queue default
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool: "PUT /auth.login HTTP/1.1" 200 603
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool:
"PUT /v2.5/queues.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:
...
Which auto_connect_* arg do I use, and what value do I set it to? At the end of my training run I'm making .png plots of everything in my test set, and I don't want these to be logged as artifacts.
It's not covered here either: None
I was hoping something like output_uri=False would work, but looking at the source code, I don't think that would work @<1523701070390366208:profile|CostlyOstrich36>
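For reference, what I'm imagining - a sketch, assuming the matplotlib binding is what picks up the plots:

from clearml import Task

# Sketch: turn off automatic matplotlib capture so end-of-run .png
# plots are not uploaded, while leaving other auto-logging enabled
task = Task.init(
    project_name="ClearML Testing",
    task_name="FMNIST",
    auto_connect_frameworks={"matplotlib": False},
)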
Ah yes you were right, it does still print on remote. Here you go:
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/root', 'LOGNAME': 'root', 'USER': 'root', 'SHELL': '/bin/bash', 'INVOCATION_ID': '2cf51dc43b78470cb14c29f5f653ee18', 'JOURNAL_STREAM': '8:224108', 'SYSTEMD_EXEC_PID': '134947', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARM...
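(For a less noisy dump, a trivial filter I could print instead of the whole environ:)

import os

# Print only the ClearML/TRAINS-related environment variables
for key in sorted(os.environ):
    if key.startswith(("CLEARML", "TRAINS")):
        print(f"{key}={os.environ[key]}")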
Hi @<1523701205467926528:profile|AgitatedDove14> , I reordered the imports:
from clearml import Task
print("Before task")
task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")
print("After task")
print("Before import")
from pathlib import Path
import hydra
import lightning as L
import torch
from coolname import generate_sl...
Thanks for the response @<1523701205467926528:profile|AgitatedDove14>! The code is a small FMNIST test training job written in PyTorch Lightning. On my local job (laptop GPU, Windows) it completes in ~5 min. On the server (Linux, H100s) it just hangs at Starting Task Execution:. Neither of these is in Docker.
I would expect to see the standard PL progress bars outputted to the console, but since nothing is outputted, I'm not sure how to go about debugging this. I've attached the ...
Thanks Martin - will try that and see what I can find. Really appreciate your patience with this! 🙂
If there was an SSL issue it should log to console right?
ClearML is hosted on an on-prem kube cluster, and to get it to log locally I needed to append my company cert to the file located at certifi.where(). Do you think the same needs to be done for the Python installation for the worker?
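For reference, this is roughly what I did on my side - a sketch, where the CA path is just an assumption about where a company cert might live:

import certifi

COMPANY_CA = "/usr/local/share/ca-certificates/company-root.pem"  # assumed path

# Append the company root CA to the certifi bundle that this
# Python environment resolves at runtime
with open(COMPANY_CA, "rb") as src, open(certifi.where(), "ab") as bundle:
    bundle.write(b"\n" + src.read())

I believe setting api.verify_certificate: false in clearml.conf would also sidestep the check, but that disables verification entirely, which we'd rather avoid.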
I think I've found a clue after running with debug:
Before Task.init
Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
Retrying (Retry(total=238, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: ...
It's a Dell XE9680 rack server with 8xH100s which is located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML) which work fine (will check with my team if we've got something to train without Docker). I've also tried different Python versions; 3.9 (Alma default) and 3.11, which you can see in the log above. It's a really bizarre issue, and outside of print statements I'm not really sure where to look.
You mentioned sync argparse...
Nope - confirmed to be running on the OS's Python environment, although he said that the agent was supposed to have its own user - looking into that now
Thanks John, but is there a way to do this via the CLI?
Or is Task.init() the only way?
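The closest I've found in the SDK so far - a sketch using Task.create() plus Task.enqueue(), which as I understand it doesn't require the script itself to call Task.init(); the script name here is an assumption:

from clearml import Task

# Create a draft task pointing at the repo, then push it to the queue
task = Task.create(
    project_name="ClearML Testing",
    task_name="FMNIST",
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test",
    script="train.py",  # assumption: the repo's entry point
    requirements_file="requirements.txt",
)
Task.enqueue(task, queue_name="default")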
I just ran with this in my local task, and all the env vars were printed to console, but in ClearML they are not in the console log. Presumably that's because it's printed before ClearML is logging?
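One thing I might try next - a sketch, assuming the task's logger is attached before the stdout capture kicks in:

import os
from clearml import Task

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")

# Report through the task's logger explicitly so the text lands in the
# console log even if stdout isn't being captured yet
task.get_logger().report_text(repr(dict(os.environ)))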
Hi all, we're still suffering from this issue where jobs are seemingly randomly aborted. The only clue is this in the ClearML logs:
2024-12-13 06:16:30 Process terminated by user
The only pattern we can see is that it typically happens around 6-7am.
Any suggestions on how to debug this would be greatly appreciated!
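What I'm planning to check - a sketch, assuming the status fields on the task object carry the abort reason (the task ID is a placeholder):

from clearml import Task

# Pull one of the aborted tasks and inspect why/when the server
# says its status changed
task = Task.get_task(task_id="<aborted-task-id>")  # placeholder ID
print(task.get_status())          # e.g. "stopped"
print(task.data.status_reason)    # server-side reason string
print(task.data.status_changed)   # timestamp of the change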
He confirmed that it's not inside a container. Trying to figure out why it's running as root, but would it make a difference if it was? Is it better to run the agent from a user profile?
Edit: it might be a container! Just checking now...
I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 🙁
@<1523701205467926528:profile|AgitatedDove14> we've now configured the server to have its own user account to run the agent, so it is no longer running as root, but no luck 🙁
Before os.environ
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/clearml', 'LOGNAME': 'clearml', 'USER': 'clearml', 'SHELL': '/bin/bash', 'INVOCATION_ID': 'da8e36a03c7348efbb7db360755e92b3', 'JOURNAL_STREAM': '8:244189055', 'SYSTEMD_EXEC_P...