Yes, the agent is running in venv mode afaik. As for why it’s running as root - I’ll ask our engineer …
Ok so my train.py now looks like this:
print("Before import")
from pathlib import Path
import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig
from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks
from clearml import Task
for i in range(torch.cuda.device_count()):
print(torch.cuda.get_device_properties(i).name)
print("Before task")
task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")
print("After task")
And the log looks like this:
Starting Task Execution:
Before import
2024-07-19 09:06:09
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
Before task
So it looks like it's getting stuck at Task.init
Hi @<1523701205467926528:profile|AgitatedDove14> , I reordered the imports:
from clearml import Task
print("Before task")
task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")
print("After task")
print("Before import")
from pathlib import Path
import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig
from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks
for i in range(torch.cuda.device_count()):
print(torch.cuda.get_device_properties(i).name)
And here's the output:
Environment setup completed successfully
Starting Task Execution:
Before task
Still looks like it's getting stuck at Task.init
Hmm, I'm at a loss, there's no reason why it should get stuck.
Try removing all the auto loggers; this can be done with
Task.init(..., auto_connect_frameworks=False)
which disconnects all the automatic loggers (Hydra etc.). Of course, this is just for debugging purposes.
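Something like this, i.e. a minimal sketch using the project/task names from your snippet (debugging only, not a fix):
from clearml import Task

print("Before task")
# auto_connect_frameworks=False disables all the automatic framework bindings
# (Hydra, PyTorch Lightning, etc.) so we can see if one of them is the culprit
task = Task.init(
    project_name="ClearML Testing",
    task_name="FMNIST",
    auto_connect_frameworks=False,
)
print("After task")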
Ah yes you were right, it does still print on remote. Here you go:
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/root', 'LOGNAME': 'root', 'USER': 'root', 'SHELL': '/bin/bash', 'INVOCATION_ID': '2cf51dc43b78470cb14c29f5f653ee18', 'JOURNAL_STREAM': '8:224108', 'SYSTEMD_EXEC_PID': '134947', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARML_CONFIG_FILE': '/tmp/.clearml_agent.vw6k62pq.cfg', 'TRAINS_CONFIG_FILE': '/tmp/.clearml_agent.vw6k62pq.cfg', 'CLEARML_TASK_ID': 'b0abe1da01bd4539a8e06699141c893a', 'TRAINS_TASK_ID': 'b0abe1da01bd4539a8e06699141c893a', 'CLEARML_LOG_LEVEL': 'INFO', 'TRAINS_LOG_LEVEL': 'INFO', 'CLEARML_LOG_TASK_TO_BACKEND': '0', 'TRAINS_LOG_TASK_TO_BACKEND': '0', 'PYTHONPATH': '/root/.clearml/venvs-builds/3.9/task_repository/ml-queue-test:/root/.clearml/venvs-builds/3.9/task_repository/ml-queue-test::/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/root/.clearml/venvs-builds/3.9/lib64/python3.9/site-packages:/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages'})
Here's what the agent was logging:
anjum.sayed@M209886 clearml-agent --debug daemon --queue default
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool:
"PUT /auth.login HTTP/1.1" 200 603
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool:
"PUT /v2.5/queues.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:
"PUT /v2.5/queues.get_all HTTP/1.1" 200 332
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): updates.clear.ml:443
DEBUG:clearml_agent.session:Run by interpreter: C:\Users\anjum.sayed\AppData\Local\Programs\Python\Python39\python.exe
Current configuration (clearml_agent v1.8.1, location: C:\Users\anjum.sayed/clearml.conf):
----------------------
agent.worker_id =
agent.worker_name = M209886
agent.force_git_ssh_protocol = true
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >\= '3.10'
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = nvidia
agent.package_manager.conda_channels.3 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.venvs_dir = C:/Users/anjum.sayed/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = ~/.clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = C:/Users/anjum.sayed/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = C:/Users/anjum.sayed/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = C:/Users/anjum.sayed/.clearml/pip-cache
agent.docker_apt_cache = C:/Users/anjum.sayed/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
agent.enable_task_env = false
agent.sanitize_config_printout = ****
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.disable_task_docker_override = false
agent.git_user =
agent.git_pass = ****
agent.git_host =
agent.debug = true
agent.default_python = 3.9
agent.cuda_version = 123
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.http.default_method = put
api.auth.token_expiration_threshold_sec = ****
api.api_server =
api.web_server =
api.files_server =
api.credentials.access_key = 1N33K4IXUYO64HVT4S3PXVDIX4K2CS
api.credentials.secret_key = ****
api.host =
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.network.file_upload_retries = 3
sdk.aws.s3.key =
sdk.aws.s3.secret = ****
sdk.aws.s3.region =
sdk.aws.s3.use_credentials_chain = false
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.aws.boto3.multipart_threshold = 8388608
sdk.aws.boto3.multipart_chunksize = 8388608
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
sdk.development.worker.report_event_flush_threshold = 100
sdk.development.worker.console_cr_flush_period = 10
sdk.apply_environment = false
sdk.apply_files = false
DEBUG:clearml_agent.commands.worker:starting resource monitor thread
Worker "M209886:0" - Listening to queues:
+----------------------------------+---------+-------+
| id | name | tags |
+----------------------------------+---------+-------+
| 3e9973e15a6048c5ae5419ea7d097f9c | default | |
+----------------------------------+---------+-------+
DEBUG:urllib3.connectionpool:
"PUT /workers.register HTTP/1.1" 200 278
Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to C:\Users\ANJUM~1.SAY\AppData\Local\Temp\.clearml_agent_daemon_outg5aq488v.txt
DEBUG:urllib3.connectionpool:
"PUT /v2.5/queues.get_all HTTP/1.1" 200 337
DEBUG:urllib3.connectionpool:
"PUT /workers.get_runtime_properties HTTP/1.1" 404 371
DEBUG:urllib3.connectionpool:
"PUT /v2.14/queues.get_next_task HTTP/1.1" 200 282
.................. truncating due to Slack char limit.........
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.5/tasks.ping HTTP/1.1" 200 271
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /workers.status_report HTTP/1.1" 200 283
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /workers.status_report HTTP/1.1" 200 283
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.5/tasks.ping HTTP/1.1" 200 271
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:
"PUT /v2.5/tasks.get_by_id HTTP/1.1" 200 3490
DEBUG:urllib3.connectionpool:
"PUT /v2.5/tasks.stopped HTTP/1.1" 200 304
INFO:clearml_agent.commands.worker:Task process terminated
INFO:clearml_agent.commands.worker:Task interrupted: stopping
DEBUG:urllib3.connectionpool:
"POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:
"PUT /v2.5/tasks.stopped HTTP/1.1" 200 333
DEBUG:urllib3.connectionpool:
"PUT /workers.status_report HTTP/1.1" 200 283
DEBUG:urllib3.connectionpool:
"PUT /v2.5/queues.get_all HTTP/1.1" 200 337
DEBUG:urllib3.connectionpool:
"PUT /v2.14/queues.get_next_task HTTP/1.1" 200 282
DEBUG:urllib3.connectionpool:
"PUT /workers.unregister HTTP/1.1" 200 280
DEBUG:urllib3.connectionpool:
"PUT /workers.unregister HTTP/1.1" 200 280
It’s a Dell XE9680 rack server with 8x H100s located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML) and they work fine (will check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions: 3.9 (Alma default) and 3.11, which you can see in the log above. It’s a really bizarre issue, and outside of print statements I’m not really sure where to look.
You mentioned sync argparser & reporting, so I’ll try removing Hydra to rule that out, as well as the other loggers in PL, and see from there …
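e.g. I'm thinking of narrowing it down with something like this (assuming the auto_connect_arg_parser flag and the dict form of auto_connect_frameworks work the way I understand them):
from clearml import Task

# Sketch only: turn off the argparser sync and the Hydra binding individually,
# leaving everything else on, to see which hook (if any) is involved in the hang
task = Task.init(
    project_name="ClearML Testing",
    task_name="FMNIST",
    auto_connect_arg_parser=False,
    auto_connect_frameworks={"hydra": False},
)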
Hi @<1523701205467926528:profile|AgitatedDove14> , here's my code with some more prints:
from clearml import Task
print("Before Task.init")
task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")
print("After task")
print("Before import")
from pathlib import Path
import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig
from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks
print("After import")
I've attached the full log (using RC2). Still getting stuck at Task.init - very weird.
I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞
This is so odd, could you add prints right after the Task.init?
Also could you verify it still gets stuck with the latest RC
clearml==1.16.3rc2
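Maybe also print the installed version at the top of the script, so the remote log confirms which SDK actually ran, e.g. (assuming the package exposes __version__ as usual):
import clearml
print("clearml version:", clearml.__version__)  # should read 1.16.3rc2 on the worker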
Hi @<1724960464275771392:profile|DepravedBee82>
After
Starting Task Execution:
it will literally start the process running your code.
Can you send the full log of the Task? What is the code doing? Which system is running the agent (i.e. Windows/Mac/Linux, docker, etc.)?
Will try non-root and get back to you. I’m also trying to reproduce on a different machine.
Nope - confirmed to be running on the OS's Python environment,
okay, so bare-metal root is definitely not recommended.
I'm not sure how/why it gets stuck though 😞
Any chance you can run the agent as non-root?
Also, docker mode would maybe be preferred, so it is easier for you to control the environment of the Task
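For example, something along these lines (using the default image from your config, adjust as needed):
clearml-agent daemon --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04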
My money is on the Redis container although comparing the logs between Kube & Docker Desktop, nothing looks out of the ordinary...
Nope - confirmed to be running on the OS's Python environment, although he said that the agent was supposed to have its own user - looking into that now
Can you add this before the Task.init:
import os
print(os.environ)
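i.e. just before the Task.init call, something like:
import os
print(os.environ)  # dump the environment the agent spawned the process with

from clearml import Task
task = Task.init(project_name="ClearML Testing", task_name="FMNIST")  # rest unchanged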
@<1724960464275771392:profile|DepravedBee82> I just realized, the agent is Not running in docker mode, correct? (i.e. venv mode)
If this is the case, how come it is running as root? (Could it be it is running inside a container? How was that container spun up?)
My understanding is that on remote execution Task.init is supposed to be a no-op, right?
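e.g. I'd have expected a guard like this to be redundant on the agent (assuming Task.running_locally() behaves the way I think it does):
from clearml import Task

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")

# Only needed when the task is created locally; when the agent re-executes it,
# the repo and packages should already be recorded on the task
if Task.running_locally():
    task.set_repo(
        repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
    )
    task.set_packages("requirements.txt")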
Okay, I have an idea: it could be a lock that another agent/user is holding on the cache folder or similar.
Let me check something
Thank you! Although it's still really weird how it was failing silently - would it be worth changing the logging level for that error somewhere?
confirmed that the change had been added by
Make sure you see them in the Task log in the UI (the agent prints them when it starts)
Any insight on how we can reproduce the issue?
Can this be reproduced using a simple script that we can also run?
He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?
Edit: it might be a container! Just checking now...
Thanks Martin - will try that and see what I can find. Really appreciate your patience with this! 🙂
This is exactly my problem too, which I described above! If you find a solution, I'd be glad if you could share it. 🙂 Of course, I'll also share mine when I find one.
Hi @<1724960464275771392:profile|DepravedBee82> , can you perhaps add a simple print at the start of your code before any import?
Our server is deployed on a kube cluster. I'm not too clear on how the Helm charts etc. are set up.
The only thing that I can think of is that something is not right with the load balancer on the server, so maybe some requests coming from an instance on the cluster are blocked ...
Hmm, saying that out loud, that actually could be it?! Try to add the following line to the end of the clearml.conf on the machine running the agent:
api.http.default_method: "put"
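i.e. the tail of that clearml.conf would end up with something like (everything else left as-is):
# make PUT the default HTTP method for API calls - just a guess that something
# in the path (e.g. the load balancer) mishandles the default method
api.http.default_method: "put"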