Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Environment setup completed successfully
Starting Task Execution:

Is there any way of figuring out why the remote Task hangs and how would I go about debugging it?

WebApp: 1.15.1-478 • Server: 1.15.1-478 • API: 2.29

  
  
Posted one month ago
Votes Newest

Answers 46


Ah yes you were right, it does still print on remote. Here you go:

environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/root', 'LOGNAME': 'root', 'USER': 'root', 'SHELL': '/bin/bash', 'INVOCATION_ID': '2cf51dc43b78470cb14c29f5f653ee18', 'JOURNAL_STREAM': '8:224108', 'SYSTEMD_EXEC_PID': '134947', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARML_CONFIG_FILE': '/tmp/.clearml_agent.vw6k62pq.cfg', 'TRAINS_CONFIG_FILE': '/tmp/.clearml_agent.vw6k62pq.cfg', 'CLEARML_TASK_ID': 'b0abe1da01bd4539a8e06699141c893a', 'TRAINS_TASK_ID': 'b0abe1da01bd4539a8e06699141c893a', 'CLEARML_LOG_LEVEL': 'INFO', 'TRAINS_LOG_LEVEL': 'INFO', 'CLEARML_LOG_TASK_TO_BACKEND': '0', 'TRAINS_LOG_TASK_TO_BACKEND': '0', 'PYTHONPATH': '/root/.clearml/venvs-builds/3.9/task_repository/ml-queue-test:/root/.clearml/venvs-builds/3.9/task_repository/ml-queue-test::/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/root/.clearml/venvs-builds/3.9/lib64/python3.9/site-packages:/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages'})
  
  
Posted one month ago

Hi @<1523701205467926528:profile|AgitatedDove14> , here's my code with some more prints:

from clearml import Task

print("Before Task.init")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

print("After import")

I've attached the full log (using RC2). Still getting stuck at Task.init - very weird

  
  
Posted one month ago

Hmm, I'm without, no reason why it will get stuck .
Removing all the auto loggers, this can be done with

Task.init(..., auto_connect_frameworks=False)

which would disconnect all the automatic loggers (Hydra etc) off course this is for debugging purposes

  
  
Posted one month ago

He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?

Edit: it might be a container! Just checking now...

  
  
Posted one month ago

Okay I have an idea, it could be a lock that another agent/user is holding on the cache folder or similar
Let me check something

  
  
Posted one month ago

My understanding is that on remote execution Task.init is supposed to be a no-op right?

  
  
Posted one month ago

@<1724960464275771392:profile|DepravedBee82> I just realized, the agent is Not running in docker mode, correct? (i.e. venv mode)
If this is the case how come it is running as root? (could it be is is running inside a container? how was that container spinned?)

  
  
Posted one month ago

Will try non-root and get back to you. I’m also trying to reproduce on a different machine too

  
  
Posted one month ago

I just ran with this in my local task, and all the env vars were printed to console, but in ClearML they are not in the console log. Presumably that's because it's printed before ClearML is logging?

  
  
Posted one month ago

Yes the agent is running in venv mode afaik. As for why it’s running as root - I’ll ask our engineer …

  
  
Posted one month ago

Please let me know what you find 🤞

  
  
Posted one month ago

Hi @<1523701205467926528:profile|AgitatedDove14> , I reordered the imports:

from clearml import Task

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks


for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

And here's the output:

Environment setup completed successfully
Starting Task Execution:
Before task

Still looks like it's getting stuck at Task.init

  
  
Posted one month ago

My understanding is that on remote execution Task.init is supposed to be a no-op right?

Not really a no-op, it would sync Argpasrer and the like, start background reporting services etc.

This is so odd! literally nothing printed
Can you tell me something about the node "mrl-plswh100:0" ?
is this like a sagemaker node? we have seen things similar where Python threads / subprocesses are not supported and instead of python crashing it just hangs there

  
  
Posted one month ago

Sorry, on the remote machine (i.e. enqueue it and let the agent run it), this will also log the print 🙂

  
  
Posted one month ago

I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞

  
  
Posted one month ago

How are you starting the agent?

  
  
Posted one month ago

If there was an SSL issue it should log to console right?

ClearML is hosted on an on-prem kube cluster and to get it to log locally I needed to append my company cert to the file located at certifi.where() . Do you think the same needs to be done for the Python installation for the worker?

  
  
Posted one month ago

Can this be reproducible using a simple script that we can also run?

Not really unfortunately - happy to share my code, but I've managed to reproduce this with different codebases.

As a summary of what I've tried:

  • Agent on the H100 machine, Server on Kube - Fail
  • Agent on laptop, Server on Kube - Fail
  • Agent on laptop, Server on Docker Desktop - Pass
    So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube rather than an issue with the agents or code. As for which of the 7 containers could be at fault... :man-shrugging: . I'm not seeing anything out of the ordinary in the logs. Is there a verbose setting in the agent that could help us diagnose, i.e. each step of what goes on in Task.init ?
  
  
Posted one month ago

Although it's still really weird how it was failing silently

totally agree, I think the main issue was the agent had the correct configuration, but the container / env the agent was spinning was missing it,
I'll double check how come it did not print anything

  
  
Posted one month ago

Thank you! Although it's still really weird how it was failing silently - would it be worth changing the logging level for that error somewhere?

  
  
Posted one month ago

I think I've found a clue after running with debug:

Before Task.init
Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
Retrying (Retry(total=238, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:07
Retrying (Retry(total=237, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:12
Retrying (Retry(total=236, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:33
Retrying (Retry(total=235, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:21:03
Retrying (Retry(total=234, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:22:08
Retrying (Retry(total=233, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login

So my current theory is that the env the agent is building doesn't have the corporate TLS/SSL certificates. It's weird how it was failing silently without the --debug flag though...

  
  
Posted one month ago

Hi @<1724960464275771392:profile|DepravedBee82>
After

Starting Task Execution:

It will literally start the process running your code,
Can you send the full log of the Task? what is the code doing? which system is running the agent (i.e. Windows/Mac/Linux docker etc)

  
  
Posted one month ago

This is exactly my problem, too, which I described above! If you find any solution, would be glad if you could share. 🙂 Of course, I also share mine when I get one.

  
  
Posted one month ago

None

  
  
Posted one month ago

Looking at the logs in the Kube pods now for anything that looks unusual...

  
  
Posted one month ago

Thanks Martin - will try that and see what I can find. Really appreciate your patience with this! 🙂

  
  
Posted one month ago

Nope - confirmed to be running on the OS's Python environment, although he said that the agent was supposed to have it's own user - looking into that now

  
  
Posted one month ago

confirmed that the change had been added by

Make sure you see them in the Task log in the UI (the agent print it when it starts)

Any insight on how we can reproduce the issue?

Can this be reproducible using a simple script that we can also run?

  
  
Posted one month ago

Hi @<1724960464275771392:profile|DepravedBee82> , can you perhaps add a simple print at the start of your code before any import?

  
  
Posted one month ago

THAT WORKED! 🎉

  
  
Posted one month ago
765 Views
46 Answers
one month ago
one month ago
Tags