Hi @<1523701205467926528:profile|AgitatedDove14> , here's my code with some more prints:
from clearml import Task
print("Before Task.init")
task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")
print("After task")
print("Before import")
from pathlib import Path
import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig
from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks
print("After import")
I've attached the full log (using RC2). It's still getting stuck at Task.init, which is very weird.
This is so odd,
could you add prints right after the Task.init?
Also could you verify it still gets stuck with the latest RC
clearml==1.16.3rc2
My understanding is that on remote execution Task.init is supposed to be a no-op right?
Not really a no-op: it would sync argparse arguments and the like, start background reporting services, etc.
This is so odd! Literally nothing is printed.
Can you tell me something about the node "mrl-plswh100:0" ?
Is this something like a SageMaker node? We have seen similar cases where Python threads / subprocesses are not supported, and instead of Python crashing it just hangs there.
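One way to check the threads / subprocesses theory on the node is a tiny standalone script, run with the same Python interpreter the agent uses. This is only a sketch using the standard library: the function names and timeouts are made up for illustration, and passing it does not prove Task.init will work, it only rules out the most basic failure.

```python
import subprocess
import sys
import threading


def threads_work(timeout: float = 5.0) -> bool:
    """Return True if a background thread starts and runs within `timeout` seconds."""
    done = threading.Event()
    worker = threading.Thread(target=done.set, daemon=True)
    try:
        worker.start()
    except RuntimeError:
        # Raised when the interpreter cannot start a new thread.
        return False
    return done.wait(timeout)


def subprocesses_work(timeout: float = 30.0) -> bool:
    """Return True if a child Python process runs and exits cleanly within `timeout` seconds."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", "print('child ok')"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "child ok" in result.stdout


if __name__ == "__main__":
    print("threads work:", threads_work(), flush=True)
    print("subprocesses work:", subprocesses_work(), flush=True)
```

If either line prints False (or the script itself hangs), that would point at the node rather than at the training code.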
Hi @<1724960464275771392:profile|DepravedBee82> , can you perhaps add a simple print at the start of your code before any import?
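In addition to the print, the standard-library faulthandler module can dump the stack of every thread if the script is still running after a given time, which would show where inside Task.init it is stuck. A minimal sketch, assuming a 120 second timeout is long enough for a normal start (the helper names and the timeout are hypothetical, not part of ClearML):

```python
import faulthandler
import sys

HANG_TIMEOUT_SECONDS = 120  # assumption: a normal start takes far less than this


def arm_hang_watchdog(timeout: float = HANG_TIMEOUT_SECONDS, stream=None) -> None:
    """Dump all thread tracebacks to `stream` every `timeout` seconds until disarmed."""
    if stream is None:
        stream = sys.stderr
    faulthandler.dump_traceback_later(timeout, repeat=True, file=stream)


def disarm_hang_watchdog() -> None:
    """Cancel the pending traceback dump once the suspicious call has returned."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    print("Script started, before any import", flush=True)
    arm_hang_watchdog()
    # ... call Task.init(...) here, then:
    disarm_hang_watchdog()
    print("Task.init returned", flush=True)
```

The tracebacks go to stderr, so they should end up in the same log as the prints.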
I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞
confirmed that the change had been added by
Make sure you see them in the Task log in the UI (the agent prints them when it starts).
Any insight on how we can reproduce the issue?
Can this be reproducible using a simple script that we can also run?
Hi @<1724960464275771392:profile|DepravedBee82>
After "Starting Task Execution:" the agent will literally start the process running your code.
Can you send the full log of the Task? What is the code doing? Which system is running the agent (i.e. Windows/Mac/Linux, docker, etc.)?
Sorry, on the remote machine (i.e. enqueue it and let the agent run it), this will also log the print 🙂
Will try non-root and get back to you. I’m also trying to reproduce on a different machine too
My understanding is that on remote execution Task.init is supposed to be a no-op right?
Hmm, I can't think of a reason why it would get stuck.
Try removing all the auto loggers; this can be done with
Task.init(..., auto_connect_frameworks=False)
which would disconnect all the automatic loggers (Hydra etc.). Of course, this is for debugging purposes only.
Yes the agent is running in venv mode afaik. As for why it’s running as root - I’ll ask our engineer …
Nope - confirmed to be running on the OS's Python environment, although he said that the agent was supposed to have its own user - looking into that now
This is exactly my problem, too, which I described above! If you find any solution, would be glad if you could share. 🙂 Of course, I also share mine when I get one.
He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?
Edit: it might be a container! Just checking now...
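To settle the root / container question from inside the process itself, something like the following could be run at the very top of the script so the answer shows up in the Task log. It is a rough sketch: the file and cgroup checks are common heuristics rather than a reliable test, and the function names are made up for illustration.

```python
import os
from pathlib import Path
from typing import List


def running_as_root() -> bool:
    """Return True if the effective user id is 0 (always False where geteuid is unavailable)."""
    geteuid = getattr(os, "geteuid", None)
    return geteuid is not None and geteuid() == 0


def container_hints() -> List[str]:
    """Collect heuristic signs that the process is running inside a container."""
    hints = []
    for marker_file in ("/.dockerenv", "/run/.containerenv"):
        if Path(marker_file).exists():
            hints.append(f"{marker_file} exists")
    try:
        cgroup = Path("/proc/1/cgroup").read_text()
    except OSError:
        cgroup = ""  # not Linux, or /proc is not readable
    for keyword in ("docker", "kubepods", "containerd", "lxc"):
        if keyword in cgroup:
            hints.append(f"/proc/1/cgroup mentions {keyword}")
    return hints


if __name__ == "__main__":
    print("running as root:", running_as_root(), flush=True)
    print("container hints:", container_hints() or "none found", flush=True)
```

An empty hint list does not guarantee there is no container, it only means none of these particular markers were found.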