Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Environment setup completed successfully
Starting Task Execution:

Is there any way of figuring out why the remote Task hangs and how would I go about debugging it?

WebApp: 1.15.1-478 • Server: 1.15.1-478 • API: 2.29

  
  
Posted 5 months ago

Answers 46


Hmm, I can't see any reason why it would get stuck.
Try removing all the auto loggers. This can be done with:

Task.init(..., auto_connect_frameworks=False)

This disconnects all the automatic loggers (Hydra etc.). Of course, this is for debugging purposes only.
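As a rough sketch of that suggestion: as far as I know from the ClearML docs, auto_connect_frameworks also accepts a dict, and keys left out of the dict stay enabled. That would let the integrations be switched off one at a time to find the one that hangs. The project and task names are taken from the script posted in this thread; the helper names build_init_kwargs and init_debug_task are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional, Union


def build_init_kwargs(
    disabled_frameworks: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for Task.init with auto-logging reduced.

    disabled_frameworks=None      -> disable every automatic binding
    disabled_frameworks=["hydra"] -> disable only the named bindings
    """
    auto_connect: Union[bool, Dict[str, bool]]
    if disabled_frameworks is None:
        auto_connect = False
    else:
        auto_connect = {name: False for name in disabled_frameworks}
    return {
        "project_name": "ClearML Testing",  # names from the script in this thread
        "task_name": "FMNIST",
        "auto_connect_frameworks": auto_connect,
    }


def init_debug_task(disabled_frameworks: Optional[Iterable[str]] = None):
    """Create the Task with the reduced auto-logging settings."""
    # Imported lazily so this file can be loaded without clearml installed.
    from clearml import Task

    return Task.init(**build_init_kwargs(disabled_frameworks))
```

Calling init_debug_task() switches everything off; init_debug_task(["hydra"]) would switch off only the Hydra binding, assuming "hydra" is the right key for your clearml version.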

  
  
Posted 4 months ago

He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?

Edit: it might be a container! Just checking now...

  
  
Posted 4 months ago

Hi @DepravedBee82, can you perhaps add a simple print at the start of your code, before any import?
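If prints alone don't narrow it down, Python's standard-library faulthandler module can dump the traceback of every thread after a timeout, which shows the line a hung process is sitting on. A minimal sketch; the helper names and the 60-second default are illustrative and not part of ClearML:

```python
import faulthandler
import sys
from typing import IO, Optional


def arm_hang_watchdog(timeout_s: float = 60.0, file: Optional[IO] = None) -> None:
    """Dump all thread tracebacks every timeout_s seconds until disarmed."""
    target = sys.stderr if file is None else file
    # Also report hard crashes (segfaults, aborts) to the same stream.
    faulthandler.enable(file=target)
    # repeat=True keeps dumping, so a long hang leaves several snapshots.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disarm_hang_watchdog() -> None:
    """Stop the periodic dumps, e.g. once training has really started."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()


# Usage: make these the very first lines of the script, before any other import.
# arm_hang_watchdog(timeout_s=60.0)
```

With the watchdog armed first thing, a hang inside Task.init should show up as a repeated traceback in whatever captures the process's stderr.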

  
  
Posted 5 months ago

Can you add this before the Task.init call:

import os
print(os.environ)
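
Dumping the whole environment can put credentials into a shared log, so here is a hedged variant that prints only selected variables and masks anything that looks like a secret. The prefix list and the secret hints are my assumptions about what is relevant here; adjust them as needed.

```python
import os
from typing import Dict, Mapping, Tuple

# Assumed prefixes of interest; not an official or complete list.
PREFIXES: Tuple[str, ...] = ("CLEARML_", "TRAINS_", "CUDA_", "NVIDIA_")
# Variable names containing any of these are masked.
SECRET_HINTS: Tuple[str, ...] = ("KEY", "SECRET", "TOKEN", "PASSWORD")


def select_env(
    env: Mapping[str, str], prefixes: Tuple[str, ...] = PREFIXES
) -> Dict[str, str]:
    """Return the variables matching a prefix, with secret values masked."""
    selected: Dict[str, str] = {}
    for name in sorted(env):
        if not name.startswith(prefixes):
            continue
        is_secret = any(hint in name.upper() for hint in SECRET_HINTS)
        selected[name] = "***" if is_secret else env[name]
    return selected


# flush=True so the lines are written out even if the process hangs right after.
for name, value in select_env(os.environ).items():
    print(f"{name}={value}", flush=True)
```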
  
  
Posted 4 months ago

THAT WORKED! 🎉

  
  
Posted 4 months ago

Hi @AgitatedDove14, I reordered the imports:

from clearml import Task

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks


for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

And here's the output:

Environment setup completed successfully
Starting Task Execution:
Before task

Still looks like it's getting stuck at Task.init
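
For what it's worth, the "Before task" / "After task" prints could be wrapped in a small helper so that each stage reports when it starts and how long it took. A stage that prints a start line but never a done line is where the run is stuck. This is only a sketch; the helper name step is made up for illustration.

```python
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def step(label: str) -> Iterator[None]:
    """Print a start line, run the wrapped block, then print the elapsed time."""
    # flush=True so the start line is not lost in a buffer if the block hangs.
    print(f"[start] {label}", flush=True)
    started = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - started
        print(f"[done]  {label} ({elapsed:.2f}s)", flush=True)


# Example with a stand-in workload; in the real script the block would
# contain the Task.init call or the heavy imports.
with step("example step"):
    time.sleep(0.01)
```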

  
  
Posted 5 months ago

This is so odd.
Could you add prints right after the Task.init call?
Also, could you verify that it still gets stuck with the latest RC:

clearml==1.16.3rc2
  
  
Posted 5 months ago

I managed to set up my (Windows) laptop as a worker and reproduce the issue.

Any insight on how we can reproduce the issue?

  
  
Posted 4 months ago

confirmed that the change had been added by

Make sure you see them in the Task log in the UI (the agent prints them when it starts).

Any insight on how we can reproduce the issue?

Can this be reproducible using a simple script that we can also run?

  
  
Posted 4 months ago

I just ran with this in my local task, and all the env vars were printed to console, but in ClearML they are not in the console log. Presumably that's because it's printed before ClearML is logging?

  
  
Posted 4 months ago

It’s a Dell XE9680 rack server with 8xH100s which is located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML) which work fine (will check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions; 3.9 (Alma default) and 3.11 which you can see in the log above. It’s a really bizarre issue and outside of print statements I’m not really sure where to look.

You mentioned sync argparser & reporting, so I’ll try removing Hydra to rule that out, and other loggers in PL and see from there …

  
  
Posted 4 months ago

Thanks Martin - will try that and see what I can find. Really appreciate your patience with this! 🙂

  
  
Posted 4 months ago

Thanks for the response, @AgitatedDove14! The code is a small FMNIST test training job written in PyTorch Lightning. On my local machine (laptop GPU, Windows) it completes in ~5 min. On the server (Linux, H100s) it just hangs at "Starting Task Execution:". Neither of these runs is in Docker.

I would expect to see the standard PL progress bars printed to the console, but since nothing is output, I'm not sure how to go about debugging this. I've attached the full logs for the local and remote runs.

  
  
Posted 5 months ago

My understanding is that on remote execution Task.init is supposed to be a no-op, right?

  
  
Posted 4 months ago

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc.

The only thing that I can think of is that something is not right with the load balancer on the server, so maybe some requests coming from an instance on the cluster are blocked ...
Hmm, saying that aloud, that actually could be it?! Try adding the following line to the end of the clearml.conf on the machine running the agent:

api.http.default_method: "put"
  
  
Posted 4 months ago