Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Environment setup completed successfully
Starting Task Execution:

Is there any way to figure out why the remote Task hangs, and how would I go about debugging it?

WebApp: 1.15.1-478 • Server: 1.15.1-478 • API: 2.29

Posted 4 months ago
Answers 46


My understanding is that on remote execution Task.init is supposed to be a no-op right?

Not really a no-op: it syncs argparse and the like, starts background reporting services, etc.
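For context, a minimal sketch of that sync (project/task names are placeholders, not from this thread): when the agent runs the cloned Task, Task.init re-applies the stored hyperparameters to argparse instead of the CLI defaults.

    from clearml import Task
    import argparse

    # Placeholder names. Under an agent, the parsed values below come from
    # the cloned Task's stored hyperparameters, not from sys.argv.
    task = Task.init(project_name="demo", task_name="argparse-sync")

    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    args = parser.parse_args()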

This is so odd! literally nothing printed
Can you tell me something about the node "mrl-plswh100:0"?
Is this something like a SageMaker node? We have seen similar cases where Python threads/subprocesses are not supported, and instead of Python crashing, it just hangs there.
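If threads are the suspect, a quick sanity check along these lines (my sketch, nothing ClearML-specific) can be run directly on the node:

    import threading, time

    # If background threads are scheduled normally this prints three
    # heartbeats; on a platform that silently blocks threads, expect
    # silence rather than a crash.
    def beat():
        for _ in range(3):
            print("heartbeat")
            time.sleep(2)

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    t.join(10)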

Posted 4 months ago

I just ran with this in my local task, and all the env vars were printed to the console, but in ClearML they are not in the console log. Presumably that's because they're printed before ClearML starts logging?
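(The actual snippet isn't shown in this thread; an env dump of this kind looks roughly like this:)

    import os

    # Dump the environment before ClearML attaches its console capture;
    # anything printed this early may only reach the local console, not
    # the Task's console log in the UI.
    for key, value in sorted(os.environ.items()):
        print(f"{key}={value}")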

Posted 4 months ago

Okay, I have an idea: it could be a lock that another agent/user is holding on the cache folder, or something similar.
Let me check something.

Posted 4 months ago

How are you starting the agent?

Posted 4 months ago

It’s a Dell XE9680 rack server with 8x H100s located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML), which work fine (I’ll check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions: 3.9 (the Alma default) and 3.11, which you can see in the log above. It’s a really bizarre issue, and outside of print statements I’m not really sure where to look.

You mentioned the argparse sync & reporting, so I’ll try removing Hydra to rule that out, plus the other loggers in PL, and see from there…

Posted 4 months ago

@<1724960464275771392:profile|DepravedBee82> I just realized the agent is not running in docker mode, correct? (i.e. venv mode)
If this is the case, how come it is running as root? (Could it be that it is running inside a container? How was that container spun up?)

Posted 4 months ago

Yes, the agent is running in venv mode AFAIK. As for why it’s running as root, I’ll ask our engineer…

Posted 4 months ago

Nope - confirmed to be running on the OS's Python environment, although he said that the agent was supposed to have its own user. Looking into that now.

Posted 4 months ago

Nope - confirmed to be running on the OS's Python environment,

Okay, so bare-metal root is definitely not recommended.
I'm not sure how/why it gets stuck though 😞
Any chance you can run the agent as non-root?
Docker mode might also be preferable, so it is easier for you to control the environment of the Task.
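For example (queue name is a placeholder), started from a regular user account:

    # --docker makes the agent execute each Task inside a container
    clearml-agent daemon --queue default --docker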

Posted 4 months ago

Please let me know what you find 🤞

Posted 4 months ago

Hmm, no change after adding that, unfortunately (confirmed that the change had been added by clearml-agent config) 😞

Posted 3 months ago

I managed to set up my (Windows) laptop as a worker and reproduce the issue.

Any insight on how we can reproduce the issue?

Posted 3 months ago

Looking at the logs in the Kube pods now for anything that looks unusual...

Posted 3 months ago

confirmed that the change had been added by

Make sure you see them in the Task log in the UI (the agent prints them when it starts).

Any insight on how we can reproduce the issue?

Can this be reproduced with a simple script that we can also run?
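Something minimal along these lines would do (project/task names are placeholders); if the print never reaches the remote console, the hang is inside or right after Task.init:

    from clearml import Task

    task = Task.init(project_name="debug", task_name="hang-repro")
    print("Task.init returned")
    task.close()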

Posted 3 months ago

If there was an SSL issue, it should log to the console, right?

ClearML is hosted on an on-prem Kube cluster, and to get it to log locally I needed to append my company cert to the file located at certifi.where(). Do you think the same needs to be done for the worker's Python installation?
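For reference, the append looked roughly like this (the CA path is a placeholder for our company certificate):

    import certifi
    import shutil

    COMPANY_CA = "/etc/pki/company-root-ca.pem"  # placeholder path, PEM format

    # Append the company CA to the bundle certifi points at, so HTTPS
    # requests to the on-prem server pass certificate verification.
    with open(certifi.where(), "ab") as bundle, open(COMPANY_CA, "rb") as ca:
        shutil.copyfileobj(ca, bundle)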

Posted 4 months ago

If there was an SSL issue, it should log to the console, right?

Correct. Also, the agent is able to report, so I'm assuming the configuration is correct.
@<1724960464275771392:profile|DepravedBee82> could you try putting the clearml import + Task.init at the top of your code?
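i.e. something like this (project/task names are placeholders), before anything else gets imported:

    from clearml import Task

    task = Task.init(project_name="debug", task_name="import-order-test")

    # Only after Task.init, import the rest (Hydra, PyTorch Lightning, ...)
    import hydra
    import pytorch_lightning as pl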

Posted 4 months ago