Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Profile picture
DepravedBee82
Moderator
1 Question, 26 Answers
  Active since 19 July 2024
  Last activity one month ago

Reputation

0

Badges 1

26 × Eureka!
0 Votes
46 Answers
764 Views
0 Votes 46 Answers 764 Views
Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but...
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Ah yes you were right, it does still print on remote. Here you go:

environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/root', 'LOGNAME': 'root', 'USER': 'root', 'SHELL': '/bin/bash', 'INVOCATION_ID': '2cf51dc43b78470cb14c29f5f653ee18', 'JOURNAL_STREAM': '8:224108', 'SYSTEMD_EXEC_PID': '134947', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARM...
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Hi @<1523701205467926528:profile|AgitatedDove14> , here's my code with some more prints:

from clearml import Task

print("Before Task.init")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path...
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?

Edit: it might be a container! Just checking now...

one month ago
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

I just ran with this in my local task, and all the env vars were printed to console, but in ClearML they are not in the console log. Presumably that's because it's printed before ClearML is logging?

one month ago
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Hi @<1523701205467926528:profile|AgitatedDove14> , I reordered the imports:

from clearml import Task

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_sl...
one month ago
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

If there was an SSL issue it should log to console right?

ClearML is hosted on an on-prem kube cluster and to get it to log locally I needed to append my company cert to the file located at certifi.where() . Do you think the same needs to be done for the Python installation for the worker?

one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Can this be reproducible using a simple script that we can also run?

Not really unfortunately - happy to share my code, but I've managed to reproduce this with different codebases.

As a summary of what I've tried:

  • Agent on the H100 machine, Server on Kube - Fail
  • Agent on laptop, Server on Kube - Fail
  • Agent on laptop, Server on Docker Desktop - Pass
    So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube rather than an issue with the agents or code....
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Thank you! Although it's still really weird how it was failing silently - would it be worth changing the logging level for that error somewhere?

one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

I think I've found a clue after running with debug:

Before Task.init
Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
Retrying (Retry(total=238, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: ...
one month ago
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Nope - confirmed to be running on the OS's Python environment, although he said that the agent was supposed to have it's own user - looking into that now

one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

I managed to set up my (Windows) laptop as a worker and reproduce the issue. Would that suggest an issue with ClearML server?

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc. work, but if there are any obvious things we should check, let me know and I can ask our DevOps engineer

one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Here's what the agent was logging:

 anjum.sayed@M209886    clearml-agent --debug daemon --queue default
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool:
 "PUT /auth.login HTTP/1.1" 200 603
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool:
 "PUT /v2.5/queues.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:
...
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

It’s a Dell XE9680 rack server with 8xH100s which is located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML) which work fine (will check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions; 3.9 (Alma default) and 3.11 which you can see in the log above. It’s a really bizarre issue and outside of print statements I’m not really sure where to look.

You mentioned sync argparse...

one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

@<1523701205467926528:profile|AgitatedDove14> we've now configured the server to have it's own user account to run the agent so it is no longer running as root, but no luck 😞

Before os.environ
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/clearml', 'LOGNAME': 'clearml', 'USER': 'clearml', 'SHELL': '/bin/bash', 'INVOCATION_ID': 'da8e36a03c7348efbb7db360755e92b3', 'JOURNAL_STREAM': '8:244189055', 'SYSTEMD_EXEC_P...
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Hmm no change after adding that unfortunately (confirmed that the change had been added by clearml-agent config ) 😞

one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Ok so my train.py now looks like this:

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

from clearml import Task

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

...
one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

My money is on the Redis container although comparing the logs between Kube & Docker Desktop, nothing looks out of the ordinary...

one month ago
0 Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Thanks for the response @<1523701205467926528:profile|AgitatedDove14> ! The code is a small FMNIST test training job written in PyTorch Lightning. On my local job (laptop GPU, Windows) it completes in ~ 5min. On the server (Linux, H100s) it just hangs at Starting Task Execution: . Neither of these are in Docker.

I would expect to see the standard PL progress bars outputted to the console, but since nothing is outputted, so I'm not sure how to go about debugging this. I've attached the ...

one month ago