DepravedBee82
Moderator
6 Questions, 38 Answers
  Active since 19 July 2024
  Last activity 4 days ago

Reputation

0

Badges 1

38 × Eureka!
0 Votes
3 Answers
831 Views
Hi all - I have a large dataset and have preprocessed it and saved each item in .pt files, which are loaded using torch.load in my Dataset . The issue is tha...
11 months ago
0 Votes
3 Answers
36 Views
Hi all, is it possible to pass Hydra args via the clearml-task CLI? Using --args doesn't seem to work as it should with Hydra - they do appear as args but ar...
5 days ago
0 Votes
4 Answers
718 Views
Hi all, what is the best way of getting ClearML to pull code from GitHub repos? At the moment we can pull using a user's SSH credentials, but AFAIK it's not p...
6 months ago
0 Votes
6 Answers
1K Views
11 months ago
0 Votes
3 Answers
731 Views
Hi all, is there a way to completely disable all artifact logging?
11 months ago
0 Votes
46 Answers
134K Views
Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but...
one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Hi @<1523701205467926528:profile|AgitatedDove14> , here's my code with some more prints:

from clearml import Task

print("Before Task.init")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path...
one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

@<1523701205467926528:profile|AgitatedDove14> we've now configured the server to have its own user account to run the agent so it is no longer running as root, but no luck 😞

Before os.environ
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/clearml', 'LOGNAME': 'clearml', 'USER': 'clearml', 'SHELL': '/bin/bash', 'INVOCATION_ID': 'da8e36a03c7348efbb7db360755e92b3', 'JOURNAL_STREAM': '8:244189055', 'SYSTEMD_EXEC_P...
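Since the thread contains one environment dump taken as root and one taken as the clearml user, a quick diff can show exactly what changed between the two runs. This is a minimal sketch, not part of the original post: `diff_env` is a hypothetical helper, and the two dicts contain only the keys that are fully visible in both dumps (the rest is truncated in the thread).

```python
def diff_env(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys that differ or are missing on one side."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Subset of the dump taken while the agent ran as root.
as_root = {
    "LANG": "en_GB.UTF-8",
    "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin",
    "HOME": "/root",
    "LOGNAME": "root",
    "USER": "root",
    "SHELL": "/bin/bash",
}

# Subset of the dump taken while the agent ran as the clearml user.
as_clearml = {
    "LANG": "en_GB.UTF-8",
    "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin",
    "HOME": "/home/clearml",
    "LOGNAME": "clearml",
    "USER": "clearml",
    "SHELL": "/bin/bash",
}

for key, (root_value, user_value) in diff_env(as_root, as_clearml).items():
    print(f"{key}: {root_value!r} -> {user_value!r}")
```

For a full comparison, `dict(os.environ)` could be captured in both runs and passed to the same helper.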
one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

My money is on the Redis container although comparing the logs between Kube & Docker Desktop, nothing looks out of the ordinary...

one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Yes the agent is running in venv mode afaik. As for why it’s running as root - I’ll ask our engineer …

one year ago
0 Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent which we use to queue jobs. For some reason, our kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worke

Hi all, we're still suffering this issue where jobs are seemingly randomly aborted. The only clue is this in the ClearML logs:

2024-12-13 06:16:30  Process terminated by user

The only pattern we can see is that it typically happens around 6-7am.

Any suggestions on how to debug this would be greatly appreciated!
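One way to check the 6-7am pattern is to bucket the abort timestamps by hour. This is a minimal sketch, assuming the console log has been exported as plain text lines and that abort lines start with a `YYYY-MM-DD HH:MM:SS` timestamp followed by `Process terminated by user`, as in the lines quoted in this thread. `abort_hours` is a hypothetical helper, not a ClearML API.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as "2024-12-13 06:16:30  Process terminated by user".
ABORT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+Process terminated by user"
)


def abort_hours(log_lines):
    """Count abort events per hour of day (0-23)."""
    hours = Counter()
    for line in log_lines:
        match = ABORT_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            hours[stamp.hour] += 1
    return hours


# The two abort lines quoted in this thread, plus one unrelated line.
sample = [
    "2024-12-13 06:16:30  Process terminated by user",
    "2024-11-21 06:57:13 Process terminated by user",
    "some other console output",
]

for hour, count in sorted(abort_hours(sample).items()):
    print(f"{hour:02d}:00  {count} abort(s)")
```

If most aborts land in the same hour, anything scheduled on the worker around that time would be a reasonable next place to look.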

10 months ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

I think I've found a clue after running with debug:

Before Task.init
Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
Retrying (Retry(total=238, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: ...
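To confirm the certificate problem outside of ClearML, the TLS handshake can be tried directly with the standard library. This is a minimal sketch under assumptions: `build_context` and `probe` are hypothetical helpers, and the CA bundle path passed as `cafile` would be whatever file holds the internal issuer certificate.

```python
import socket
import ssl


def build_context(cafile=None):
    """Create a verifying TLS context.

    With cafile=None the system trust store is used; with a path, only
    the CA certificates in that file are trusted.
    """
    return ssl.create_default_context(cafile=cafile)


def probe(host, port=443, cafile=None, timeout=5.0):
    """Return None if the handshake verifies, otherwise the error text."""
    context = build_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return None
    except (ssl.SSLError, OSError) as exc:
        return str(exc)
```

Calling `probe("api.clearml.dev.mrl")` (the API host that appears in the agent debug log in this thread) from the worker should reproduce the `CERTIFICATE_VERIFY_FAILED` message if the issuer is missing from the trust store; calling it again with `cafile` set to the internal CA bundle shows whether that bundle fixes verification.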
one year ago
0 Hi all, is there a way to completely disable all artifact logging?

Which auto_connect_* arg do I use and what value do I set it to? At the end of my training run I'm making .png plots of everything in my test set, and I don't want these to be logged as artifacts.

It's not covered here either: None

11 months ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Thanks for the response @<1523701205467926528:profile|AgitatedDove14> ! The code is a small FMNIST test training job written in PyTorch Lightning. On my local job (laptop GPU, Windows) it completes in ~5 min. On the server (Linux, H100s) it just hangs at Starting Task Execution: . Neither of these is in Docker.

I would expect to see the standard PL progress bars outputted to the console, but since nothing is outputted, I'm not sure how to go about debugging this. I've attached the ...

one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

I managed to set up my (Windows) laptop as a worker and reproduce the issue. Would that suggest an issue with ClearML server?

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc. work, but if there are any obvious things we should check, let me know and I can ask our DevOps engineer

one year ago
0 Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent which we use to queue jobs. For some reason, our kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worke

It seems like the worker lost network connectivity, and then aborted the jobs 😞

2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.
2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)
2024-11-21T06:57:15.810747+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_...
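To line up syslog entries like these with the moment a job was aborted, the entries can be filtered to a window around the abort time. This is a minimal sketch, assuming each syslog line starts with an ISO 8601 timestamp as above. `events_near` is a hypothetical helper, and treating the ClearML console timestamp (2024-11-21 06:57:13, quoted elsewhere in this thread) as UTC is an assumption.

```python
from datetime import datetime, timezone


def events_near(lines, when, window_s=120):
    """Return (timestamp, rest_of_line) for lines within window_s seconds of `when`."""
    hits = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # Skip lines that do not start with an ISO timestamp.
        if abs((ts - when).total_seconds()) <= window_s:
            hits.append((ts, rest))
    return hits


# The two complete syslog lines quoted above.
syslog = [
    "2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.",
    "2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)",
]

# Abort time from the ClearML console, assumed here to be UTC.
abort_time = datetime(2024, 11, 21, 6, 57, 13, tzinfo=timezone.utc)

for ts, message in events_near(syslog, abort_time):
    print(ts.isoformat(), message)
```

Both network errors fall roughly 71 seconds before the abort, so they show up with the default two-minute window but not with a one-minute one.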
11 months ago
0 Hi all - I have a large dataset and have preprocessed it and saved each item in .pt files, which are loaded using

Also is there a way to disable this by default?

The reason I ask is that I want to send many jobs to a queue via the CLI, so I don't really want to be messing around with Task.init() .

I've even tried renaming my files to *.pth and *.data to stop this behaviour

11 months ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

It’s a Dell XE9680 rack server with 8xH100s which is located in our office, running AlmaOS. We have successfully run training jobs on it inside Docker (without ClearML) which work fine (will check with my team if we’ve got something to train without Docker). I’ve also tried different Python versions; 3.9 (Alma default) and 3.11 which you can see in the log above. It’s a really bizarre issue and outside of print statements I’m not really sure where to look.

You mentioned sync argparse...

one year ago
0 Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent which we use to queue jobs. For some reason, our kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worke

Hi @<1523701087100473344:profile|SuccessfulKoala55> thanks for the reply! The output above is from grep -i network /var/log/syslog on the machine running the agent. That's good to hear that clearml is pretty resilient to network outages 🙂 . Do you have any suggestions on how we can start tracking down the cause of this?

This is the only clue that was logged to the console in clearml server: 2024-11-21 06:57:13 Process terminated by user . The first errors on the agent logs appea...

11 months ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Hmm no change after adding that unfortunately (confirmed that the change had been added by clearml-agent config ) 😞

one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Ah yes you were right, it does still print on remote. Here you go:

environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/root', 'LOGNAME': 'root', 'USER': 'root', 'SHELL': '/bin/bash', 'INVOCATION_ID': '2cf51dc43b78470cb14c29f5f653ee18', 'JOURNAL_STREAM': '8:224108', 'SYSTEMD_EXEC_PID': '134947', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARM...
one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Hi @<1523701205467926528:profile|AgitatedDove14> , I reordered the imports:

from clearml import Task

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_sl...
one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

Here's what the agent was logging:

 anjum.sayed@M209886    clearml-agent --debug daemon --queue default
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool: "PUT /auth.login HTTP/1.1" 200 603
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool: "PUT /v2.5/queues.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:
...
one year ago
4 days ago
0 Hi all - I have a large dataset and have preprocessed it and saved each item in .pt files, which are loaded using

Thanks John, but is there a way to do this via the CLI?

Or is Task.init() the only way?

11 months ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

I just ran with this in my local task, and all the env vars were printed to console, but in ClearML they are not in the console log. Presumably that's because it's printed before ClearML is logging?

one year ago
0 Hi all, I've successfully run a Task locally, and now I'm trying to clone it and send it to a Queue. It looks like the environment is built successfully, but it hangs here:

He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?

Edit: it might be a container! Just checking now...

one year ago
0 Hi all, is there a way to completely disable all artifact logging?

I was hoping something like output_uri=False would work, but looking at the source code, I don't think that would work @<1523701070390366208:profile|CostlyOstrich36>

11 months ago