Thank you! Although it's still really weird how it was failing silently - would it be worth changing the logging level for that error somewhere?
Thanks John, but is there a way to do this via the CLI?
Or is Task.init() the only way?
Also is there a way to disable this by default?
The reason I ask is that I want to send many jobs to a queue via the CLI, so I don't really want to be messing around with Task.init().
I've even tried renaming my files to *pth and *.data to stop this behaviour
Which auto_connect_* arg do I use, and what value do I set it to? At the end of my training run I'm making .png plots of everything in my test set, and I don't want these to be logged as artifacts.
It's not covered here either: None
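For reference, the argument in question is most likely auto_connect_frameworks on Task.init, which accepts either a bool or a per-framework dict of flags. A minimal sketch, assuming the .png plots come from matplotlib and the checkpoints from torch.save; the project and task names below are placeholders:

from clearml import Task

task = Task.init(
    project_name="my-project",   # placeholder
    task_name="my-experiment",   # placeholder
    # Per-framework overrides; anything not listed keeps its default (enabled).
    auto_connect_frameworks={
        "matplotlib": False,  # don't auto-capture matplotlib figures
        "pytorch": False,     # don't auto-upload torch.save checkpoints
    },
)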
Thanks Martin - will try that and see what I can find. Really appreciate your patience with this! 🙂
Hi @<1523701087100473344:profile|SuccessfulKoala55> thanks for the reply! The output above is from grep -i network /var/log/syslog on the machine running the agent. That's good to hear that ClearML is pretty resilient to network outages 🙂. Do you have any suggestions on how we can start tracking down the cause of this?
This is the only clue that was logged to the console in the ClearML server: 2024-11-21 06:57:13 Process terminated by user. The first errors in the agent logs appea...
Ok so my train.py now looks like this:
print("Before import")
from pathlib import Path
import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig
from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks
from clearml import Task
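# Print the name of every CUDA device visible to this process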
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)
...
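For completeness, a hedged sketch of how the rest of such a train.py might continue, showing where Task.init would typically sit inside the Hydra entry point; the config path/name, the constructor signatures of JobDataModule and JobModel, and the trainer settings are assumptions, not taken from the original file (Hydra >= 1.2 is assumed for version_base):

@hydra.main(config_path="conf", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # Create the ClearML task first so everything after it is captured.
    task = Task.init(
        project_name="my-project",   # placeholder
        task_name=generate_slug(2),  # human-readable run name from coolname
    )

    datamodule = JobDataModule(cfg)  # assumed constructor signature
    model = JobModel(cfg)            # assumed constructor signature

    trainer = L.Trainer(max_epochs=cfg.get("max_epochs", 1))
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()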