Hi ZanyPig66
I used tensorboard as clearml claims to auto-capture tensorboard outputs, but it was a no go.
The auto TB logging should work out of the box. Where is it failing?
Also, regarding task = Task.current_task():
Why aren't you using Task.init in the original script?
The idea is that you run your code on your machine (where the environment works), ClearML auto detects code + python packages + args etc.
Then you clone it in the UI and launch it on a remote machine.
What am I missing here?
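For reference, here is a minimal sketch of the workflow described above, assuming `clearml`, `torch` and `tensorboard` are installed and a ClearML server is configured. The project name, task name, log directory and the `synthetic_loss` helper are placeholders made up for illustration. They are not taken from the original script.

```python
def synthetic_loss(step: int) -> float:
    """Stand-in for a real training loss; decays as the step count grows."""
    return 1.0 / (step + 1)


def run_training(num_steps: int = 10) -> None:
    # Imported lazily so this file can be imported without ClearML/PyTorch installed.
    from clearml import Task
    from torch.utils.tensorboard import SummaryWriter

    # Task.init registers the run and turns on automatic framework logging,
    # which includes TensorBoard. Names below are placeholders.
    task = Task.init(project_name="examples", task_name="tensorboard-autolog")

    # Scalars written through the TensorBoard writer should also show up
    # in the task's scalars in the ClearML UI.
    writer = SummaryWriter(log_dir="logs")
    for step in range(num_steps):
        writer.add_scalar("train/loss", synthetic_loss(step), step)
    writer.close()
    task.close()
```

Calling `run_training()` once on the development machine should produce a task that can then be cloned in the UI and sent to a remote queue.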
EDIT:
So, I found creating task from another script with Task.create function more convenient. Here is how I create the task from another python file:
Understood. Are you saying the auto logging does not work when running on the agent? That seems odd to me. Could it be that TB was not installed? Any chance you can provide the log of the execution?
Hi,
ClearML does indeed have TensorBoard auto reporting. I suggest you have a look here, where you can find links to some examples: https://clear.ml/docs/latest/docs/fundamentals/logger#automatic-reporting-examples
You could also have a look at the PyTorch Lightning integration example here:
https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch-lightning/pytorch_lightning_example.py
If you run into an issue, can you send me a snippet so that I can better understand what is happening? Thanks
This does not capture any logging info. Just system monitors
And here is the training script:
` import os
import sys

from torch import Tensor

# Make the code mounted into the container importable.
sys.path.append("/workspace/")

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger
from easyvision.zoo.edsr.model import EDSR
from model import Model
from dataset import TrainingDataset
from clearml import Task, Dataset


def train():
    # Fetch a local copy of the ClearML dataset.
    dataset_path = Dataset.get(
        dataset_id="bcd566344203462b839a7ba08dd9efa7"
    ).get_local_copy()

    # Returns the task created by Task.init (None if no task was initialized).
    task = Task.current_task()

    params = {
        "batch_size": 8,
        "gpus": 1,
        "auto_select_gpu": True
    }
    # Register the hyperparameters so they can be overridden from the UI.
    params = task.connect(params)
    print(params)

    # NOTE: SRDataset and PLWrapper are not imported in this snippet as posted.
    dataset = SRDataset(
        path=dataset_path
    )
    model = Model()
    detector_sr = PLWrapper({
        "model": model,
        "dataset_train": dataset,
        "dataset_val": dataset,
        "batch_size_train": params.get("batch_size", 4),
        "batch_size_val": params.get("batch_size", 4),
        "num_workers": params.get("batch_size", 4)
    })

    trainer = Trainer(
        check_val_every_n_epoch=1,
        num_sanity_val_steps=0,
        gpus=int(params.get("gpus", 1)),
        benchmark=True,
        auto_select_gpus=bool(params.get("auto_select_gpu", True)),
        logger=TensorBoardLogger(save_dir="logs")
    )
    trainer.fit(detector_sr)


if __name__ == "__main__":
    train() `
SweetBadger76 Figured it out. It turns out the issue was that code written for earlier PyTorch Lightning versions does not work as intended with the current version. This was causing bad TensorBoard outputs, or no outputs at all.
That example shows literally nothing other than the Task.init line, which relies heavily on the user calling the init function to create the task, and on ClearML being able to capture the TensorBoard data. However, I'm trying to create a task without running it on the local computer. So, I found creating the task from another script with the Task.create function more convenient. Here is how I create the task from another Python file:
` from clearml import Task

task = Task.create(
    project_name="training",
    task_name="training",
    packages=["protobuf==3.20.0"],
    docker="mydockerimage",
    docker_args="-v /home/username/code:/workspace",
    add_task_init_call=True,
    script="train.py",
) `
AgitatedDove14 The workflow I'm trying to reach is: developing on the development PC and enqueuing the training pipelines to the training server. That's why I went with this workflow. If there is a better practice, or if what I was doing is not an intended use case, I'm open to suggestions.
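One way the "create on the development PC, run on the training server" flow could look is sketched below. It reuses the Task.create arguments from the snippet above and then pushes the task to an agent queue with Task.enqueue. The queue name "default" and the helper names `build_task_spec` and `create_and_enqueue` are assumptions for illustration; use whichever queue your agent is listening on.

```python
def build_task_spec() -> dict:
    """Arguments for Task.create, copied from the snippet in this thread."""
    return {
        "project_name": "training",
        "task_name": "training",
        "packages": ["protobuf==3.20.0"],
        "docker": "mydockerimage",
        "docker_args": "-v /home/username/code:/workspace",
        "add_task_init_call": True,
        "script": "train.py",
    }


def create_and_enqueue(queue_name: str = "default"):
    # Imported lazily so this file can be imported without ClearML installed.
    from clearml import Task

    # Create the task definition without executing train.py locally.
    task = Task.create(**build_task_spec())

    # Hand the task over to an agent queue; the queue name is an assumption.
    Task.enqueue(task, queue_name=queue_name)
    return task
```

Running `create_and_enqueue()` from the development machine should leave the task pending in the queue until an agent picks it up.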
Thanks for the info. I'll check that and come back to you.