AgitatedDove14 The workflow I'm trying to reach is: developing on the development PC and enqueuing the training pipelines to the training server. That's why I used this workflow. If there is a better practice, or if what I was doing is not an intended use case, I'm open to suggestions.
And here is the training script:
` import os
import sys
from torch import Tensor

sys.path.append("/workspace/")

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger
from easyvision.zoo.edsr.model import EDSR
from model import Model
from dataset import TrainingDataset
from clearml import Task, Dataset


def train():
    dataset_path = Dataset.get(
        dataset_id="bcd566344203462b839a7ba08dd9efa7"
    ).get_local_copy()
    task = Task...
AgitatedDove14 yeah, exactly.
AnxiousSeal95 Yeah, it came to my mind. But I guess each agent would have to be launched from a different Ubuntu account, as the agent automatically picks up the config file from the home directory.
Okay, after lots of trial and error, I found that the execution script should be in git too. Uncommitted changes are sent by ClearML automatically, but files that do not exist within the repo are apparently not sent. This is somewhat counter-intuitive.
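One way to catch this before enqueuing is to check whether the entry script is actually tracked by git. The sketch below is not part of ClearML: `is_tracked` is a hypothetical helper that shells out to `git ls-files --error-unmatch`, which exits with a non-zero status for paths that are not tracked. It assumes git is available on the PATH and treats a missing git binary or directory as "not tracked".

```python
import subprocess


def is_tracked(script: str, repo_dir: str = ".") -> bool:
    """Return True if `script` is tracked by the git repo at `repo_dir`.

    Hypothetical helper, not a ClearML API.
    """
    try:
        result = subprocess.run(
            ["git", "ls-files", "--error-unmatch", script],
            cwd=repo_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (FileNotFoundError, NotADirectoryError):
        # git is not installed, or repo_dir does not exist
        return False
    # Exit code 0 means git knows the file; anything else means it is
    # untracked or repo_dir is not inside a git repository.
    return result.returncode == 0
```

Calling `is_tracked("train.py")` from the project directory before creating the task would flag a driver script that exists locally but was never added to the repo.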
AgitatedDove14 I have a training job, very similar to this one: https://clearml.slack.com/archives/CTK20V944/p1654606983176539?thread_ts=1654604976.568279&cid=CTK20V944
TimelyMouse69 Thanks for leading the way
AgitatedDove14 Certainly! This completely aligns with my observations. However, this one should be a feature to work on, and should be fairly easy to implement.
SweetBadger76 Figured it out. It turns out the issue was caused by code written for earlier PyTorch Lightning versions that does not work as intended with the current version. This was causing bad TensorBoard outputs, or no outputs at all.
This does not capture any logging info. Just system monitors
AgitatedDove14 And I'm sending the job via the specified code at the beginning of this thread.
As soon as I launch the job with git, the task marks itself as completed without launching the actual job, even if I mount the volume as I do without git.
AgitatedDove14 AFAIK, ClearML sends the git repo, branch and commit id IF a git repo is present in the working directory, without my needing to specify it. When it does send that information, the clearml agent tries to pull the repo at the specified branch and commit id, and the job proceeds from there. This is what I meant by "git integration". If a git repo is not present in the working directory, the clearml agent just bypasses the "pulling the repo" part, as there is none sp...
AgitatedDove14 Sorry for the very late response. The driver script (the one that initializes the models and starts the training sequence) was not in the git repo; everything else is.
That example shows literally nothing other than the Task.init line, which relies heavily on the user calling the init function to create the task, and on clearml being able to capture tensorboard data. However, I'm trying to create a task without running it on the local computer. So I found creating the task from another script with the Task.create function more convenient. Here is how I create the task from another python file:
` from clearml import Task

task = Task.create(
    project_name="training",
    ...
AgitatedDove14 CostlyOstrich36 That was exactly what I was doing (docker_args="-v /mnt/host:/mnt/container").
AgitatedDove14 Yeah, images logged with tensorboard apparently stay in the experiment container and are not copied anywhere else. I expected them to be moved to the fileserver, just like clearml's own logger output or auto-captured model artifacts. I resolved it by binding a persistent volume into the experiment container and saving the tensorboard logs there.
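A minimal sketch of that workaround, under stated assumptions: the mount paths reuse the `/mnt/host:/mnt/container` example from elsewhere in this thread, while the `tb_logs` subdirectory and the helper names are hypothetical. `pytorch_lightning` is imported lazily inside `make_logger`, so the path helpers can be used without it installed.

```python
import posixpath

HOST_DIR = "/mnt/host"            # directory on the agent machine (example path)
CONTAINER_DIR = "/mnt/container"  # mount point inside the experiment container


def docker_volume_args(host_dir: str, container_dir: str) -> str:
    """Build the bind-mount string to pass as docker_args."""
    return f"-v {host_dir}:{container_dir}"


def tensorboard_root(container_dir: str) -> str:
    """Directory on the mounted volume where TensorBoard logs are written.

    'tb_logs' is a hypothetical subdirectory name.
    """
    return posixpath.join(container_dir, "tb_logs")


def make_logger(run_name: str):
    """Create a TensorBoardLogger that writes to the persistent volume."""
    # Imported here so the helpers above work without pytorch_lightning.
    from pytorch_lightning.loggers import TensorBoardLogger

    return TensorBoardLogger(save_dir=tensorboard_root(CONTAINER_DIR), name=run_name)
```

The string from `docker_volume_args(HOST_DIR, CONTAINER_DIR)` would go into `docker_args` when creating the task, and the training script would pass `make_logger(...)` to the `Trainer`, so the event files land on the host rather than disappearing with the container.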
AgitatedDove14 To elaborate, the code below does not work with git integration activated.
` from clearml import Task

task = Task.create(
    project_name="deneme",
    task_name="git deneme",
    packages=["protobuf==3.20.0"],
    docker="databossds/easyvision",
    docker_args="-v /home/user/awesome_dir:/workspace",
    add_task_init_call=True,
    script="train.py",
)
Task.enqueue(task, "default") `
However, the very same code does work WITHOUT git integration activated.
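For comparison, `Task.create` also accepts explicit `repo`, `branch` and `commit` arguments, so the repository can be stated rather than detected from the working directory. The sketch below is untested against this setup and is not a confirmed fix: the repository URL and branch are hypothetical placeholders, and it assumes `train.py` is committed to that repository. The `clearml` import is kept inside the function so the argument builder can be inspected on its own.

```python
def build_task_kwargs(repo_url: str, branch: str) -> dict:
    """Collect Task.create arguments, with the git source stated explicitly."""
    return {
        "project_name": "deneme",
        "task_name": "git deneme",
        "repo": repo_url,        # hypothetical repository URL
        "branch": branch,        # hypothetical branch name
        "script": "train.py",    # assumed to be committed to the repo
        "packages": ["protobuf==3.20.0"],
        "docker": "databossds/easyvision",
        "docker_args": "-v /home/user/awesome_dir:/workspace",
        "add_task_init_call": True,
    }


def create_and_enqueue(repo_url: str, branch: str, queue: str = "default"):
    """Create the task remotely and push it to a queue (needs a ClearML server)."""
    from clearml import Task

    task = Task.create(**build_task_kwargs(repo_url, branch))
    Task.enqueue(task, queue)
    return task
```

A call such as `create_and_enqueue("https://github.com/example/training.git", "main")` would then create the task with the repository information set up front.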