Hi everyone. I have an issue with a simple pipeline. It runs two similar NN training steps (TF 2.3, Windows 10, Python 3.7) that differ only in batch size. I first run each step separately so that they appear on the ClearML project page. Then I run the pipeline controller, which clones each step and runs smoothly. If I run the pipeline from the command line again, it works OK. However, if I clone and enqueue the pipeline, it starts, creates the clone of the first step in a pending state, and then nothing happens: the first step remains pending and never starts. Can anyone help with this issue? Here's the pipeline controller code:
`from clearml import Task
from clearml.automation.controller import PipelineController

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name='Tom', task_name='test pipeline',
                 task_type=Task.TaskTypes.controller, reuse_last_task_id=False)

pipe = PipelineController(default_execution_queue='default', add_pipeline_tags=False)
pipe.add_step(name='train_1st_nn_copy', base_task_project='Tom', base_task_name='train_1st_nn',
              parameter_override={'batch_size': 8})
pipe.add_step(name='train_2nd_nn_copy', parents=['train_1st_nn_copy', ],
              base_task_project='Tom', base_task_name='train_2nd_nn',
              parameter_override={'batch_size': 4})

# Starting the pipeline (in the background)
pipe.start()

# Wait until pipeline terminates
pipe.wait()

# cleanup everything
pipe.stop()

print('done')`

If I abort the pipeline controller task, the pending "train_1st_nn" task executes OK.

Posted 3 years ago

Answers 31


if I simply clone the nn training stage and run it in the default queue - everything goes fine.

When you compare the Task you cloned manually with the Task created by the pipeline, what's the difference?
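One way to make that comparison less error-prone than eyeballing the two Tasks in the UI is to diff their exported definitions. The sketch below is only a generic helper for nested dictionaries; `diff_dicts` and the sample values are made up for illustration and are not part of ClearML.

```python
def diff_dicts(a, b, prefix=""):
    """Return {dotted_path: (value_in_a, value_in_b)} for every leaf that differs.

    A key missing on one side is reported with None on that side, so a missing
    key and an explicit None are treated as equal.
    """
    diffs = {}
    for key in sorted(set(a) | set(b), key=str):
        path = "{}.{}".format(prefix, key) if prefix else str(key)
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            # Recurse into nested sections instead of reporting the whole dict
            diffs.update(diff_dicts(va, vb, path))
        elif va != vb:
            diffs[path] = (va, vb)
    return diffs


if __name__ == "__main__":
    # Hypothetical stand-ins for two exported Task definitions (not real data)
    manual_clone = {
        "name": "train_1st_nn",
        "hyperparams": {"batch_size": "8"},
    }
    pipeline_clone = {
        "name": "train_1st_nn_copy",
        "hyperparams": {"batch_size": "8"},
    }
    for path, (manual, pipelined) in diff_dicts(manual_clone, pipeline_clone).items():
        print("{}: manual={!r} pipeline={!r}".format(path, manual, pipelined))
```

With a server available, the two input dictionaries could come from something like `Task.get_task(task_id=...).export_task()`, called once with the ID of the manual clone and once with the ID of the clone the pipeline created. That assumes a `clearml` version that provides `export_task()`; the IDs themselves have to be copied from the UI.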

Posted 3 years ago