If I look at the code of the clearml controller.py, I see that it expects additional code in a relative folder
I created the pipeline on another machine via an interactive Python shell. The pipeline is picked up by ClearML, as I can see it in the web UI.
Initially, I had only one queue and one worker set up. If the pipeline's default execution queue is the same as the queue passed to pipe.start('the queue'), it gets into a sort of deadlock and waits forever
my worker node is not running in Docker; it is a Linux machine with a conda environment
if I go to the folder mentioned in the error and then one level up, I see no other packages present
but the behavior is different depending on whether you kick it off from a local Jupyter notebook or from a Python script
Full console log of the worker:
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue b5fe1e72614247f7a77e5f6cdac35580
No tasks in Queues, sleeping for 5.0 seconds
task 30ad27a7a1244b6e8aa722d81cb6015c pulled from b5fe1e72614247f7a77e5f6cdac35580 by worker NLEIN-315GNH2:0
Running task '30ad27a7a1244b6e8aa722d81cb6015c'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.sppvun4p.txt', '/tmp/.clearml_agent_out.sppvun4p.txt'
Current configuration (clearml_agent v1.4.1, location:...
The notebook behavior is indeed how I expect it to work; the behavior via the script is strange
(the 'remo2' task is an existing experiment)
however, I did notice another issue.
looks like it's missing some dependencies
FYI: this is my pipeline script
from clearml import PipelineController
pipe = PipelineController(name="My Pipe", project="Gridsquare-Training", version="0.0.5")
pipe.add_step(name="pipe step 1", base_task_project="Gridsquare-Training", base_task_name="remo2")
pipe.add_step(name="pipe step 2", base_task_project="Gridsquare-Training", base_task_name="remo2", parents=["pipe step 1"])
pipe.set_default_execution_queue("myqueue")
pipe.start("service")
when I set up two queues and two workers, set the default execution queue to one queue, and use the other queue for pipe.start, it all works
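The deadlock described above can be reproduced with a toy model that has nothing to do with ClearML internals: a "controller" job is pulled from a queue, enqueues its "step" into the same queue, and then blocks waiting for the step. With a single worker the step can never start while the controller occupies the worker, which is why a separate queue (or worker) for the controller resolves it. This is a minimal sketch, assuming bounded waits so the demo terminates:

```python
import queue
import threading

def run_pipeline(num_workers):
    """Toy model of the scheduling deadlock: the controller enqueues its
    step into the SAME queue it was pulled from, then waits for it."""
    q = queue.Queue()
    step_done = threading.Event()
    result = {}

    def step_job():
        step_done.set()  # the pipeline step completing its work

    def controller_job():
        q.put(step_job)  # controller schedules its step on the same queue
        # Controller blocks until the step reports back. A bounded wait is
        # used so the demo terminates; a real deadlock would wait forever.
        result["ok"] = step_done.wait(timeout=1.0)

    def worker():
        while True:
            try:
                job = q.get(timeout=0.2)
            except queue.Empty:
                return  # queue drained, worker exits
            job()

    q.put(controller_job)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result.get("ok", False)

print(run_pipeline(1))  # False: the only worker is busy running the controller
print(run_pipeline(2))  # True: a free worker picks up the step
```

With one worker the controller times out because nothing is left to execute the step; with two workers (analogous to a separate services queue/worker for the controller) the step runs and the pipeline completes.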
I do not get more information than I just showed
but if I run exactly the same code from a Python script (which also calls start on the pipeline), the worker node tries to check out the script and run that (or fails if you haven't checked it into git yet)
in the case of the local Jupyter notebook, I create the pipeline and when I start it, it all works without needing to add the notebook to git
Hi John, I've done more experiments and found that this only happens if you try to run the pipeline remotely directly from the Python interpreter
the error occurs in the worker node when it tries to initialize the environment for the pipeline