PreciousParrot26 I think this is really a matter of the CI process having very limited resources. Just to be clear, you are correct and the steps themselves are not executed inside the CI environment, but it seems that even running the pipeline logic is somehow "too much" for the limited resources... Makes sense?
Hi AgitatedDove14!
there is no more storage left to run all those subprocesses
I see. Does this mean the /tmp directory? I might be a little unfamiliar here.
Also, why is so much storage space required to run the subprocesses (nodes)? The run_pipeline_steps_locally flag is set to False (by default) in controller_object.start_locally(). Only the PipelineController should be running locally, right?
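For reference, this is how we call it, a minimal sketch with the flag spelled out explicitly (False is also the default):

controller_object.start_locally(run_pipeline_steps_locally=False)

With that, only the controller should run locally and the steps should go to their execution queues.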
This is the maximum number of simultaneous jobs it will try to launch (it will launch more after the launching is done; notice this limits the launching, not the actual execution), but this is just a way to limit it.
Thanks for the explanation here!
Why are you running from a GitLab runner?
Hi CostlyOstrich36! We are trying to run integration tests on our pipelines, to make sure that changes to tasks during merge requests do not break the pipeline.
AgitatedDove14 Can you please specify which resources we should increase? I haven't been able to observe any depleted resources on the runner while the pipeline is running (semaphores, threads, RAM, cache), but I might be wrong here, since the process crashes as soon as we hit the start_locally call.
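In case it helps the debugging, here is a small diagnostic sketch we could drop in right before start_locally to log free space on the temp filesystem and confirm where the Errno 28 comes from:

import shutil
import tempfile

# Log free space on the temp filesystem just before the crashing call.
tmp = tempfile.gettempdir()
usage = shutil.disk_usage(tmp)
print(f"{tmp}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")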
Based on the log you have shared:
OSError: [Errno 28] No space left on device
I would increase the storage?
https://github.community/t/github-actions-failing-with-errno-28-no-space-left-on-device/18164/10
https://stackoverflow.com/questions/70175977/multiprocessing-no-space-left-on-device
https://groups.google.com/g/ansible-project/c/4U6MyvyvthQ
I would start by increasing the size of the TMPDIR folder
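For example (a sketch, assuming the runner has a larger volume mounted somewhere; the path below is illustrative), you can repoint TMPDIR before anything creates temporary files:

import os
import tempfile

# Hypothetical larger volume on the runner; adjust the path to your setup.
os.environ["TMPDIR"] = "/mnt/big-disk/tmp"
os.makedirs(os.environ["TMPDIR"], exist_ok=True)

# tempfile reads TMPDIR on first use, so set it before starting the pipeline.
print(tempfile.gettempdir())  # -> /mnt/big-disk/tmp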
Hi PreciousParrot26,
Why are you running from a GitLab runner? Are you interested in specific action triggers?
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of simultaneous jobs it will try to launch (it will launch more after the launching is done; notice this limits the launching, not the actual execution), but this is just a way to limit it.
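To illustrate the pattern (a generic sketch, not ClearML's actual internals): the pool only caps how many launch calls are in flight at once, while the jobs themselves execute elsewhere, e.g. on the agents.

from multiprocessing.pool import ThreadPool

def launch_step(step):
    # e.g. clone the base task and push it into an execution queue
    print(f"enqueueing {step}")

steps = [f"step-{i}" for i in range(100)]
with ThreadPool(16) as pool:      # at most 16 launches in flight at a time
    pool.map(launch_step, steps)  # all 100 steps still get launched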
AgitatedDove14 We are using the PipelineController class, with Tasks as steps.
Running this on a laptop, we could observe on the Web UI that the pipeline task was running locally, and the tasks on the agents.
The same script, however, crashes on the runner with the OSError.
pipe_task = Task.init(
    project_name=project_name,
    task_name=pipeline_name,
    task_type=Task.TaskTypes.controller,
    reuse_last_task_id=False,
    output_uri=None,
)
...
pipe = PipelineController(
    name=pipeline_name,
    project=tasks_folder,
    version=pipeline_version,
    add_pipeline_tags=True,
    target_project=tasks_folder,
    abort_on_failure=True,
)
...
pipe.add_step(
    name=step_name,
    parents=[],
    base_task_project=project_name,
    execution_queue=large_task_queue,
    parameter_override=step_args,
    base_task_factory=lambda node: Task.create(**step_constants),
)
pipe.start_locally()  # Crash on runner
controller_object.start_locally(). Only the PipelineController should be running locally, right?
Correct, but do notice that if you are using the Pipeline decorator and calling run_locally(), the pipeline steps themselves are also executed locally.
Which of the two are you using (Tasks as steps, or functions as steps with the decorator)?
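For contrast, a minimal sketch of the decorator flavor (the names here are illustrative, not from your code); with this API, run_locally() executes the controller and every step on the local machine:

from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def step_one():
    return 42

@PipelineDecorator.pipeline(name="demo-pipeline", project="demo", version="1.0")
def run_pipeline():
    print(step_one())

if __name__ == "__main__":
    # Unlike PipelineController.start_locally(run_pipeline_steps_locally=False),
    # this runs the controller AND the steps locally (steps as subprocesses).
    PipelineDecorator.run_locally()
    run_pipeline()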
kernel.sem = 50100 128256000 50100 2560
I don't think the semaphores should be depleted.
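For reference, the four kernel.sem fields are, in order, SEMMSL, SEMMNS, SEMOPM, and SEMMNI; a quick sketch to read them back on the runner:

# kernel.sem fields, in order: SEMMSL, SEMMNS, SEMOPM, SEMMNI
with open("/proc/sys/kernel/sem") as f:
    semmsl, semmns, semopm, semmni = map(int, f.read().split())
print(semmsl, semmns, semopm, semmni)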
That example is quite large. We are not doing anything close to that, or even downloading any datasets/artifacts on the runner, and we have ~40GB available in the /tmp directory.
We can try to further increase the storage if there are no other ideas, but if that fixes anything, it would mean there is a bug in the ClearML SDK. So much storage shouldn't be needed just to run the controller, from my perspective.