PreciousParrot26 I think this is really a matter of the CI process having very limited resources. Just to be clear, you are correct and the steps themselves are not executed inside the CI environment, but it seems that even running the pipeline logic is somehow "too much" for the limited resources... Makes sense?
Hi AgitatedDove14!
there is no more storage left to run all those subprocesses
I see. Does this mean the /tmp directory? I might be a little unfamiliar here.
Also, why is so much storage space required to run the subprocesses (nodes)? The run_pipeline_steps_locally flag is set to False (by default) in controller_object.start_locally(). Only the PipelineController should be running locally, right?
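For reference, this is how we call it, a minimal sketch with the flag spelled out explicitly (False is also the default):

controller_object.start_locally(run_pipeline_steps_locally=False)

With that, only the controller should run locally and the steps should go to their execution queues.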
This is the maximum number of simultaneous jobs it will try to launch (it will launch more after the launching is done; notice this limits the launching, not the actual execution), but this is just a way to limit it.
Thanks for the explanation here!
Why are you running from a GitLab runner?
Hi CostlyOstrich36! We are trying to run integration tests on our pipelines, to make sure that changes to tasks during merge requests do not break the pipeline.
AgitatedDove14 Can you please specify which resources we should increase? I haven't been able to observe any depleted resources on the runner while the pipeline is running (semaphores, threads, RAM, cache), but I might be wrong here, since the process crashes as soon as we hit the start_locally call.
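In case it helps the debugging, here is a small diagnostic sketch we could drop in right before start_locally to log free space on the temp filesystem and confirm where the Errno 28 comes from:

import shutil
import tempfile

# Log free space on the temp filesystem just before the crashing call.
tmp = tempfile.gettempdir()
usage = shutil.disk_usage(tmp)
print(f"{tmp}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")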
Based on the log you have shared:
OSError: [Errno 28] No space left on device
I would increase the storage?
https://github.community/t/github-actions-failing-with-errno-28-no-space-left-on-device/18164/10
https://stackoverflow.com/questions/70175977/multiprocessing-no-space-left-on-device
https://groups.google.com/g/ansible-project/c/4U6MyvyvthQ
I would start by increasing the size of the TMPDIR folder
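For example (a sketch, assuming the runner has a larger volume mounted somewhere; the path below is illustrative), you can repoint TMPDIR before anything creates temporary files:

import os
import tempfile

# Hypothetical larger volume on the runner; adjust the path to your setup.
os.environ["TMPDIR"] = "/mnt/big-disk/tmp"
os.makedirs(os.environ["TMPDIR"], exist_ok=True)

# tempfile reads TMPDIR on first use, so set it before starting the pipeline.
print(tempfile.gettempdir())  # -> /mnt/big-disk/tmp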
Hi PreciousParrot26,
Why are you running from a GitLab runner? Are you interested in specific action triggers?
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of simultaneous jobs it will try to launch (it will launch more after the launching is done; notice this limits the launching, not the actual execution), but this is just a way to limit it.
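To illustrate the pattern (a generic sketch, not ClearML's actual internals): the pool only caps how many launch calls are in flight at once, while the jobs themselves execute elsewhere, e.g. on the agents.

from multiprocessing.pool import ThreadPool

def launch_step(step):
    # e.g. clone the base task and push it into an execution queue
    print(f"enqueueing {step}")

steps = [f"step-{i}" for i in range(100)]
with ThreadPool(16) as pool:      # at most 16 launches in flight at a time
    pool.map(launch_step, steps)  # all 100 steps still get launched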
AgitatedDove14 We are using the PipelineController class, with Tasks as steps.
Running this on a laptop, we could observe on the Web UI that the pipeline task was running locally, and the tasks on the agents.
The same script, however, crashes on the runner with the OSError.
pipe_task = Task.init(
    project_name=project_name,
    task_name=pipeline_name,
    task_type=Task.TaskTypes.controller,
    reuse_last_task_id=False,
    output_uri=None,
)
...
pipe = PipelineController(
    name=pipeline_name,
    project=tasks_folder,
    version=pipeline_version,
    add_pipeline_tags=True,
    target_project=tasks_folder,
    abort_on_failure=True,
)
...
pipe.add_step(
    name=step_name,
    parents=[],
    base_task_project=project_name,
    execution_queue=large_task_queue,
    parameter_override=step_args,
    base_task_factory=lambda node: Task.create(**step_constants),
)
pipe.start_locally()  # Crash on runner
controller_object.start_locally(). Only the PipelineController should be running locally, right?
Correct, but do notice that if you are using the Pipeline decorator and calling run_locally(), the pipeline steps themselves are also executed locally.
Which of the two are you using (Tasks as steps, or functions as steps with the decorator)?
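For contrast, a minimal sketch of the decorator flavor (the names here are illustrative, not from your code); with this API, run_locally() executes the controller and every step on the local machine:

from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def step_one():
    return 42

@PipelineDecorator.pipeline(name="demo-pipeline", project="demo", version="1.0")
def run_pipeline():
    print(step_one())

if __name__ == "__main__":
    # Unlike PipelineController.start_locally(run_pipeline_steps_locally=False),
    # this runs the controller AND the steps locally (steps as subprocesses).
    PipelineDecorator.run_locally()
    run_pipeline()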
kernel.sem = 50100 128256000 50100 2560
I don't think the semaphores should be depleted.
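For reference, the four kernel.sem fields are, in order, SEMMSL, SEMMNS, SEMOPM, and SEMMNI; a quick sketch to read them back on the runner:

# kernel.sem fields, in order: SEMMSL, SEMMNS, SEMOPM, SEMMNI
with open("/proc/sys/kernel/sem") as f:
    semmsl, semmns, semopm, semmni = map(int, f.read().split())
print(semmsl, semmns, semopm, semmni)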
That example is quite large. We are not doing anything close to that, or even downloading any datasets/artifacts on the runner, and we have ~40GB available in the /tmp directory.
We can try to further increase the storage if there are no other ideas, but if that fixes anything, it would mean there is a bug in the ClearML SDK. So much storage shouldn't be needed just to run the controller, from my perspective.