
I have built a custom Docker image and execution script so that I can use Conda as the package manager when installing Python packages for job execution. Environment installation works fine; however, when the model training runs, I get the following error from the Docker container:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
(the same line is repeated for each worker)
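For context, the training code uses a PyTorch DataLoader, whose worker processes hand batches back to the main process through /dev/shm. A minimal sketch of the pattern that triggers the error (the dataset and tensor shapes are placeholders, not my actual code):

    import torch
    from torch.utils.data import DataLoader, Dataset

    class RandomImages(Dataset):
        """Stand-in for a real image dataset (placeholder)."""
        def __len__(self):
            return 10_000
        def __getitem__(self, idx):
            # fake 3x224x224 image and a class label
            return torch.rand(3, 224, 224), idx % 10

    # num_workers > 0 spawns worker processes that exchange tensors with
    # the main process through shared memory (/dev/shm); with Docker's
    # small default shm size this is what produces the bus error above.
    loader = DataLoader(RandomImages(), batch_size=64, num_workers=4)
    for images, labels in loader:
        pass  # training step goes here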
Q. Is there a Docker engine setting I need to change, something I can set in the container init bash script, or something I can put in clearml.conf to pass an argument to the docker run command?

  
  
Posted 4 years ago

15 Answers


Basically, it gives the container direct access to the host's IPC namespace, which is why it is considered less safe (the same kind of host access exists on other levels as well, like the network)

  
  
Posted 4 years ago

Seriously though, thank you.

  
  
Posted 4 years ago

This appears to confirm it as well.

https://github.com/pytorch/pytorch/issues/1158

Thanks AgitatedDove14 , you're very helpful.

  
  
Posted 4 years ago

I believe the default shared-memory (/dev/shm) allocation for a Docker container is 64 MB, which is obviously not enough for training deep-learning image-classification networks, but I am unsure of the best way to fix the problem.
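For what it's worth, you can verify the allocation from inside the container with a couple of lines of Python (assuming Python is available in the image):

    import shutil

    # /dev/shm is the shared-memory mount the worker processes use;
    # with Docker's default settings this reports roughly 64 MiB.
    total, used, free = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")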

  
  
Posted 4 years ago

Pffff security.

Data scientist be like....... 😀

Network infrastructure person be like ...... 😱

  
  
Posted 4 years ago

In my case it's a Tesla P40, which has 24 GB VRAM.

  
  
Posted 4 years ago

Sure thing, my pleasure 🙂

  
  
Posted 4 years ago

Does "--ipc=host" make it a dynamic allocation then?

  
  
Posted 4 years ago

LOL can't wait to see that

  
  
Posted 4 years ago

Hmm, good question. I'm actually not sure if you can pass 24 GB (this is not a limit on the GPU memory; it affects the memblock size, I think)

  
  
Posted 4 years ago

If I did that, I am pretty sure that's the last thing I'd ever do...... 🤣

  
  
Posted 4 years ago

I'll just take a screenshot from my company's daily standup of data scientists and software developers..... that'll be enough!

  
  
Posted 4 years ago

LOL I see a meme waiting for GrumpyPenguin23 😉

  
  
Posted 4 years ago

Oh, so this applies to VRAM, not RAM?

  
  
Posted 4 years ago

Yes 🙂 https://discuss.pytorch.org/t/shm-error-in-docker/22755
Add either "--ipc=host" or "--shm-size=8g" to the docker args (on the Task, or globally via extra_docker_arguments in clearml.conf).
Notice that the 8g value depends on the GPU.
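If you want to set it per Task from code, something like this should work (a sketch assuming a recent clearml SDK where Task.set_base_docker accepts docker_arguments; the image and project/task names are placeholders):

    from clearml import Task

    task = Task.init(project_name="examples", task_name="conda-docker-training")

    # Ask the agent to append these flags to its `docker run` command
    # for this task only. The image name is a placeholder.
    task.set_base_docker(
        docker_image="my-conda-image:latest",
        docker_arguments=["--ipc=host"],  # or ["--shm-size=8g"]
    )

The per-task route keeps the flag scoped to the jobs that actually need the extra shared memory, instead of granting --ipc=host to everything the agent runs.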

  
  
Posted 4 years ago