Hey I’M Running This Script And Initialise The Clearml Task Also In This File

Answered

Hey I’m running this script and initialise the ClearML task also in this file https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py .
This script calls then https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py where multiprocessing.set_start_method('forkserver') is called. Unfortunately I get the following error, I tried to fix it using force=True but this caused the ClearML Task to stop. Any idea how to fix it? Thanks.

Task fails when using multiprocessing.set_start_method('forkserver')
Traceback (most recent call last): File "scripts/pretrain.py", line 68, in <module>   spawn_dist.run(args) # Multiple GPU training (8 recommended) File "fastmri/spawn_dist.py", line 59, in run   multiprocessing.set_start_method('forkserver') File "/usr/lib/python3.8/multiprocessing/context.py", line 243, in set_start_method   raise RuntimeError('context has already been set') RuntimeError: context has already been set
Task in mode completed when using multiprocessing.set_start_method('forkserver', force=True)
` RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.
    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

      if name == 'main':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.

Detected an exited process, so exiting main
terminating child processes
exiting `

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

Votes Newest

Answers 20

Hi ClumsyElephant70
What's the clearml you are using ?
(The first error is a by product of python process.Event created before a forkserver is created, some internal python issue. I thought it was solved, let me take a look at the code you attached)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Where did you add the Task.init call ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

clearml_agent v1.0.0 and clearml v1.0.2

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

` if name == "main":

task = Task.init(project_name="dummy",
             task_name="pretraining",
             task_type=Task.TaskTypes.training,
             reuse_last_task_id=False)

task.connect(args)
print('Arguments: {}'.format(args))

# only create the task, we will actually execute it later
task.execute_remotely()

spawn_dist.run(args) `I added it to this script and use it as a starting point   https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside if __name__ == "__main__":

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

I'm running now the the code shown above and will let you know if there is still an issue

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

Actually I saw that the

RuntimeError: context has already been set

appears when the task is initialised outside

if name == "main":

Is this when you execute the code or when the agent ?
Also what's the OS of your machine/ agent ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

This happens inside the agent, since I use task.execute_remotely() I guess. The agent runs on ubuntu 18.04 and not in docker mode

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

My code produces now an error inside one of the threads, but that should be an issue on my side. Still this issue inside a child thread was not detected as failure and the training task resulted in "completed". This error happens now with the Task.init inside the if __name__ == "__main__": as seen above in the code snippet.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

Still this issue inside a child thread was not detected as failure and the training task resulted in "completed". This error happens now with the Task.init inside the

if name == "main":

as seen above in the code snippet.

I'm not sure I follow, the error seems like your internal code issue, does that means clearml works as expected ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
` if name == "main":

task = Task.init(project_name="dummy",
             task_name="pretraining",
             task_type=Task.TaskTypes.training,
             reuse_last_task_id=False)

task.connect(args)
print('Arguments: {}'.format(args))

# only create the task, we will actually execute it later
task.execute_remotely()

spawn_dist.run(args) `I get this error

RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1 Detected an exited process, so exiting main terminating child processes exitingbut this tensor size error is probably caused by my code and not clearml. Still I wonder if it is normal behaviour that clearml exits the experiments with status "completed" and not with failure, if a RuntimeError occurs in a child process

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

Still I wonder if it is normal behavior that clearml exits the experiments with status "completed" and not with failure

Well that depends on the process exit code, if for some reason (not sure why) the process exits with return code 0, it means everything was okay.
I assume this is doing something "Detected an exited process, so exiting main" this is an internal print of your code, I guess it just leaves the process with exitcode 0

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hey AgitatedDove14 , I fixed my code issue and are now able to train on multiple gpus using the https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main thread I now can't see any training plots and probably also not the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging. I really enjoy the automatic detection

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

I’m not sure if this was solved, but I am encountering a similar issue. From what I see it all depends on what multiprocessing start method is used.
When using fork ClearML works fine and it’s able to capture everything, however it is not recommended to use fork as it is not safe with multithreading (e.g. see None ).
With spawn and forkserver (which is used in the script above) ClearML is not able to automatically capture PyTorch scalars and artifacts. For what concerns spawn I think the reason is that a whole new python process is started from scratch. For forkserver I am not sure on the reason, as it should inherit some of the parent process memory.

Did anyone else encounter this issue in the meantime? Any solutions?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					CloudyArcticwolf80
				
					0

I’m not sure if this was solved, but I am encountering a similar issue.

Yep, it was solved (I think v1.7+)

With

spawn

and

forkserver

(which is used in the script above) ClearML is not able to automatically capture PyTorch scalars and artifacts.

The "trick" is to have Task.init before you spawn your code, then (since your code will not start from the same state), you should call Task.current_task(), which would basically make sure everything is monitored.
(unfortunately patching spawn is trickier than fork so currently it need to be done manually)

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thank you @<1523701205467926528:profile|AgitatedDove14> . I have Task.init() right at the beginning of the script (i.e. before multiprocessing), but I don’t have the Task.urrent_task() call, so maybe that would solve the issue. Where should that be? In the function that is parallelised? Or can it also be right after Task.init() ?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					CloudyArcticwolf80
				
					0

Or can it also be right after

Task.init()

?

That would work as well 🙂

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thanks, I’ll try that and report back.

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					CloudyArcticwolf80
				
					0

Hi @<1523701205467926528:profile|AgitatedDove14> ,
I can confirm that calling Task.current_task() makes ClearML log the console, models and scalars again 🙂

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					CloudyArcticwolf80
				
					0

Write your answer

2K Views

20 Answers

4 years ago

2 years ago