Answered
Hey I’m running this script and initialise the ClearML task also in this file

Hey I’m running this script and initialise the ClearML task also in this file: https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
This script then calls https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py , where multiprocessing.set_start_method('forkserver') is called. Unfortunately I get the following error. I tried to fix it using force=True, but that caused the ClearML task to stop. Any idea how to fix it? Thanks.

Task fails when using multiprocessing.set_start_method('forkserver')
```
Traceback (most recent call last):
  File "scripts/pretrain.py", line 68, in <module>
    spawn_dist.run(args)  # Multiple GPU training (8 recommended)
  File "fastmri/spawn_dist.py", line 59, in run
    multiprocessing.set_start_method('forkserver')
  File "/usr/lib/python3.8/multiprocessing/context.py", line 243, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set
```
Task ends with status "completed" when using multiprocessing.set_start_method('forkserver', force=True)
```
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.
    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

      if __name__ == '__main__':
        freeze_support()
        ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Detected an exited process, so exiting main
terminating child processes
exiting
```
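For reference, the idiom the traceback refers to looks roughly like this (a minimal sketch; the main() function and the chosen start method are just placeholders, not taken from pretrain.py):

```python
import multiprocessing

def main():
    # training / spawning code goes here, never at module import time
    ...

if __name__ == "__main__":
    # With "spawn"/"forkserver" the child processes re-import this module,
    # so the guard keeps them from re-running the spawning code.
    multiprocessing.freeze_support()  # only needed for frozen executables
    multiprocessing.set_start_method("forkserver")
    main()
```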

  
  
Posted 2 years ago

Answers 20


Using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py :

```python
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)

    task.connect(args)
    print('Arguments: {}'.format(args))

    # only create the task, we will actually execute it later
    task.execute_remotely()

    spawn_dist.run(args)
```

I get this error:

```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```

This tensor size error is probably caused by my code and not by ClearML. Still, I wonder if it is normal behaviour that ClearML exits the experiment with status "completed" and not with a failure when a RuntimeError occurs in a child process.

  
  
Posted 2 years ago

I’m not sure if this was solved, but I am encountering a similar issue. From what I see, it all depends on which multiprocessing start method is used.
When using fork, ClearML works fine and is able to capture everything; however, fork is not recommended because it is not safe with multithreading (e.g. see None ).
With spawn and forkserver (which is used in the script above), ClearML is not able to automatically capture PyTorch scalars and artifacts. For spawn I think the reason is that a whole new Python process is started from scratch. For forkserver I am not sure of the reason, as it should inherit some of the parent process memory.
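To illustrate the difference, here is a minimal sketch using plain multiprocessing (the worker function is purely illustrative): with fork the children inherit the parent's memory, while spawn and forkserver give each worker a freshly started interpreter, which is why auto-logging hooked up in the parent does not carry over.

```python
import multiprocessing as mp
import os

def worker(rank):
    # Under "spawn" or "forkserver" this runs in a freshly started interpreter,
    # so objects created in the parent (e.g. a ClearML Task) are not inherited.
    print(f"worker {rank} running in pid {os.getpid()}")

if __name__ == "__main__":
    mp.set_start_method("forkserver")  # try "fork" / "spawn" to compare
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```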

Did anyone else encounter this issue in the meantime? Any solutions?

  
  
Posted one year ago

Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside if __name__ == "__main__":

Is this when you execute the code yourself, or when the agent runs it?
Also, what's the OS of your machine / agent?

  
  
Posted 2 years ago

This happens inside the agent, since I use task.execute_remotely(), I guess. The agent runs on Ubuntu 18.04 and not in docker mode.

  
  
Posted 2 years ago

Hey AgitatedDove14, I fixed my code issue and am now able to train on multiple GPUs using https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main process, I now can't see any training plots and probably also not the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging; I really enjoy the automatic detection.

  
  
Posted 2 years ago

My code now produces an error inside one of the child processes, but that should be an issue on my side. Still, this error inside a child process was not detected as a failure and the training task ended with status "completed". This happens now with Task.init inside the if __name__ == "__main__": guard, as seen in the code snippet above.

  
  
Posted 2 years ago

Still I wonder if it is normal behavior that clearml exits the experiments with status "completed" and not with failure

Well, that depends on the process exit code: if for some reason (not sure why) the process exits with return code 0, it means everything was okay.
I assume the line "Detected an exited process, so exiting main" is an internal print of your code, and I guess it just leaves the process with exit code 0.
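As a rough sketch of why the run can still end as "completed": if the parent simply joins the children and returns, its own exit code is 0 regardless of what happened in the workers. Propagating the child exit codes yourself would make the run end as failed (the names below are illustrative, not taken from spawn_dist.py):

```python
import multiprocessing as mp
import sys

def train(rank):
    # stand-in for the failing worker
    raise RuntimeError("stack expects each tensor to be equal size")

if __name__ == "__main__":
    mp.set_start_method("forkserver")
    procs = [mp.Process(target=train, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Without this check the parent returns normally (exit code 0) and the
    # task is marked "completed" even though every worker crashed.
    if any(p.exitcode != 0 for p in procs):
        sys.exit(1)
```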

  
  
Posted 2 years ago

Still this issue inside a child process was not detected as a failure and the training task resulted in "completed". This error happens now with the Task.init inside the if __name__ == "__main__": as seen above in the code snippet.

I'm not sure I follow; the error seems like an issue in your own code, so does that mean ClearML works as expected?

  
  
Posted 2 years ago

Hi ClumsyElephant70
What's the clearml version you are using?
(The first error is a by-product of a python process.Event being created before the forkserver is set up, some internal Python issue. I thought it was solved; let me take a look at the code you attached.)

  
  
Posted 2 years ago

Where did you add the Task.init call?

  
  
Posted 2 years ago

clearml_agent v1.0.0 and clearml v1.0.2

  
  
Posted 2 years ago

```python
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)

    task.connect(args)
    print('Arguments: {}'.format(args))

    # only create the task, we will actually execute it later
    task.execute_remotely()

    spawn_dist.run(args)
```

I added it to this script and use it as a starting point: https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
  
  
Posted 2 years ago

Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside if __name__ == "__main__":

  
  
Posted 2 years ago

I'm now running the code shown above and will let you know if there is still an issue.

  
  
Posted 2 years ago

```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```

  
  
Posted 2 years ago

I’m not sure if this was solved, but I am encountering a similar issue.

Yep, it was solved (I think in v1.7+).

With spawn and forkserver (which is used in the script above) ClearML is not able to automatically capture PyTorch scalars and artifacts.

The "trick" is to have Task.init before you spawn your code, then (since your code will not start from the same state), you should call Task.current_task(), which would basically make sure everything is monitored.
(unfortunately patching spawn is trickier than fork so currently it need to be done manually)
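A minimal sketch of that pattern, assuming torch.multiprocessing-style spawning similar to spawn_dist.py (project and task names are placeholders):

```python
from clearml import Task
import torch.multiprocessing as mp

def worker(rank):
    # Re-attach to the Task created in the parent so auto-logging
    # (console, scalars, output models) is hooked up in this fresh process.
    Task.current_task()
    print(f"worker {rank} attached")  # captured in the console log

if __name__ == "__main__":
    # Create the Task in the parent, before anything is spawned.
    Task.init(project_name="dummy", task_name="pretraining")
    mp.spawn(worker, nprocs=2)
```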

  
  
Posted one year ago

Thank you @<1523701205467926528:profile|AgitatedDove14> . I have Task.init() right at the beginning of the script (i.e. before multiprocessing), but I don’t have the Task.current_task() call, so maybe that would solve the issue. Where should that be? In the function that is parallelised? Or can it also be right after Task.init()?

  
  
Posted one year ago

Or can it also be right after Task.init()?

That would work as well 🙂

  
  
Posted one year ago

Hi @<1523701205467926528:profile|AgitatedDove14> ,
I can confirm that calling Task.current_task() makes ClearML log the console, models and scalars again 🙂

  
  
Posted one year ago

Thanks, I’ll try that and report back.

  
  
Posted one year ago