This happens inside the agent, since I use `task.execute_remotely()`, I guess. The agent runs on Ubuntu 18.04 and not in Docker mode.
Using this code from https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py :
```python
from clearml import Task
# `args` and `spawn_dist` are defined earlier in the original pretrain.py

if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I get this error:
```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```
but this tensor size error is probably caused by my code and not by ClearML. Still, I wonder if it is normal behaviour that ClearML exits the experiment with status "completed" and not "failed" when a RuntimeError occurs in a child process.
Still, I wonder if it is normal behaviour that ClearML exits the experiment with status "completed" and not "failed"
Well, that depends on the process exit code: if for some reason (not sure why) the process exits with return code 0, it means everything was okay.
I assume the "Detected an exited process, so exiting main" line is an internal print of your code; I guess it then leaves the process with exit code 0.
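To make that concrete, here is a minimal sketch (plain standard-library multiprocessing, not your spawn_dist code; the worker and process count are made up for illustration) of propagating a child failure to the parent's exit code, so the agent sees a non-zero code and marks the task as failed instead of completed:

```python
import sys
import multiprocessing as mp


def train_worker(rank):
    # any uncaught exception here makes this child exit with a non-zero exitcode
    if rank == 1:
        raise RuntimeError("simulated failure in worker {}".format(rank))
    print("worker {} finished".format(rank))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=train_worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # if the launcher swallows child failures and returns normally, the parent
    # exits with code 0 and the task shows up as "completed"; propagating a
    # non-zero exit code makes the agent mark it as "failed" instead
    if any(p.exitcode != 0 for p in procs):
        sys.exit(1)
```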
My code now produces an error inside one of the child processes, but that should be an issue on my side. Still, this failure inside a child process was not detected as a failure and the training task ended up as "completed". This error now happens with the Task.init inside the `if __name__ == "__main__":` block, as seen above in the code snippet.
```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```
I'm now running the code shown above and will let you know if there is still an issue.
clearml_agent v1.0.0 and clearml v1.0.2
Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside `if __name__ == "__main__":`
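For reference, a minimal repro of that exact message, just to show it comes from Python's multiprocessing start-method handling rather than from ClearML itself (the chosen methods are arbitrary):

```python
import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("forkserver")
    try:
        # a second attempt to set the start method (e.g. by code that already
        # initialised multiprocessing at import time) raises the same error
        mp.set_start_method("spawn")
    except RuntimeError as e:
        print(e)  # -> context has already been set
```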
Is this when you execute the code yourself, or when the agent runs it?
Also, what's the OS of your machine / agent?
Still, this failure inside a child process was not detected as a failure and the training task ended up as "completed". This error now happens with the Task.init inside the `if __name__ == "__main__":` block, as seen above in the code snippet.
I'm not sure I follow, the error seems like an issue in your own code. Does that mean ClearML works as expected?
Thank you @<1523701205467926528:profile|AgitatedDove14>. I have Task.init() right at the beginning of the script (i.e. before multiprocessing), but I don't have the Task.current_task() call, so maybe that would solve the issue. Where should that go? In the function that is parallelised? Or can it also be right after Task.init()?
Hi ClumsyElephant70
What's the `clearml` version you are using?
(The first error is a by-product of a Python process Event being created before the forkserver is created, some internal Python issue. I thought it was solved, let me take a look at the code you attached.)
I’m not sure if this was solved, but I am encountering a similar issue.
Yep, it was solved (I think v1.7+)
With `spawn` and `forkserver` (which is used in the script above), ClearML is not able to automatically capture PyTorch scalars and artifacts.
The "trick" is to have Task.init before you spawn your code, then (since your code will not start from the same state), you should call Task.current_task(), which would basically make sure everything is monitored.
(unfortunately patching spawn is trickier than fork so currently it need to be done manually)
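A minimal sketch of that pattern, assuming standard-library multiprocessing with the `spawn` start method (the worker body, process count, and project/task names are placeholders taken from the snippet above):

```python
from clearml import Task
import multiprocessing as mp


def train_worker(rank):
    # the spawned interpreter starts from scratch, so re-attach it to the task
    # created in the parent; this restores console/scalar/model auto-logging
    task = Task.current_task()
    print("worker {} attached to task {}".format(rank, task.id))
    # ... training loop for this rank ...


if __name__ == "__main__":
    # create the task before spawning any workers
    task = Task.init(project_name="dummy", task_name="pretraining")
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=train_worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```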
Hey AgitatedDove14, I fixed my code issue and am now able to train on multiple GPUs using https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main process, I now can't see any training plots and probably also not the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging; I really enjoy the automatic detection.
```python
from clearml import Task
# `args` and `spawn_dist` are defined earlier in the original pretrain.py

if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point: https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
Hi @<1523701205467926528:profile|AgitatedDove14>,
I can confirm that calling Task.current_task() makes ClearML log the console, models and scalars again 🙂
Or can it also be right after Task.init()?
That would work as well 🙂
Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside `if __name__ == "__main__":`
I'm not sure if this was solved, but I am encountering a similar issue. From what I see, it all depends on which multiprocessing start method is used.
When using `fork`, ClearML works fine and is able to capture everything; however, `fork` is not recommended as it is not safe with multithreading (e.g. see None).
With `spawn` and `forkserver` (which is used in the script above), ClearML is not able to automatically capture PyTorch scalars and artifacts. For `spawn` I think the reason is that a whole new Python process is started from scratch. For `forkserver` I am not sure of the reason, as it should inherit some of the parent process memory.
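To make the start-method difference concrete, here is a small sketch of how the default is chosen and can be overridden (standard library only, Unix; the forced method here is just an example, not a fix):

```python
import multiprocessing as mp

if __name__ == "__main__":
    # platform default: "fork" on Linux, "spawn" on macOS (Python 3.8+) and Windows
    print(mp.get_start_method())
    # "fork" workers inherit the parent's memory (including any patching done
    # by ClearML), while "spawn"/"forkserver" workers start from a fresh interpreter
    mp.set_start_method("forkserver", force=True)
    print(mp.get_start_method())  # -> forkserver
```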
Did anyone else encounter this issue in the meantime? Any solutions?