If nothing specific comes to mind, I can try to create some reproducible demo code (after the holiday vacation).
Yes please! 🙏
In the meantime, see if the workaround is a valid one.
How does this work in the context of a pipeline? One of the steps is a multi-GPU training that requires accelerate.
If nothing specific comes to mind, I can try to create some reproducible demo code (after the holiday vacation).
@<1523701949617147904:profile|PricklyRaven28> Can you please try clearml==1.16.2rc0? We have released a fix that will hopefully solve your problem.
We used subprocess for it, ...
Popen? os.system? fork?
How does this work in the context of a pipeline?
Is your pipeline from functions / decorators, or is it from Tasks?
(If it is from Tasks, then just change the entry point in the overrides.)
In case of functions or decorators, you have to do that manually (i.e. your function needs to do "accelerate launch" itself):
from accelerate.commands.launch import launch_command, launch_command_parser
parser = launch_command_parser()
args = parser.parse_args("-command -here".split())
launch_command(args)
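A minimal sketch of the in-process launch above, assuming the script is called train.py and takes its own flags (both placeholders). The helper names `build_launch_args` and `launch_in_process` are hypothetical; only `launch_command`, `launch_command_parser` and the `--num_processes` flag come from accelerate. Passing an explicit list to `parse_args` means argparse does not read `sys.argv`, which may or may not be related to the argparse issue mentioned in this thread.

```python
def build_launch_args(script, num_processes, script_args=()):
    """Build the argument list that would follow `accelerate launch` on the CLI.

    `script` and `script_args` are placeholders for your own training script.
    """
    return ["--num_processes", str(num_processes), script, *script_args]


def launch_in_process(argv):
    """Run `accelerate launch <argv>` from Python instead of a shell.

    accelerate is imported lazily so the module loads without it installed.
    """
    from accelerate.commands.launch import launch_command, launch_command_parser

    parser = launch_command_parser()
    # An explicit list is parsed as-is; sys.argv is not consulted.
    args = parser.parse_args(argv)
    launch_command(args)


if __name__ == "__main__":
    # Hypothetical usage: two processes, one extra flag for the script.
    print(build_launch_args("train.py", 2, ["--epochs", "1"]))
```

Inside a decorated pipeline step, the step function would call `launch_in_process(build_launch_args(...))` instead of splitting a command string.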
It's with decorators.
Interesting, I wasn't aware of this Python module for executing accelerate. I'll try to use that.
We used subprocess for it, but for some reason, only when invoked in the pipeline, the process freezes and doesn't close the main accelerate process. It works fine outside of ClearML. Any idea?
Hi @<1523701949617147904:profile|PricklyRaven28>
Sorry, we missed that one
we need to invoke it with accelerate launch, so we use subprocess.run
So you have two options: either you change the script entry of the Task from your "script.py" to "-m accelerate launch script.py",
or you manually do that inside your entry point (i.e. call accelerate launch).
BTW, I "think" we added an "auto detect" for it, so that if you launched it manually this way it will know to register it as "-m accelerate launch ...".
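For the second option (calling accelerate launch manually from the entry point), a minimal sketch with `subprocess.run` could look like the following. The helper names `build_accelerate_command` and `run_command` and the script name train.py are hypothetical. The optional timeout is an assumption added for debugging hangs: it turns a frozen launcher into a `subprocess.TimeoutExpired` error, but it is not a fix for the hang, and worker processes spawned by the launcher may need separate cleanup.

```python
import subprocess
import sys


def build_accelerate_command(script, num_processes, script_args=()):
    """Return the `accelerate launch` command as a list (no shell involved)."""
    return ["accelerate", "launch", "--num_processes", str(num_processes),
            script, *script_args]


def run_command(cmd, timeout=None):
    """Run the command, wait for it to exit, and return its exit code.

    If `timeout` (seconds) is set and the process is still running,
    subprocess.run kills it and raises subprocess.TimeoutExpired.
    """
    completed = subprocess.run(cmd, check=False, timeout=timeout)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command so the sketch runs without accelerate installed.
    print(run_command([sys.executable, "-c", "pass"], timeout=30))
    # Real usage (hypothetical script name):
    # run_command(build_accelerate_command("train.py", 2))
```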
Glad to hear you were able to reproduce it! Waiting for your reply 🙏
Hi @<1523701435869433856:profile|SmugDolphin23>
Confirming that rank0 process does not hang with the new version!
The accelerate CLI problem does still reproduce though (it's in my demo)
@<1523701949617147904:profile|PricklyRaven28> thank you for the feedback. We will investigate this further
@<1523701205467926528:profile|AgitatedDove14>
I only got some time to work on it now; I created a small reproducible example.
I also tried your suggestion of importing accelerate, and it also had issues.
Overall, when using debug_pipeline it works OK, but both methods don't work without it. I think it has something to do with wrapping accelerate.
Problem with launching through the Python module (your suggestion): the argparse breaks.
Problem with launching using a new process: the rank0 process hangs and never finishes.
Both work fine with debug_pipeline
To make it very reproducible, I created a Dockerfile for it, so make sure to run build_docker.sh and then run.sh
Thank you @<1523701949617147904:profile|PricklyRaven28> !!!
Let me see if we can reproduce and how to solve it
Hi @<1523701949617147904:profile|PricklyRaven28> ! Thank you for the example. We managed to reproduce. We will investigate further to figure out the issue
@<1523701435869433856:profile|SmugDolphin23> @<1523701205467926528:profile|AgitatedDove14>
Any updates? 🙂