PricklyRaven28 basically this is the issue:
python -m fastai.launch <script>
There are multiple copies of the script running, but they are not aware of one another.
Are you getting any reporting from the different GPUs? I'm assuming there is a hidden OS environment variable that signals the "master" node, so all processes can communicate with it. This is what we should automatically capture. There is a workaround for fastai.launch that is probably similar to this one:
keep me posted when this is solved, so we can also update the fastai2 interface,
- There is a workaround for fastai.launch that is probably similar to this one:
I think you can do the launching "manually", something like:
https://github.com/allegroai/clearml/blob/fd2d6c6f5d46cad3e406e88eeb4d805455b5b3d8/examples/frameworks/pytorch/pytorch_distributed_example.py#L160
At least until we understand how to fix it automatically
still getting 4 tasks (1 does have data in results)
i get for one of the tasks, but then it fails because it seems that the fastai2 TensorBoardCallback isn’t fit for distributed training (which i’m opening an issue for them now)
Can you share how your code looks in general? from the start?
That’s how fastai distributes between multiple gpus, i’ll try to move the init
this
from fastai.callbacks.tensorboard import LearnerTensorboardWriter
doesn’t exist anymore in fastai2
Hmm we should definitely update the example to fastai2 API
Are you getting any scalars reported to clearml?
they also appear to be relying on the tensorboard callback which seems not to work on distributed training
Yes that is correct, usually the way it works is that all nodes report back to the "master" node, and that one performs the TB writes.
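To make the "only the master writes" idea concrete, here is a minimal sketch. It assumes the launcher exports a `RANK` environment variable to every worker (rank 0 being the master), which is an assumption about the setup rather than something confirmed in this thread. The helper names `get_rank`, `is_master` and `report_scalar` are hypothetical, and the `writer` is just a list standing in for a real TensorBoard writer.

```python
import os


def get_rank(env=None):
    """Read this process's rank; a missing RANK is treated as 0 (single process)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def is_master(env=None):
    """Rank 0 is treated as the "master" process (assumption)."""
    return get_rank(env) == 0


def report_scalar(writer, tag, value, step, env=None):
    """Append the scalar to `writer` only on the master; other ranks skip the write."""
    if is_master(env):
        writer.append((tag, value, step))
        return True
    return False
```

With a guard like this, each worker can call `report_scalar` unconditionally and only one copy of every scalar ends up written.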
PricklyRaven28 who is spinning the sub-processes in your example?
EDIT: sorry, found it: python -m fastai.launch <script>
reduced to a small snippet
` from fastai.vision.all import *
from fastai.distributed import *
from clearml import Task
from fastai.callback.tensorboard import TensorBoardCallback
from wwf.vision.timm import timm_learner
task = Task.init(project_name='LIOR_TEST', auto_connect_arg_parser={'rank': False})
path = untar_data(URLs.PETS)
size = 460
batch_size = 32
dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   get_y=lambda x: 'cat' if x.name[0].isupper() else 'dog',
                   item_tfms=Resize(size),
                   batch_tfms=aug_transforms(size=size))
dls = dblock.dataloaders(path, batch_size=batch_size)
learn = timm_learner(dls, 'tf_efficientnet_b3', metrics=[accuracy, Precision(average='micro'), Recall(average='micro')])
learn = learn.to_fp16()
with learn.distrib_ctx(sync_bn=False):
    learn.fine_tune(300, 1e-3, cbs=[TensorBoardCallback(trace_model=False)]) `
Interesting... AgitatedDove14 what do you think?
and use that instead of the -m fastai.launch part, of course
Yes that should work, the only thing is you need to call Task.init on the master process (and make sure you call Task.current_task() on the subprocesses if you want the automagic to kick in). That said, usually there is no need, they are supposed to report everything back to the main one anyhow.
basically
` @call_parse
def main(
    gpus:Param("The GPUs to use for distributed training", str)='all',
    script:Param("Script to run", str, opt=False)='',
    args:Param("Args to pass to script", nargs='...', opt=False)=''
):
    "PyTorch distributed training launch helper that spawns multiple distributed processes"
    task = Task.init(...)
    current_env = os.environ.copy() `
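For reference, the "manual launch" idea could look roughly like the sketch below. It is not the actual fastai.launch code: `build_worker_envs` and `launch` are hypothetical helpers, and the environment variable names (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) and the default address/port are assumptions based on the usual PyTorch distributed conventions. The ClearML call is left as a comment so the snippet stays self-contained.

```python
import os
import subprocess
import sys


def build_worker_envs(num_workers, base_env=None,
                      master_addr="127.0.0.1", master_port="29500"):
    """Return one environment dict per worker, each carrying its own RANK."""
    base = dict(os.environ if base_env is None else base_env)
    base["WORLD_SIZE"] = str(num_workers)
    base["MASTER_ADDR"] = master_addr
    base["MASTER_PORT"] = master_port
    envs = []
    for rank in range(num_workers):
        env = dict(base)  # copy so workers do not share one dict
        env["RANK"] = str(rank)
        envs.append(env)
    return envs


def launch(script, num_workers, script_args=()):
    """Spawn `num_workers` copies of `script` and return their exit codes."""
    # The parent process is the single place where Task.init(...) would go,
    # so that only one ClearML task is created (assumption, untested with fastai).
    procs = [
        subprocess.Popen([sys.executable, "-u", script, *script_args], env=env)
        for env in build_worker_envs(num_workers)
    ]
    return [p.wait() for p in procs]
```

Something like `launch("train.py", 4)` would then replace the `python -m fastai.launch <script>` call, with `train.py` standing in for the actual training script.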
but that means that there is no way to work with clearml+fastai2+multi gpu
but anyway, this will still not work because fastai’s tensorboard doesn’t work in multi gpu 😞
when you say use Task.current_task(), do you mean for logging? which i’m guessing that the fastai binding should do, right?
Right, this is a fancy way to say: make sure the actual sub-process is initializing ClearML so all the automagic kicks in. Since this is not "forked" but a whole new process, calling Task.current_task() is the equivalent of calling Task.init with the same arguments (which you can also do, I'm not sure which one is more straightforward, wdyt?)
How do you create the multi-process part? Is fastai forking the process, or do you manually run it 4 times?
you can get updates on the issue i opened
https://github.com/fastai/fastai/issues/3543
but i think the probably better solution would be to create a custom ClearML callback for fastai with the best practices you think are needed…
Or try to fix the TensorBoardCallback, because for now we can’t use multi gpu because of it 😪
AgitatedDove14 so if i understand correctly, what i can possibly do is copy-paste the https://github.com/fastai/fastai/blob/master/fastai/launch.py code and add the Task.init there?
unrelated, i’m getting a crash, but seems related to fastai and not clearml
Noting one difference: i’m using TensorBoardCallback, because i believe the clearml docs use an outdated fastai 1 version…
maybe the fastai bindings in clearml package are outdated
i’m following this guide
https://docs.fast.ai/distributed.html#Learner.distrib_ctx
so you run it like this: python -m fastai.launch <script>