Hey Everyone

They also appear to be relying on the TensorBoard callback, which doesn’t seem to work with distributed training.

  				
Posted 3 years ago by PricklyRaven28

PricklyRaven28 basically this is the issue:

python -m fastai.launch <script>

There are multiple copies of the script running, but they are not aware of one another.
Are you getting any reporting from the different GPUs? I'm assuming there is a hidden OS environment variable that signals the "master" node, so all processes can communicate with it. This is what we should capture automatically. There is a workaround for fastai.launch, that is probably similar to this one:

  				
Posted 3 years ago by AgitatedDove14

When you say use Task.current_task(), do you mean for logging? I’m guessing that’s something the fastai binding should do, right?

  				
Posted 3 years ago by PricklyRaven28

because fastai’s tensorboard doesn’t work in multi gpu

Keep me posted when this is solved, so we can also update the fastai2 interface.

  				
Posted 3 years ago by AgitatedDove14

There is a workaround for fastai.launch, that is probably similar to this one:

I think you can do the launching "manually", something like:
https://github.com/allegroai/clearml/blob/fd2d6c6f5d46cad3e406e88eeb4d805455b5b3d8/examples/frameworks/pytorch/pytorch_distributed_example.py#L160
At least until we understand how to fix it automatically

  				
Posted 3 years ago by AgitatedDove14

still getting 4 tasks (1 does have data in results)

  				
Posted 3 years ago by PricklyRaven28

I get them for one of the tasks, but then it fails because it seems that the fastai2 TensorBoardCallback isn’t fit for distributed training (I’m opening an issue with them now).

  				
Posted 3 years ago by PricklyRaven28

Can you share how your code looks in general? from the start?

  				
Posted 3 years ago by SuccessfulKoala55

That’s how fastai distributes between multiple gpus, i’ll try to move the init

  				
Posted 3 years ago by PricklyRaven28

i’ll check if it’s already initialized

  				
Posted 3 years ago by PricklyRaven28

this

from fastai.callbacks.tensorboard import LearnerTensorboardWriter

doesn’t exist anymore in fastai2

Hmm we should definitely update the example to fastai2 API

maybe the fastai bindings in clearml package are outdated

Are you getting any scalars reported to clearml?

They also appear to be relying on the TensorBoard callback, which doesn’t seem to work with distributed training.

Yes, that is correct. Usually the way it works is that all nodes report back to the "master" node, and that one performs the TB writes.
PricklyRaven28 who is spinning up the sub-processes in your example?
EDIT: sorry, found it:
python -m fastai.launch <script>
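
The "master performs the writes" pattern described above can be sketched as follows. This is only an illustration, assuming the launcher exports `RANK` and `WORLD_SIZE` environment variables to each spawned process (as torch-style launchers commonly do); the helper name `distributed_info` is hypothetical and is not part of fastai or ClearML.

```python
import os


def distributed_info(env=None):
    """Read rank and world size from the environment.

    Assumption: the launcher exports RANK and WORLD_SIZE. When they are
    missing, fall back to a single-process setup (rank 0 of 1).
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {"rank": rank, "world_size": world_size, "is_master": rank == 0}


if __name__ == "__main__":
    info = distributed_info()
    if info["is_master"]:
        # Only this process would write TensorBoard events / report scalars.
        print("master process: reporting enabled")
    else:
        print(f"worker {info['rank']}: reporting skipped")
```

Each copy of the script can call this at startup and decide whether it should attach a reporting callback at all.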

  				
Posted 3 years ago by AgitatedDove14

TimelyPenguin76

  				
Posted 3 years ago by PricklyRaven28

reduced to a small snippet
`from fastai.vision.all import *
from fastai.distributed import *
from clearml import Task
from fastai.callback.tensorboard import TensorBoardCallback
from wwf.vision.timm import timm_learner

task = Task.init(project_name='LIOR_TEST', auto_connect_arg_parser={'rank': False})
path = untar_data(URLs.PETS)

size = 460
batch_size = 32

dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   get_y=lambda x: 'cat' if x.name[0].isupper() else 'dog',
                   item_tfms=Resize(size),
                   batch_tfms=aug_transforms(size=size))

dls = dblock.dataloaders(path, batch_size=batch_size)

learn = timm_learner(dls, 'tf_efficientnet_b3', metrics=[accuracy, Precision(average='micro'), Recall(average='micro')])
learn = learn.to_fp16()

with learn.distrib_ctx(sync_bn=False):
    learn.fine_tune(300, 1e-3, cbs=[TensorBoardCallback(trace_model=False)])`

  				
Posted 3 years ago by PricklyRaven28

this
from fastai.callbacks.tensorboard import LearnerTensorboardWriter
doesn’t exist anymore in fastai2

  				
Posted 3 years ago by PricklyRaven28

any idea? 🙏

  				
Posted 3 years ago by PricklyRaven28

Interesting... AgitatedDove14 what do you think?

  				
Posted 3 years ago by SuccessfulKoala55

and use that instead of the -m fastai.launch part, of course

  				
Posted 3 years ago by PricklyRaven28

https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_distributed_example.py

  				
Posted 3 years ago by SuccessfulKoala55

Yes, that should work. The only thing is that you need to call Task.init on the master process (and make sure you call Task.current_task() on the subprocesses if you want the automagic to kick in). That said, usually there is no need, since they are supposed to report everything back to the main one anyhow.
Basically:
@call_parse
def main(
    gpus:Param("The GPUs to use for distributed training", str)='all',
    script:Param("Script to run", str, opt=False)='',
    args:Param("Args to pass to script", nargs='...', opt=False)=''
):
    "PyTorch distributed training launch helper that spawns multiple distributed processes"
    task = Task.init(...)
    current_env = os.environ.copy()
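
For readers who want to see the shape of such a "manual" launcher, here is a minimal stdlib-only sketch. It is not the fastai.launch source: the function names `build_worker_envs` and `launch` are hypothetical, and the environment variable names (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) and default port are assumptions based on common torch-style launchers. The ClearML call is left as a comment so the sketch runs without ClearML installed.

```python
import os
import subprocess
import sys


def build_worker_envs(num_procs, base_env=None, master_addr="127.0.0.1", master_port=29500):
    """Return one environment dict per worker, each carrying its own RANK."""
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for rank in range(num_procs):
        env = dict(base)  # copy so workers do not share one mutable dict
        env.update({
            "WORLD_SIZE": str(num_procs),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "RANK": str(rank),
        })
        envs.append(env)
    return envs


def launch(script, script_args=(), num_procs=1):
    """Spawn `num_procs` copies of `script` and return their exit codes."""
    # Per the suggestion in the thread, Task.init(...) would be called here,
    # once, in the launcher process, before any worker is spawned.
    procs = [
        subprocess.Popen([sys.executable, "-u", script, *script_args], env=env)
        for env in build_worker_envs(num_procs)
    ]
    return [p.wait() for p in procs]
```

Something like `launch("train.py", num_procs=4)` (with `train.py` standing in for your own script) would then replace the `python -m fastai.launch <script>` invocation.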

  				
Posted 3 years ago by AgitatedDove14

but that means that there is no way to work with clearml+fastai2+multi gpu

  				
Posted 3 years ago by PricklyRaven28

but anyway, this will still not work because fastai’s tensorboard doesn’t work in multi gpu 😞

  				
Posted 3 years ago by PricklyRaven28

When you say use Task.current_task(), do you mean for logging? I’m guessing that’s something the fastai binding should do, right?

Right, this is a fancy way to say: make sure the actual sub-process is initializing ClearML so all the automagic kicks in. Since this is not "forked" but a whole new process, calling Task.current_task is the equivalent of calling Task.init with the same arguments (which you can also do; I'm not sure which one is more straightforward, wdyt?)

  				
Posted 3 years ago by AgitatedDove14

How do you create the multi-process part? Is fastai forking the process, or do you manually run it 4 times?

  				
Posted 3 years ago by SuccessfulKoala55

you can get updates on the issue i opened
https://github.com/fastai/fastai/issues/3543

but I think the better solution would probably be to create a custom ClearML callback for fastai with the best practices you think are needed…

Or try to fix the TensorBoardCallback, because for now we can’t use multi gpu because of it 😪
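
As a rough idea of what a rank-aware callback could look like, here is a framework-free sketch. It does not use fastai's real `Callback` base class or ClearML's API: the class `MasterOnlyMetricsCallback`, its `after_epoch` signature, and the `RANK` environment variable are all assumptions made for illustration. The `report` argument is any callable taking `(title, series, value, iteration)`; in a ClearML setup it might wrap the logger's scalar reporting, but that wiring is not shown here.

```python
import os


class MasterOnlyMetricsCallback:
    """Hypothetical callback that forwards epoch metrics only from rank 0."""

    def __init__(self, report, env=None):
        env = os.environ if env is None else env
        # Assumption: the launcher exports RANK; default to 0 for single-GPU runs.
        self.rank = int(env.get("RANK", "0"))
        self.report = report  # callable(title, series, value, iteration)

    def after_epoch(self, epoch, metrics):
        """Report each metric once per epoch; return how many were reported."""
        if self.rank != 0:
            return 0  # non-master processes stay silent
        for name, value in metrics.items():
            self.report("metrics", name, float(value), epoch)
        return len(metrics)
```

Usage would be along the lines of `cb = MasterOnlyMetricsCallback(report=my_report_fn)` followed by `cb.after_epoch(epoch, {"accuracy": acc})` at the end of each epoch, so only one process ever emits scalars.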

  				
Posted 3 years ago by PricklyRaven28

AgitatedDove14, so if I understand correctly, what I can possibly do is copy-paste the https://github.com/fastai/fastai/blob/master/fastai/launch.py code and add the Task.init there?

  				
Posted 3 years ago by PricklyRaven28

unrelated, i’m getting a crash, but seems related to fastai and not clearml

  				
Posted 3 years ago by PricklyRaven28

Note that one thing I do differently is use TensorBoardCallback, because I believe the ClearML docs use an outdated fastai 1 version…

  				
Posted 3 years ago by PricklyRaven28

maybe the fastai bindings in clearml package are outdated

  				
Posted 3 years ago by PricklyRaven28

https://github.com/allegroai/clearml/blob/master/examples/frameworks/fastai/fastai_with_tensorboard.py

  				
Posted 3 years ago by PricklyRaven28

i’m following this guide
https://docs.fast.ai/distributed.html#Learner.distrib_ctx

so you run it like this
python -m fastai.launch <script>

  				
Posted 3 years ago by PricklyRaven28

Answers 30