Answered
Hey Everyone

Hey everyone 🙂
I'm trying to use ClearML on-prem for experiment visualization only, and I'm having some issues with multi-GPU.

It seems that ClearML is creating a different task for each GPU (in my case 4), and none of them have data in the "results" tab.

Any idea?

  
  
Posted 2 years ago

Answers 30


because fastai's tensorboard doesn't work in multi gpu

keep me posted when this is solved, so we can also update the fastai2 interface,

  
  
Posted 2 years ago

Can you share how your code looks in general? from the start?

  
  
Posted 2 years ago

maybe the fastai bindings in clearml package are outdated

  
  
Posted 2 years ago

Yes, that should work. The only thing is you need to call Task.init on the master process (and make sure you call Task.current_task() on the subprocesses if you want the automagic to kick in). That said, usually there is no need; they are supposed to report everything back to the main one anyhow.
Basically:
@call_parse
def main(
    gpus:Param("The GPUs to use for distributed training", str)='all',
    script:Param("Script to run", str, opt=False)='',
    args:Param("Args to pass to script", nargs='...', opt=False)=''
):
    "PyTorch distributed training launch helper that spawns multiple distributed processes"
    task = Task.init(...)
    current_env = os.environ.copy()
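
A minimal sketch of that idea, for illustration only: initialize ClearML once in the launcher, then spawn one copy of the training script per worker. This is not fastai's launch.py. The helper names (`build_worker_envs`, `launch`), the project and task names, and the address and port values are all hypothetical. The variable names RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT follow the usual torch.distributed environment convention. GPU assignment per worker is left out.

```python
import os
import subprocess
import sys


def build_worker_envs(num_workers, base_env=None):
    """Return one environment dict per worker process.

    RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT follow the torch.distributed
    environment-variable convention; the address and port are placeholders.
    """
    base_env = dict(os.environ if base_env is None else base_env)
    envs = []
    for rank in range(num_workers):
        env = dict(base_env)  # each worker gets its own copy
        env["WORLD_SIZE"] = str(num_workers)
        env["RANK"] = str(rank)
        env["MASTER_ADDR"] = "127.0.0.1"
        env["MASTER_PORT"] = "29500"
        envs.append(env)
    return envs


def launch(script, script_args=(), num_workers=1):
    """Call Task.init once, then spawn `num_workers` copies of `script`."""
    # Imported lazily so build_worker_envs has no third-party dependency.
    from clearml import Task

    # Placeholder names. Task.init runs BEFORE the environment is copied,
    # mirroring the order in the snippet above, so the workers inherit
    # whatever the launcher process has in os.environ at that point.
    Task.init(project_name="examples", task_name="distributed launch")

    procs = [
        subprocess.Popen([sys.executable, "-u", script, *script_args], env=env)
        for env in build_worker_envs(num_workers)
    ]
    # Wait for every worker; return the first non-zero exit code, else 0.
    exit_codes = [p.wait() for p in procs]
    return next((code for code in exit_codes if code != 0), 0)
```

Under these assumptions, `launch("train.py", num_workers=4)` would stand in for `python -m fastai.launch train.py`, with the training script calling Task.current_task() instead of Task.init.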

  
  
Posted 2 years ago

i get for one of the tasks, but then it fails because it seems that the fastai2 TensorBoardCallback isn't fit for distributed training (i'm opening an issue with them about it now)

  
  
Posted 2 years ago

reduced to a small snippet
` from fastai.vision.all import *
from fastai.distributed import *
from clearml import Task
from fastai.callback.tensorboard import TensorBoardCallback
from wwf.vision.timm import timm_learner

task = Task.init(project_name='LIOR_TEST', auto_connect_arg_parser={'rank': False})
path = untar_data(URLs.PETS)

size = 460
batch_size = 32

dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   get_y=lambda x: 'cat' if x.name[0].isupper() else 'dog',
                   item_tfms=Resize(size),
                   batch_tfms=aug_transforms(size=size))

dls = dblock.dataloaders(path, batch_size=batch_size)

learn = timm_learner(dls, 'tf_efficientnet_b3', metrics=[accuracy, Precision(average='micro'), Recall(average='micro')])
learn = learn.to_fp16()

with learn.distrib_ctx(sync_bn=False):
    learn.fine_tune(300, 1e-3, cbs=[TensorBoardCallback(trace_model=False)]) `

  
  
Posted 2 years ago

  1. There is a workaround for fastai.launch that is probably similar to this one:

I think you can do the launching "manually", something like:
https://github.com/allegroai/clearml/blob/fd2d6c6f5d46cad3e406e88eeb4d805455b5b3d8/examples/frameworks/pytorch/pytorch_distributed_example.py#L160
At least until we understand how to fix it automatically

  
  
Posted 2 years ago

and use that instead of the -m fastai.launch part, of course

  
  
Posted 2 years ago

still getting 4 tasks (1 does have data in results)

  
  
Posted 2 years ago

TimelyPenguin76

  
  
Posted 2 years ago

they also appear to be relying on the tensorboard callback which seems not to work on distributed training

  
  
Posted 2 years ago

PricklyRaven28 basically this is the issue:

python -m fastai.launch <script>

There are multiple copies of the script running, but they are not aware of one another.
Are you getting any reporting from the different GPUs? I'm assuming there is a hidden OS environment variable that signals the "master" node, so all processes can communicate with it. This is what we should automatically capture. There is a workaround for fastai.launch that is probably similar to this one:

  
  
Posted 2 years ago

unrelated, i'm getting a crash, but seems related to fastai and not clearml

  
  
Posted 2 years ago

i'll check if it's already initialized

  
  
Posted 2 years ago

Noting one difference: i'm using TensorBoardCallback, because i believe the clearml docs use an outdated fastai 1 version…

  
  
Posted 2 years ago

when u say use Task.current_task() you mean for logging? which i'm guessing that the fastai binding should do right?

  
  
Posted 2 years ago

but anyway, this will still not work because fastai's tensorboard doesn't work in multi gpu 😞

  
  
Posted 2 years ago

but that means that there is no way to work with clearml+fastai2+multi gpu

  
  
Posted 2 years ago

That's how fastai distributes between multiple gpus, i'll try to move the init

  
  
Posted 2 years ago

you can get updates on the issue i opened
https://github.com/fastai/fastai/issues/3543

but i think a probably better solution would be to create a custom ClearML callback for fastai with the best practices you think are needed…

Or try to fix the TensorBoardCallback, because for now we can't use multi gpu because of it 😪

  
  
Posted 2 years ago

i'm following this guide
https://docs.fast.ai/distributed.html#Learner.distrib_ctx

so you run it like this
python -m fastai.launch <script>

  
  
Posted 2 years ago

AgitatedDove14 so if i understand correctly, what i can possibly do is copy-paste the https://github.com/fastai/fastai/blob/master/fastai/launch.py code and add the Task.init there?

  
  
Posted 2 years ago

any idea? 🙏

  
  
Posted 2 years ago

Interesting... AgitatedDove14 what do you think?

  
  
Posted 2 years ago

when u say use Task.current_task() you mean for logging? which i'm guessing that the fastai binding should do right?

Right, this is a fancy way to say: make sure the actual sub-process is initializing ClearML so all the automagic kicks in. Since this is not "forked" but a whole new process, calling Task.current_task is the equivalent of calling Task.init with the same arguments (which you can also do; I'm not sure which one is more straightforward, wdyt?)

  
  
Posted 2 years ago

this

from fastai.callbacks.tensorboard import LearnerTensorboardWriter

doesn't exist anymore in fastai2

Hmm we should definitely update the example to fastai2 API

maybe the fastai bindings in clearml package are outdated

Are you getting any scalars reported to clearml?

they also appear to be relying on the tensorboard callback which seems not to work on distributed training

Yes, that is correct. Usually the way it works is that all nodes report back to the "master" node, and that one performs the TB writes.
PricklyRaven28 who is spinning the sub-processes in your example?
EDIT: sorry found it:
python -m fastai.launch <script>
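
The "only the master writes" pattern can be sketched roughly as below. This is an illustration under assumptions, not fastai's or ClearML's implementation. It assumes the launcher exports a RANK environment variable to each process (the torch.distributed convention). The helper names `process_rank` and `callbacks_for_this_process` are hypothetical.

```python
import os


def process_rank(env=None):
    """Rank of this process; 0 when RANK is unset (single-process run)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def callbacks_for_this_process(logging_callbacks, other_callbacks=(), env=None):
    """Keep logging callbacks only on rank 0; other callbacks run on every rank."""
    cbs = list(other_callbacks)
    if process_rank(env) == 0:
        # Only the "master" process gets the callbacks that write logs.
        cbs.extend(logging_callbacks)
    return cbs
```

With the snippet from this thread, one could try `cbs=callbacks_for_this_process([TensorBoardCallback(trace_model=False)])` in the `fine_tune` call. Whether that avoids the crash discussed here has not been verified.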

  
  
Posted 2 years ago

How do you create the multi-process part? Is fastai forking the process, or do you manually run it 4 times?

  
  
Posted 2 years ago

this
from fastai.callbacks.tensorboard import LearnerTensorboardWriter
doesn't exist anymore in fastai2

  
  
Posted 2 years ago