
Hi, yes it's running with the autoscaler, so it's for sure in docker mode.
Are you saying that it should've worked? I got a "'docker' attribute doesn't exist" error. Maybe it's the version of the ClearML server?
I didn't; I prefer not to add temporary workarounds.
@<1523701118159294464:profile|ExasperatedCrab78>
Hey again 🙂
I believe the transformers patch wasn't released yet, right? We're running into a problem where we need new features from transformers but can't use them because of this.
This is the next step not being able to find the output of the previous step:
ValueError: Could not retrieve a local copy of artifact return_object, failed downloading
@<1523701435869433856:profile|SmugDolphin23> @<1523701087100473344:profile|SuccessfulKoala55> Yes, the second issue still persists; it's currently breaking our pipeline.
I'm currently on vacation, so I'll ask my teammates. If not, I'll get to it next week.
@<1523701118159294464:profile|ExasperatedCrab78>
Here is an example that reproduces the second error:
from clearml.automation import PipelineDecorator
from clearml import TaskTypes


@PipelineDecorator.component(task_type=TaskTypes.data_processing, cache=True)
def run_demo():
    from transformers import AutoTokenizer, DataCollatorForTokenClassification, AutoModelForSequenceClassification, TrainingArguments, Trainer
    from datasets import load_dataset
    import numpy as np
    import ...
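For context, a minimal sketch of the pipeline shape I mean, with two components where the second consumes the return value of the first (prepare_data/train and their contents are placeholders, not our real code):

from clearml.automation import PipelineDecorator
from clearml import TaskTypes


@PipelineDecorator.component(task_type=TaskTypes.data_processing, cache=True)
def prepare_data():
    # The returned object becomes the step's "return_object" artifact.
    return {"texts": ["hello", "world"], "labels": [0, 1]}


@PipelineDecorator.component(task_type=TaskTypes.training)
def train(dataset):
    # ClearML downloads the previous step's artifact here; this is where the
    # "Could not retrieve a local copy of artifact return_object" error shows up.
    print(len(dataset["texts"]))


@PipelineDecorator.pipeline(name="repro", project="debug", version="0.0.1")
def pipeline_logic():
    dataset = prepare_data()
    train(dataset=dataset)


if __name__ == "__main__":
    # debug_pipeline() runs everything in-process and works for us;
    # run_locally() / remote execution is where the artifact download fails.
    PipelineDecorator.run_locally()
    pipeline_logic()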
TimelyMouse69
Thanks for the reply. This is only regarding automatic logging, where I want to disable logging altogether (avoiding the task being added to the UI).
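For what it's worth, the kind of switch-off I mean looks roughly like this (just a sketch, assuming the automatic task comes from the transformers Trainer integration; report_to controls which integration callbacks get attached):

from transformers import TrainingArguments

# Assumption: the ClearML task is created by the Trainer's integration callback.
# An empty report_to (or report_to="none") keeps Trainer from attaching any
# integration callbacks, so no task should show up in the UI.
args = TrainingArguments(
    output_dir="out",
    report_to=[],
)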
Not sure about this; we really like being in control of reproducibility and not depending on the invoking machine… maybe that's not what you intend.
I believe this is because of the transformers integration:
Automatic ClearML logging enabled.
ClearML Task has been initialized.
when a task already exists
When you say use Task.current_task(), do you mean for logging? Which I'm guessing the fastai binding should do, right?
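Just to make sure we mean the same thing, here is roughly what I understand by that (a minimal sketch of manual reporting through the current task):

from clearml import Task

# Reuse the task that was already initialized (e.g. by an earlier Task.init()
# or by the framework binding) instead of creating a new one.
task = Task.current_task()
if task is not None:
    task.get_logger().report_scalar(
        title="train", series="loss", value=0.123, iteration=1)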
They also appear to be relying on the TensorBoard callback, which doesn't seem to work with distributed training.
You can get updates on the issue I opened:
https://github.com/fastai/fastai/issues/3543
But I think the better solution would probably be to create a custom ClearML callback for fastai with the best practices you think are needed…
Or try to fix the TensorBoardCallback, because for now we can't use multi-GPU because of it 😪
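Something along these lines is what I had in mind for the custom callback (just a rough sketch using fastai v2's Callback events and manual reporting through the current ClearML task; names are placeholders):

from fastai.callback.core import Callback
from clearml import Task


class ClearMLLossReporter(Callback):
    "Sketch: push the raw training loss to ClearML after every training batch."
    def before_fit(self):
        self._step = 0  # keep our own counter instead of relying on fastai internals

    def after_batch(self):
        if not self.training:  # skip validation batches
            return
        task = Task.current_task()  # reuse the task created by Task.init()
        if task is not None:
            task.get_logger().report_scalar(
                title="train", series="loss",
                value=float(self.loss), iteration=self._step)
        self._step += 1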
Hey 🙂 Thanks for the update!
What I'm missing is the point where you report to ClearML between the cast and casting back 🤔
Nothing that I think is relevant; I'm using the latest from master. It might be a new bug on their side, I wasn't sure.
I'm following this guide:
https://docs.fast.ai/distributed.html#Learner.distrib_ctx
so you run it like this: python -m fastai.launch <script>
One difference to note: I'm using TensorBoardCallback, because I believe the ClearML docs use an outdated fastai 1 version…
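Roughly what the training script looks like (a trimmed sketch; the dataset, model and names are placeholders, not our real code):

# train.py -- launched with: python -m fastai.launch train.py
from fastai.vision.all import *
from fastai.distributed import *
from fastai.callback.tensorboard import TensorBoardCallback
from clearml import Task

task = Task.init(project_name="debug", task_name="fastai-distributed")

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path / "images"), pat=r"(.+)_\d+.jpg",
    item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=error_rate,
                       cbs=[TensorBoardCallback()])

# Learner.distrib_ctx wraps the learner for multi-GPU DDP training.
with learn.distrib_ctx():
    learn.fine_tune(1)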
Glad to hear you were able to reproduce it! Waiting for your reply 🙏
@<1523701205467926528:profile|AgitatedDove14>
I only got some time to work on it now; I created a small reproducible example.
I also tried your suggestion with import accelerate, and it also had issues.
Overall, when using debug_pipeline it works OK, but neither method works without it; I think it has something to do with wrapping accelerate.
Problem with launching through the python module (your suggestion): argparse breaks.
Problem with launching using a new process - rank0 proce...
@<1523701435869433856:profile|SmugDolphin23> @<1523701205467926528:profile|AgitatedDove14>
Any updates? 🙂
Yes, and the old version only works without the patch.
I see the model on the artifacts tab, but it's not actually uploaded.
Hey, it took me some time to check it out.
I added 20 retries to check for the GPU driver; it says it finds the driver, but the task still starts without the GPU driver.
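To give an idea, the retry loop is something along these lines (a rough sketch, assuming the driver check just polls nvidia-smi):

import subprocess
import time


def wait_for_gpu_driver(retries: int = 20, delay: float = 5.0) -> bool:
    # Poll nvidia-smi until the driver answers or we run out of retries.
    for attempt in range(retries):
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        if result.returncode == 0:
            print(f"driver found on attempt {attempt + 1}")
            return True
        time.sleep(delay)
    return False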