Allright, a bit of searching later and I've found 2 things:
- You were right about the task! I've staged a fix here . It basically detects whether a task is already running (e.g. from the pipelinedecorator component) and if so, uses that task instead. We should probably do this for all of our integrations.
- But then I found another bug. Basically the pipeline decorator task would mess up the internal nested dict of the label mapping inside of the model config. You will probably have the same issue if you run the pipeline with my fix above.
So for now, we're looking into the 2nd bug, because it breaks with Hugging Face models in a pipeline. Until we sort that out, I'm going to hold off on opening a PR to HF with the first fix. Makes sense?
Thanks a lot for the example, it helped tons to be able to reproduce!
BTW the code above is from clearml github so it’s the latest
It should, but please check first. This is some code I quickly made for myself. It did make tests for it, but it would be nice to hear from someone else that it worked (as evidenced by the error above 😅 )
Hey @<1523701949617147904:profile|PricklyRaven28> I'm checking! Have you updated anything else and on which exact commit of transformers are you now?
Thanks! I'm checking now, but might take a little (meeting in between)
@<1523701435869433856:profile|SmugDolphin23>
Hey 🙂
Any update?
We are having more issues with transformers and clearml in their new version.
The step that has transformers 4.25.1
isn’t able to upload artifacts.
If we downgrade transformers==4.21.3
it works
Looks like the first issue has been solved 🙂
i think the second one still consists, still checking
` from clearml.automation import PipelineDecorator
from clearml import TaskTypes
@PipelineDecorator.component(task_type=TaskTypes.data_processing, cache=True)
def run_demo():
from transformers import AutoTokenizer, DataCollatorForTokenClassification, AutoModelForTokenClassification, TrainingArguments, Trainer
from datasets import load_dataset
dataset = load_dataset("conllpp")
model_checkpoint = 'bert-base-cased'
lr = 2e-5
num_train_epochs = 5
weight_decay = 0.01
seed = 1234
ner_feature = dataset["train"].features["ner_tags"]
label_names = ner_feature.feature.names
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
model = AutoModelForTokenClassification.from_pretrained(
model_checkpoint,
id2label=id2label,
label2id=label2id,
)
trainer_args = TrainingArguments(
'./tmp',
evaluation_strategy="epoch",
save_strategy="epoch",
learning_rate=lr,
num_train_epochs=num_train_epochs,
weight_decay=weight_decay,
seed=seed,
data_seed=seed,
load_best_model_at_end=True,
)
trainer = Trainer(
model=model,
args=trainer_args,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
trainer.train()
@PipelineDecorator.pipeline(name="StuffToDelete", project=".Dev", version="0.0.2", pipeline_execution_queue="aws_cpu")
def pipeline():
run_demo()
if name == 'main':
PipelineDecorator.set_default_execution_queue("aws_cpu")
PipelineDecorator.run_locally()
pipeline() `
This isn’t a real working example, but it shows that on clearml 1.7.2 it passed initialization part (and has an error on training stuff which is ok)
And on 1.9.0 it errors before onTypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'
SmugDolphin23 BTW, this is using clearml and huggingface’s automatic logging… didn’t log something manual
i believe this is because of this code
None
Which initialized the task if clearml is installed… but a task already exists (because of the pipeline), it will replace it
However, I actually do think I can already open the Huggingface PR in the meantime. It has actually relatively little to do with the second bug.
sounds good 🙂 I’ll soon check if this fixes our issue and update you
Hey 🙂 Thanks for the update!
what i’m missing the is the point where you report to clearml between cast and casting back 🤔
Hi PricklyRaven28 , can you try with 1.9.1rc0?
I'm getting really weird behavior now, the task seems to report correctly with the patch... but the step doesn't say "uploading" when finished... there is a "return" artifact but it doesn't exist on S3 (our file server configuration)
yeah, it gets to that error because the previous issue is saved…i’ll try to work on a new example
@<1523701435869433856:profile|SmugDolphin23> @<1523701087100473344:profile|SuccessfulKoala55> Yes, the second issue still consists, currently breaking our pipeline
` args.py #504:
for k, v in dictionary.items():
# if key is not present in the task's parameters, assume we didn't get this far when running
# in non-remote mode, and just add it to the task's parameters
if k not in parameters:
self._task.set_parameter((prefix or '') + k, v)
continue
task.py #1266:
def set_parameter(self, name, value, description=None, value_type=None):
# type: (str, str, Optional[str], Optional[Any]) -> ()
"""
Set a single Task parameter. This overrides any previous value for this parameter.
:param name: The parameter name.
:param value: The parameter value.
:param description: The parameter description.
:param value_type: The type of the parameters (cast to string and store)
"""
if not Session.check_min_api_version('2.9'):
# not supported yet
description = None
value_type = None
self._set_parameters(
{name: value}, __update=True,
__parameters_descriptions={name: description},
__parameters_types={name: value_type}
)
task.py #1227:
def create_description():
if org_param and org_param.description:
return org_param.description
created_description = ""
if org_k in descriptions:
created_description = descriptions[org_k]
if isinstance(v, Enum):
# append enum values to description
if created_description:
created_description += "\n"
created_description += "Values:\n" + ",\n".join(
[enum_key for enum_key in type(v).dict.keys() if not enum_key.startswith("_")]
)
return created_description `We can see from this code that the description will always be None (because copy_to_dict never passes a description, it defaults to None and is always put in the descriptions dict as None), and if the arg is an Enum it will always throw the exception
Hey @<1523701949617147904:profile|PricklyRaven28> , So as discussed above there were 2 issues. The first one is still waiting on the second, it's on the backlog of our devs and should be done soon(tm).
That said, in the meantime I also wanted to do fun stuff with transformers, so I've written a quick hack that deals with the bug. It's bascially 2 functions that keep track of which types of keys are in the dict.
def cast_keys_to_string(d, changed_keys=dict()):
nd = dict()
for key in d.keys():
if not isinstance(key, str):
casted_key = str(key)
changed_keys[casted_key] = key
else:
casted_key = key
if isinstance(d[key], dict):
nd[casted_key], changed_keys = cast_keys_to_string(d[key], changed_keys)
else:
nd[casted_key] = d[key]
return nd, changed_keys
def cast_keys_back(d, changed_keys):
nd = dict()
for key in d.keys():
if key in changed_keys:
original_key = changed_keys[key]
else:
original_key = key
if isinstance(d[key], dict):
nd[original_key], changed_keys = cast_keys_back(d[key], changed_keys)
else:
nd[original_key] = d[key]
return nd, changed_keys
You can then use them like this:
training_args = TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
dataloader_num_workers=0,
num_train_epochs=2,
weight_decay=0.01,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True
)
# Allow ClearML access to the training args and allow it to override the arguments for remote execution
args_class = type(training_args)
args, changed_keys = cast_keys_to_string(training_args.to_dict())
training_args = args_class(**cast_keys_back(args, changed_keys)[0])
self.trainer = Trainer(
model=self.model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["test"],
tokenizer=self.tokenizer,
data_collator=data_collator,
compute_metrics=self.compute_metrics,
)
self.trainer.train()
This "hack" in combination with the patch to Huggingface from above should work 🙂 That said, it is a hack, so a production version of this should be there soon. I'll let you know when that happens!
I tried to work on a reproducible script but then i get errors that my clearml task is already initialized (also doesn’t happen on 1.7.2)
in the meantime, we should have fixed this. I will ping you when 1.9.1 is out to try it out!
Hi @<1523701949617147904:profile|PricklyRaven28> ! We released ClearmlSDK 1.9.1 yesterday. Can you please try it?
Damn it, you're right 😅
# Allow ClearML access to the training args and allow it to override the arguments for remote execution
args_class = type(training_args)
args, changed_keys = cast_keys_to_string(training_args.to_dict())
Task.current_task().connect(args)
training_args = args_class(**cast_keys_back(args, changed_keys)[0])
@<1523701118159294464:profile|ExasperatedCrab78>
Ok. bummer to hear that it won't be included automatically in the package.
I am now experiencing a bug with the patch, not sure it's to blame... but i'm unable to save models in the pipeline.. checking if it's related
Now worries! Just so I understand fully though: you were already using the patch with success from my branch. Now that it has been merged into transformers main branch you installed it from there and that's when you started having issues with not saving models? Then installing transformers 4.21.3 fixes it (which should have the old clearml integration even before the patch?)