Hey, We Are Using Clearml 1.9.0 With Transformers 4.25.1… And We Started Getting Errors That Do Not Reproduce In Earlier Versions (Only Works In 1.7.2 All 1.8.X Don’T Work):

Answered

Hey,
We are using clearml 1.9.0 with transformers 4.25.1… and we started getting errors that do not reproduce in earlier versions (only works in 1.7.2 all 1.8.x don’t work):

File "/tmp/tmp0you5mai.py", line 29, in train_entity_exraction_model train(source=source_path.absolute(), output=model_output_path.absolute(), seed=seed, **entity_extraction_trainer) File "/usr/src/lib/entity_extractions/train.py", line 74, in train trainer.train() File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1527, in train return inner_training_loop( File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1704, in _inner_training_loop self.control = self.callback_handler.on_train_begin(args, self.state, self.control) File "/opt/conda/lib/python3.10/site-packages/transformers/trainer_callback.py", line 353, in on_train_begin return self.call_event("on_train_begin", args, state, control) File "/opt/conda/lib/python3.10/site-packages/transformers/trainer_callback.py", line 397, in call_event result = getattr(callback, event)( File "/opt/conda/lib/python3.10/site-packages/transformers/integrations.py", line 1355, in on_train_begin self.setup(args, state, model, tokenizer, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/integrations.py", line 1345, in setup self._clearml_task.connect(args, "Args") File "/opt/conda/lib/python3.10/site-packages/clearml/task.py", line 1480, in connect return method(mutable, name=name) File "/opt/conda/lib/python3.10/site-packages/clearml/task.py", line 3449, in _connect_object a_dict = self._connect_dictionary(a_dict, name) File "/opt/conda/lib/python3.10/site-packages/clearml/task.py", line 3413, in _connect_dictionary flat_dict = self._arguments.copy_to_dict(flat_dict, prefix=name) File "/opt/conda/lib/python3.10/site-packages/clearml/backend_interface/task/args.py", line 508, in copy_to_dict self._task.set_parameter((prefix or '') + k, v) File "/opt/conda/lib/python3.10/site-packages/clearml/backend_interface/task/task.py", line 1281, in set_parameter self._set_parameters( File "/opt/conda/lib/python3.10/site-packages/clearml/backend_interface/task/task.py", line 1246, in _set_parameters description=create_description(), File "/opt/conda/lib/python3.10/site-packages/clearml/backend_interface/task/task.py", line 1237, in create_description created_description += "Values:\n" + ",\n".join( TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Votes Newest

Answers 62

@<1523701435869433856:profile|SmugDolphin23>
Hey 🙂
Any update?

We are having more issues with transformers and clearml in their new version.
The step that has transformers 4.25.1 isn’t able to upload artifacts.
If we downgrade transformers==4.21.3 it works

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

for now we downgraded to 1.7.2, but of course prefer not to stay that way

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Hey 🙂 Thanks for the update!

what i’m missing the is the point where you report to clearml between cast and casting back 🤔

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Saw it was merged 🙂 One down, one to go

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

When creating it, I found that this hack should be on our side, not on Huggingface's. So I'm only going to fix issue 1 with the PR, issue 2 is ours 🙂

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

Good to hear 🙏

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

However, I actually do think I can already open the Huggingface PR in the meantime. It has actually relatively little to do with the second bug.

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

Allright, a bit of searching later and I've found 2 things:

You were right about the task! I've staged a fix here . It basically detects whether a task is already running (e.g. from the pipelinedecorator component) and if so, uses that task instead. We should probably do this for all of our integrations.
But then I found another bug. Basically the pipeline decorator task would mess up the internal nested dict of the label mapping inside of the model config. You will probably have the same issue if you run the pipeline with my fix above.
So for now, we're looking into the 2nd bug, because it breaks with Hugging Face models in a pipeline. Until we sort that out, I'm going to hold off on opening a PR to HF with the first fix. Makes sense?

Thanks a lot for the example, it helped tons to be able to reproduce!

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

Hey @<1523701949617147904:profile|PricklyRaven28> , So as discussed above there were 2 issues. The first one is still waiting on the second, it's on the backlog of our devs and should be done soon(tm).

That said, in the meantime I also wanted to do fun stuff with transformers, so I've written a quick hack that deals with the bug. It's bascially 2 functions that keep track of which types of keys are in the dict.

def cast_keys_to_string(d, changed_keys=dict()):
    nd = dict()
    for key in d.keys():
        if not isinstance(key, str):
            casted_key = str(key)
            changed_keys[casted_key] = key
        else:
            casted_key = key
        if isinstance(d[key], dict):
            nd[casted_key], changed_keys = cast_keys_to_string(d[key], changed_keys)
        else:
            nd[casted_key] = d[key]
    return nd, changed_keys

def cast_keys_back(d, changed_keys):
    nd = dict()
    for key in d.keys():
        if key in changed_keys:
            original_key = changed_keys[key]
        else:
            original_key = key
        if isinstance(d[key], dict):
            nd[original_key], changed_keys = cast_keys_back(d[key], changed_keys)
        else:
            nd[original_key] = d[key]
    return nd, changed_keys

You can then use them like this:

        training_args = TrainingArguments(
            output_dir="my_awesome_model",
            learning_rate=2e-5,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            dataloader_num_workers=0,
            num_train_epochs=2,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True
        )

        # Allow ClearML access to the training args and allow it to override the arguments for remote execution
        args_class = type(training_args)
        args, changed_keys = cast_keys_to_string(training_args.to_dict())
        training_args = args_class(**cast_keys_back(args, changed_keys)[0])

        self.trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=tokenized_dataset["train"],
            eval_dataset=tokenized_dataset["test"],
            tokenizer=self.tokenizer,
            data_collator=data_collator,
            compute_metrics=self.compute_metrics,
        )

        self.trainer.train()

This "hack" in combination with the patch to Huggingface from above should work 🙂 That said, it is a hack, so a production version of this should be there soon. I'll let you know when that happens!

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

No worries! And thanks for putting in the time.

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

i believe this is because of this code
None

Which initialized the task if clearml is installed… but a task already exists (because of the pipeline), it will replace it

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

` args.py #504:
for k, v in dictionary.items():
# if key is not present in the task's parameters, assume we didn't get this far when running
# in non-remote mode, and just add it to the task's parameters
if k not in parameters:
self._task.set_parameter((prefix or '') + k, v)
continue

task.py #1266:
def set_parameter(self, name, value, description=None, value_type=None):
# type: (str, str, Optional[str], Optional[Any]) -> ()
"""
Set a single Task parameter. This overrides any previous value for this parameter.

    :param name: The parameter name.
    :param value: The parameter value.
    :param description: The parameter description.
    :param value_type: The type of the parameters (cast to string and store)
    """
    if not Session.check_min_api_version('2.9'):
        # not supported yet
        description = None
        value_type = None

    self._set_parameters(
        {name: value}, __update=True,
        __parameters_descriptions={name: description},
        __parameters_types={name: value_type}
    )

task.py #1227:
def create_description():
if org_param and org_param.description:
return org_param.description
created_description = ""
if org_k in descriptions:
created_description = descriptions[org_k]
if isinstance(v, Enum):
# append enum values to description
if created_description:
created_description += "\n"
created_description += "Values:\n" + ",\n".join(
[enum_key for enum_key in type(v).dict.keys() if not enum_key.startswith("_")]
)
return created_description `We can see from this code that the description will always be None (because copy_to_dict never passes a description, it defaults to None and is always put in the descriptions dict as None), and if the arg is an Enum it will always throw the exception

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Hey @<1523701949617147904:profile|PricklyRaven28> , about the S3 loading issue. The path to the model in the artifact tab, is it an S3 bucket or a local path?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					EnthusiasticShrimp49
				
					0

Just for reference, the main issue is that ClearML does not allow non-string types as dict keys for its configuration. Usually the labeling mapping does have ints as keys. Which is why we need to cast them to strings first, then pass them to ClearML then cast them back.

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

will check

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

BTW the code above is from clearml github so it’s the latest

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Hi @<1523701949617147904:profile|PricklyRaven28> ! We released ClearmlSDK 1.9.1 yesterday. Can you please try it?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					SmugDolphin23
				
					0

@<1523701118159294464:profile|ExasperatedCrab78>
Hey 🙂
Any updates on this? We need to use a new version of transformers because of another bug they have in an old version. so we can’t use the old transformers version anymore.

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Hi PricklyRaven28 , can you try with 1.9.1rc0?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Hi PricklyRaven28 ! What dict do you connect? Do you have a small script we could use to reproduce?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					SmugDolphin23
				
					0

We'll check it out 👍

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Looks like the first issue has been solved 🙂

i think the second one still consists, still checking

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Hey @<1523701949617147904:profile|PricklyRaven28> I'm checking! Have you updated anything else and on which exact commit of transformers are you now?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

@<1523701435869433856:profile|SmugDolphin23> @<1523701087100473344:profile|SuccessfulKoala55> Yes, the second issue still consists, currently breaking our pipeline

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

It should, but please check first. This is some code I quickly made for myself. It did make tests for it, but it would be nice to hear from someone else that it worked (as evidenced by the error above 😅 )

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

S3 as it should be

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

This is the next step not being able to find the output of the last step

ValueError: Could not retrieve a local copy of artifact return_object, failed downloading

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Could you please run the misbehaving example, try to add a breakpoint in clearml/backend_interface/task/task.py in Task.update_output_model on the line with url = output_model.update_weights( , and tell me what the value of model_path is? In case you're using virtual environments, clearml library should be installed somewhere in <virtual env directory>/lib/python3.10/site-packages/clearml/

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					EnthusiasticShrimp49
				
					0

Thanks! I'm checking now, but might take a little (meeting in between)

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					ExasperatedCrab78
				
					0
					 × 1

sounds good 🙂 I’ll soon check if this fixes our issue and update you

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PricklyRaven28
				
					0
					 × 1

Show more results

Write your answer

170K Views

62 Answers

2 years ago