I am currently on vacation, so I'll ask my teammates. If not, I'll get to it next week.
Could you please run the misbehaving example, add a breakpoint in clearml/backend_interface/task/task.py inside Task.update_output_model on the line with url = output_model.update_weights(, and tell me what the value of model_path is? In case you're using a virtual environment, the clearml library should be installed somewhere under <virtual env directory>/lib/python3.10/site-packages/clearml/
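If editing the installed package is awkward, here is a rough debugging sketch of my own (not an official ClearML utility) that wraps the method and prints the value instead; it assumes model_path arrives as the first positional or a keyword argument, which you should verify against your installed clearml version:
from clearml.backend_interface.task.task import Task as _BackendTask

_original_update_output_model = _BackendTask.update_output_model

def _debug_update_output_model(self, *args, **kwargs):
    # Print the model_path the integration is about to upload, then call the original method.
    # Assumption: model_path is the first positional argument or passed as a keyword.
    model_path = kwargs.get("model_path", args[0] if args else None)
    print("update_output_model called with model_path =", model_path)
    return _original_update_output_model(self, *args, **kwargs)

_BackendTask.update_output_model = _debug_update_output_model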
Hey @<1523701949617147904:profile|PricklyRaven28>, about the S3 loading issue: is the path to the model in the artifacts tab an S3 bucket or a local path?
Hi @<1523701949617147904:profile|PricklyRaven28> just letting you know I still have this on my TODO, I'll update you as soon as I have something!
Yes, and the old version only works without the patch.
I see the model on the artifacts tab, but it's not actually uploaded.
No worries! Just so I understand fully though: you were already using the patch successfully from my branch. Now that it has been merged into the transformers main branch, you installed it from there, and that's when you started having issues with models not being saved? And installing transformers 4.21.3 fixes it (which should have the old ClearML integration, from even before the patch)?
Nothing that I think is relevant; I'm using the latest from master. It might be a new bug on their side, but I wasn't sure.
Hey @<1523701949617147904:profile|PricklyRaven28>, I'm checking! Have you updated anything else, and which exact commit of transformers are you on now?
Confirming that only downgrading to transformers==4.21.3, without the patch, worked...
This is a time bomb that eventually we won't be able to ignore... we will need to use new transformers code
I'm getting really weird behavior now: the task seems to report correctly with the patch... but the step doesn't say "uploading" when finished... there is a "return" artifact, but it doesn't exist on S3 (our file server configuration).
I'm working with the patch and installing transformers from GitHub.
@<1523701118159294464:profile|ExasperatedCrab78>
OK, bummer to hear that it won't be included automatically in the package.
I am now experiencing a bug with the patch, not sure it's to blame... but I'm unable to save models in the pipeline... checking if it's related.
It's been accepted into master, but indeed it hasn't been released yet!
As for the other issue, it seems like we won't be adding support for non-string dict keys anytime soon. I'm thinking of adding a specific example/tutorial on how to work with Huggingface + ClearML so people can do it themselves.
For now (using the patch), the only thing you need to be careful about is not to connect a dict or object with ints as keys. If you do need to (e.g. usually Huggingface models need the id2label dict somewhere), just make sure to cast it to string before connecting it to ClearML, and cast it back to int directly after, so that when ClearML changes the value it's properly taken care of. My previous sample code is still valid!
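For example, a minimal sketch of that workaround for a typical id2label mapping (the variable names and values here are illustrative, not from your code):
from clearml import Task

# Typical Huggingface label mapping with int keys (illustrative values)
id2label = {0: "negative", 1: "positive"}

# Cast the int keys to strings before connecting, since ClearML configurations
# only support string keys
id2label_str = {str(k): v for k, v in id2label.items()}
Task.current_task().connect(id2label_str, name="id2label")

# Cast straight back to int so any values overridden by ClearML are picked up
# and downstream code still sees int keys
id2label = {int(k): v for k, v in id2label_str.items()}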
@<1523701118159294464:profile|ExasperatedCrab78>
Hey again!
I believe the transformers patch wasn't released yet, right? We're running into a problem where we need new features from transformers but can't use them because of this.
Saw it was merged! One down, one to go.
@<1523701949617147904:profile|PricklyRaven28> Please use this patch instead of the one previously shared. It excludes the dict hack :)
Sounds good! I'll check soon whether this fixes our issue and update you.
When creating it, I found that this hack should be on our side, not on Huggingface's. So I'm only going to fix issue 1 with the PR; issue 2 is ours.
However, I actually do think I can already open the Huggingface PR in the meantime. It has relatively little to do with the second bug.
It should, but please check first. This is some code I quickly made for myself. I did make tests for it, but it would be nice to hear from someone else that it worked (as evidenced by the error above).
That makes more sense!
Would this work now as a workaround until the version is released?
Damn it, you're right!
# Allow ClearML access to the training args and allow it to override the arguments for remote execution
args_class = type(training_args)
args, changed_keys = cast_keys_to_string(training_args.to_dict())
Task.current_task().connect(args)
training_args = args_class(**cast_keys_back(args, changed_keys)[0])
Hey! Thanks for the update!
What I'm missing is the point where you report to ClearML, between casting the keys and casting them back.
Just for reference, the main issue is that ClearML does not allow non-string types as dict keys for its configuration. Usually the label mapping does have ints as keys, which is why we need to cast them to strings first, then pass them to ClearML, then cast them back.
Hey @<1523701949617147904:profile|PricklyRaven28>, so as discussed above there were 2 issues. The first one is still waiting on the second; it's on the backlog of our devs and should be done soon(tm).
That said, in the meantime I also wanted to do fun stuff with transformers, so I've written a quick hack that deals with the bug. It's basically 2 functions that keep track of which types of keys are in the dict.
def cast_keys_to_string(d, changed_keys=None):
    # Recursively cast all non-string keys to strings so ClearML can store the dict,
    # recording the original keys in changed_keys so they can be restored later
    if changed_keys is None:
        changed_keys = dict()
    nd = dict()
    for key in d.keys():
        if not isinstance(key, str):
            casted_key = str(key)
            changed_keys[casted_key] = key
        else:
            casted_key = key
        if isinstance(d[key], dict):
            nd[casted_key], changed_keys = cast_keys_to_string(d[key], changed_keys)
        else:
            nd[casted_key] = d[key]
    return nd, changed_keys
def cast_keys_back(d, changed_keys):
    # Recursively restore the original (non-string) keys recorded in changed_keys
    nd = dict()
    for key in d.keys():
        if key in changed_keys:
            original_key = changed_keys[key]
        else:
            original_key = key
        if isinstance(d[key], dict):
            nd[original_key], changed_keys = cast_keys_back(d[key], changed_keys)
        else:
            nd[original_key] = d[key]
    return nd, changed_keys
You can then use them like this:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    dataloader_num_workers=0,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)
# Allow ClearML access to the training args and allow it to override the arguments for remote execution
args_class = type(training_args)
args, changed_keys = cast_keys_to_string(training_args.to_dict())
training_args = args_class(**cast_keys_back(args, changed_keys)[0])
self.trainer = Trainer(
    model=self.model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=self.tokenizer,
    data_collator=data_collator,
    compute_metrics=self.compute_metrics,
)
self.trainer.train()
This "hack" in combination with the patch to Huggingface from above should work π That said, it is a hack, so a production version of this should be there soon. I'll let you know when that happens!
@<1523701118159294464:profile|ExasperatedCrab78>
Hey!
Any updates on this? We need to use a new version of transformers because of another bug they have in an old version, so we can't use the old transformers version anymore.
@<1523701118159294464:profile|ExasperatedCrab78> Sorry, only saw this now.
Thanks for checking it!
Glad to see you found the issue; hope you find a way to fix the second one. For now we will continue using the previous version.
Would be glad if you could post when everything is fixed so we can advance our version.