Hi @<1523701118159294464:profile|ExasperatedCrab78> ,
It worked after installing the latest Hugging Face Transformers from the GitHub main branch. Thank you so very much for your support.
Ah I see 😄 I have submitted a ClearML patch to Huggingface transformers.
It is merged, but not in a release yet. Would you mind checking if it works if you install transformers from github? (aka the latest master version)
I still have my tasks I ran remotely and they don't show any uncommitted changes. @<1540142651142049792:profile|BurlyHorse22> are you sure the remote machine is running transformers from the latest github branch, instead of from the package?
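One quick way to check which build the remote environment actually picked up is to print the installed version from inside the task. This is only a rough sketch: it assumes that installs from the GitHub main branch carry a `.dev` suffix in their version string while released packages do not, and the helper names `looks_like_dev_build` and `installed_version` are made up for illustration.

```python
from importlib import metadata
from typing import Optional


def looks_like_dev_build(version: str) -> bool:
    """Heuristic (assumption): source installs from main report a '.dev' version."""
    return ".dev" in version


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("transformers")
if version is None:
    print("transformers is not installed in this environment")
elif looks_like_dev_build(version):
    print(f"transformers {version} looks like a development (source) build")
else:
    print(f"transformers {version} looks like a released package")
```

Running this at the top of the training script would put the version into the task's console log, so the agent run can be compared with the local run.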
If it all looks fine, can you please install transformers from this repo (branch main) and rerun? It might be that not all my fixes came through
Hi @<1540142651142049792:profile|BurlyHorse22> , it looks like an error in your code is causing the traceback. What is happening at the point where it fails?
Hi @<1523701118159294464:profile|ExasperatedCrab78> ,
I flagged some examples and created a new dataset. Then I cloned the DistilBert training task and enqueued it to run on an agent, but it failed with the error below:
{'eval_loss': 0.6758520603179932, 'eval_accuracy': 0.5912839158071777, 'eval_runtime': 232.0297, 'eval_samples_per_second': 871.246, 'eval_steps_per_second': 54.454, 'epoch': 1.0}
50%|█████ | 63/126 [03:58<00:04, 14.88it/s] Traceback (most recent call last):
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/task_repository/sarcasm_detector.git/train_transformer.py", line 141, in <module>
sarcasm_trainer.train()
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/task_repository/sarcasm_detector.git/train_transformer.py", line 134, in train
self.trainer.train()
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1631, in train
return inner_training_loop(
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1990, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2293, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2771, in save_model
self._save(output_dir)
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2823, in _save
self.model.save_pretrained(output_dir, state_dict=state_dict)
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1708, in save_pretrained
model_to_save.config.save_pretrained(save_directory)
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/configuration_utils.py", line 456, in save_pretrained
self.to_json_file(output_config_file, use_diff=True)
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/configuration_utils.py", line 838, in to_json_file
writer.write(self.to_json_string(use_diff=use_diff))
File "/home/ubuntu/.clearml/venvs-builds.2/3.10/lib/python3.10/site-packages/transformers/configuration_utils.py", line 824, in to_json_string
return json.dumps(config_dict, indent=2, sort_keys=True) + "\n"
File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 353, in _iterencode_dict
items = sorted(dct.items())
TypeError: '<' not supported between instances of 'str' and 'int'
Hi @<1523701070390366208:profile|CostlyOstrich36> ,
The same code runs fine when I run it directly from VS Code. But when I clone and enqueue the task on an agent, this error occurs. My agent is running inside an EC2 instance with a GPU. I am using the same Python virtual environment when running from VS Code and when running the agent.
I am not aware of any way to debug my code when I clone and enqueue a task that runs on an agent.
@<1523701118159294464:profile|ExasperatedCrab78> The dataset loading issue no longer comes up now that I have started using the data shared in the GitHub repo. Thanks a lot for the quick response.
But now I am facing a different issue: there is a conflict when creating the ClearML Task.
Current task already created and requested project name 'HuggingFace Transformers' does not match current project name 'sarcasm_detector'. If you wish to create additional tasks use `Task.create`, or close the current task with `task.close()` before calling `Task.init(...)`
Note: I do not see the project name "HuggingFace Transformers" mentioned anywhere in the code either.
Hi @<1540142651142049792:profile|BurlyHorse22> I think I know what is happening. ClearML does not support dict keys of any type other than string. This is why I made these functions to cast the dict keys to string and back after we connect them to ClearML.
What I think happens is that id2label is a dict with ints as keys, and it is not cast to strings before being given to the model, which in turn gets connected to ClearML by the internal Huggingface integration.
I'm checking now what I did about it in my branch, it seems maybe not everything was pushed yet!
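To make the explanation above concrete, here is a minimal sketch of the failure and of the cast-to-string-and-back idea. It is not the actual code from the repo: the helper names `keys_to_str` and `keys_to_int` and the label values are made up for illustration, and the mixed-key dict is just one assumed way a config could end up with both `str` and `int` keys.

```python
import json


def keys_to_str(d: dict) -> dict:
    """Cast every key to str so the dict is safe to connect and serialize."""
    return {str(k): v for k, v in d.items()}


def keys_to_int(d: dict) -> dict:
    """Cast string keys back to int once the dict has been connected."""
    return {int(k): v for k, v in d.items()}


# Hypothetical label mapping with int keys, as a model config would expect.
id2label = {0: "NORMAL", 1: "SARCASTIC"}

# Assumed failure mode: some keys became strings while others stayed ints.
mixed = {0: "NORMAL", "1": "SARCASTIC"}
raised = False
try:
    # sort_keys=True has to compare an int key with a str key, which fails.
    json.dumps(mixed, indent=2, sort_keys=True)
except TypeError as err:
    raised = True
    print(f"TypeError: {err}")

# Casting all keys to str first makes serialization work ...
serialized = json.dumps(keys_to_str(id2label), indent=2, sort_keys=True)

# ... and casting back recovers the original int-keyed mapping.
restored = keys_to_int(json.loads(serialized))
print(restored == id2label)
```

The point of the round trip is that the dict only has string keys while it is connected to the task, and gets its int keys back before the model config uses it.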
@<1540142651142049792:profile|BurlyHorse22> do you mean the one referred to in the video? (I think this is the raw data on Kaggle)
Yes, the one referred to in the video. But @<1523701118159294464:profile|ExasperatedCrab78> mentioned (at 3:45 in the YouTube video) that he was using it after some preprocessing. The raw data from Kaggle is not getting loaded by the Huggingface load_dataset() function. Please find the screenshot of the error while running train_sklearn.py and train_transformer.py. So I am assuming it will work if I get the preprocessed data.
Great to hear! Then it comes down to waiting for the next Hugging Face release!