Hi all, I'm trying to save my model checkpoints during runtime but am running into a confusing snag. I'm using the Hugging Face architecture for a transformer, using their training module to control training. In the training args, I have the...

Answered

Hi all

I'm trying to save my model checkpoints during runtime but am running into a confusing snag.
I'm using the Hugging Face architecture for a transformer, with their training module controlling training. In the training args, I have output_dir set to somefolder, and during runtime I can see that the checkpoint files are being saved to somefolder/checkpoint_** as expected.

The output_uri for ClearML is set to a location on the file server, and artefacts are indeed being saved there. This is the confusing part: artefacts are being saved in sometrainingrun/model/ on the file server, with no sign of the checkpoint folder. Some of the .pt files are in that folder along with the training args, but other expected contents, including the model.safetensors file and the JSONs, are missing. A final point that isn't clear is whether ClearML is saving artefacts from the final checkpoint or the best checkpoint.

Thanks in advance.

  				
Posted one year ago by FantasticSeaurchin8


Answers 17

console output:

clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
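One way to see exactly which files never reached the file server is to compare the upload lines in the console output against the files a checkpoint folder is expected to contain. The following is a minimal sketch, not ClearML functionality: the `EXPECTED_FILES` set is an assumption about what a Hugging Face Trainer checkpoint holds (it varies with the transformers version and model), and the helper names are hypothetical.

```python
import re

# Assumed contents of a Hugging Face Trainer checkpoint folder; the real set
# depends on the transformers version and on the model/processor in use.
EXPECTED_FILES = {
    "model.safetensors",
    "config.json",
    "trainer_state.json",
    "optimizer.pt",
    "scheduler.pt",
    "rng_state.pth",
    "training_args.bin",
}

# Matches the upload lines shown in the console output above.
UPLOAD_LINE = re.compile(r"Completed model upload to (\S+)")


def uploaded_files(log_text):
    """Return the file names that the log reports as uploaded."""
    return {m.group(1).rsplit("/", 1)[-1] for m in UPLOAD_LINE.finditer(log_text)}


def missing_files(log_text, expected=EXPECTED_FILES):
    """Return the expected files that never appear in an upload line."""
    return sorted(set(expected) - uploaded_files(log_text))


log = """\
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
"""

print(missing_files(log))
```

Run against the log above, this reports the weights file and the JSON files as never uploaded, which matches what is missing on the file server.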

  				
Posted one year ago by FantasticSeaurchin8

I need to work out whether I have to reconfigure something and retrain, or whether my files (the actual model tensors) are recoverable.

  				
Posted one year ago by FantasticSeaurchin8

queued with:

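from clearml import Task

# Create the task without running it locally (placeholder values as posted)
task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
    packages=package_list,
    docker="docker_gpu_image",
    docker_args=["--network=host"],
)
# Destination for uploaded model files
task.output_uri = "filer_server"
# Enqueue the task on the "training" queue for an agent to pick up
Task.enqueue(task, queue_name="training")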

  				
Posted one year ago by FantasticSeaurchin8

You can see here that it's missing some files.

  				
Posted one year ago by FantasticSeaurchin8

Hmm, I can probably provide snippets.

  				
Posted one year ago by FantasticSeaurchin8

Further context: it saves optimizer.pt, scheduler.pt, rng_state.pth, and training_args.bin, but I can't locate model.safetensors or the metadata JSONs.

  				
Posted one year ago by FantasticSeaurchin8

default_output_uri in the conf is set to the same file server address as above.

  				
Posted one year ago by FantasticSeaurchin8

Hi @FantasticSeaurchin8, can you provide a sample script that reproduces this behaviour?

  				
Posted one year ago by CostlyOstrich36

So we have managed to get whole checkpoint folders to save by removing save_total_limit from the training args; this seems to save checkpoint folders with all files in them. However, we now have a ballooning server.

We did discover this None
and are wondering whether there's some nuance in auto-tracking that needs to be circumvented.
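For context on why removing save_total_limit makes storage grow: with a limit set, older checkpoint-* folders are rotated out as new ones are written. The sketch below is a simplified illustration of a keep-last-N policy, not the Trainer's actual rotation code; the function name is hypothetical, and it only lists the folders that would be removed (a dry run) rather than deleting anything.

```python
import re
from pathlib import Path

# Folder names such as checkpoint-198; the number is the global step.
CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def checkpoints_to_delete(output_dir, keep_last, best_checkpoint=None):
    """List checkpoint folders a keep-last-N policy would remove, oldest first.

    The best checkpoint, if given, is always kept in addition to the last N.
    """
    found = []
    for path in Path(output_dir).iterdir():
        match = CHECKPOINT_DIR.match(path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    found.sort()  # ascending by step number, not by string order
    ordered = [path for _, path in found]

    protected = set(ordered[-keep_last:]) if keep_last > 0 else set()
    if best_checkpoint is not None:
        protected.add(Path(best_checkpoint))
    return [path for path in ordered if path not in protected]
```

A call such as `checkpoints_to_delete("somefolder", keep_last=2, best_checkpoint="somefolder/checkpoint-198")` returns the folders that could be pruned locally; whether and how the matching files on the file server get cleaned up is a separate question that this sketch does not address.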

  				
Posted one year ago by FantasticSeaurchin8

None ?

  				
Posted one year ago by FantasticSeaurchin8

It would seem they are related, but I can't see further details of this bug. Either doing a manual artefact upload with the task or turning TensorBoard tracking off in the Hugging Face trainer seemed to enable JSON tracking within the checkpoints. But I would have thought the TensorBoard behaviour wasn't desired.
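The manual artefact upload mentioned here could look roughly like the sketch below. It assumes `task` is a ClearML Task (for example the object returned by Task.init) and relies on `upload_artifact` accepting a folder path, in which case ClearML zips the folder before uploading. The helper name `upload_checkpoint` is hypothetical.

```python
from pathlib import Path


def upload_checkpoint(task, checkpoint_dir):
    """Upload a whole checkpoint folder as a single artifact on the given task.

    `task` is assumed to be a clearml Task; the artifact is named after the
    folder (for example "checkpoint-198").
    """
    checkpoint_dir = Path(checkpoint_dir)
    if not checkpoint_dir.is_dir():
        raise FileNotFoundError(f"No checkpoint folder at {checkpoint_dir}")
    # Passing a folder path asks ClearML to zip the folder and upload the archive.
    task.upload_artifact(name=checkpoint_dir.name, artifact_object=str(checkpoint_dir))
    return checkpoint_dir.name
```

Called after training as `upload_checkpoint(task, trainer.state.best_model_checkpoint)`, this would store the best checkpoint folder, JSON files included, independently of what the automatic model tracking picks up.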

  				
Posted one year ago by FantasticSeaurchin8

called with:


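from clearml import Task
from transformers import AutoImageProcessor

# Initialise the ClearML task; model files are uploaded to output_uri
task = Task.init(
    project_name=project_name, task_name=task_name, output_uri="fileserver_address"
)

# Register the configuration dictionary with the task
task.connect(config)

checkpoint = config.get("model_path")

image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    num_labels=config.get("class_number"),
)

best_model = training(checkpoint, image_processor)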

  				
Posted one year ago by FantasticSeaurchin8

So turning report_to="tensorboard" off seemed to solve the issue, as in the training run now saves checkpoints as you would expect. That doesn't seem like desired behaviour.

  				
Posted one year ago by FantasticSeaurchin8

Hi @FantasticSeaurchin8
Does this relate only to this issue:
https://github.com/coqui-ai/Trainer/issues/7
Or is it a ClearML SDK issue?

  				
Posted one year ago by AgitatedDove14

training function:

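import evaluate
import numpy as np
from transformers import (
    AutoModelForImageClassification,
    DefaultDataCollator,
    Trainer,
    TrainingArguments,
)


def training(checkpoint, image_processor):

    data_test_train, labels, label_to_id, id_to_label = pre_process()

    # Load the pretrained model with a classification head sized to the labels
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id_to_label,
        label2id=label_to_id,
    )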
    def metrics(eval_pred):
        metric_val = config.get("eval_metric")
        metric = evaluate.load(metric_val)
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        if metric_val == "accuracy":
            return metric.compute(predictions=predictions, references=labels)
        else:
            return metric.compute(
                predictions=predictions, references=labels, average="weighted"
            )

    data_collator = DefaultDataCollator()

    training_args = TrainingArguments(
        output_dir="somefolder",
        remove_unused_columns=False,
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=config.get("learning_rate"),
        per_device_train_batch_size=config.get("train_batch_size"),
        gradient_accumulation_steps=config.get("gradient_steps"),
        per_device_eval_batch_size=config.get("eval_batch_size"),
        num_train_epochs=config.get("epochs"),
        warmup_ratio=config.get("warmup_ratio"),
        logging_steps=config.get("logging_steps"),
        save_total_limit=config.get("save_total"),
        load_best_model_at_end=True,
        report_to="tensorboard",
        metric_for_best_model=config.get("eval_metric"),
        push_to_hub=False,
    )
    trainer = Trainer(
        model=model.to(device),
        args=training_args,
        data_collator=data_collator,
        train_dataset=data_test_train["train"],
        eval_dataset=data_test_train["test"],
        tokenizer=image_processor,
        compute_metrics=metrics,
    )
    print("training")
    trainer.train()
    print("save_model")
    trainer.save_model()
    best_model = trainer.state.best_model_checkpoint
    print(best_model)
    classifification_report(data_test_train["test"], model, best_model)
    return best_model
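On the open question of final versus best checkpoint: the Trainer records which checkpoint it considers best in the trainer_state.json file written into each checkpoint folder, so that file can be inspected after the run. The sketch below reads the `best_model_checkpoint` and `best_metric` fields from it; the helper name is hypothetical, and it assumes the checkpoint folder was saved completely.

```python
import json
from pathlib import Path


def best_checkpoint_from_state(checkpoint_dir):
    """Return (best_model_checkpoint, best_metric) from trainer_state.json.

    Either value is None if the field is absent, for example when no
    evaluation ran before the checkpoint was written.
    """
    state_file = Path(checkpoint_dir) / "trainer_state.json"
    with state_file.open(encoding="utf-8") as handle:
        state = json.load(handle)
    return state.get("best_model_checkpoint"), state.get("best_metric")
```

Since the function above sets load_best_model_at_end=True, the model that trainer.save_model() writes to output_dir after training should be the best one rather than the last one, but comparing the path returned here with the folder names uploaded to the file server is a more direct check.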

  				
Posted one year ago by FantasticSeaurchin8

But that doesn't explain why the model JSON files are missing.
@CostlyOstrich36, do you have any ideas? Thank you.

  				
Posted one year ago by RattyBluewhale45

Try save_safetensors=False in TrainingArguments. I'm not sure whether ClearML supports safetensors.
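The two workarounds raised in this thread (switching off the reporting integration and switching off safetensors) can be kept as optional overrides so each can be tested on its own. This is only a sketch with a hypothetical helper name; it builds keyword arguments and does not itself confirm that either setting fixes the upload behaviour.

```python
def workaround_overrides(disable_reporting=False, disable_safetensors=False):
    """Build TrainingArguments keyword overrides for the workarounds discussed."""
    overrides = {}
    if disable_reporting:
        # "none" turns off all reporting integrations, TensorBoard included.
        overrides["report_to"] = "none"
    if disable_safetensors:
        # Weights are then written with torch.save as pytorch_model.bin.
        overrides["save_safetensors"] = False
    return overrides
```

Usage would be along the lines of `TrainingArguments(output_dir="somefolder", **workaround_overrides(disable_safetensors=True))`, with the rest of the arguments unchanged, so that one setting is changed per run and its effect on the uploaded files can be compared.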

  				
Posted one year ago by RattyBluewhale45
