Answered
Hi All Im Trying To Save My Model Checkpoints During Runtime But Am Running Into A Confusing Snag. I'M Using The Huggingface Architecture For A Transformer. Using Their Training Module To Control Training. In The Training Args, I Have The

Hi all

I'm trying to save my model checkpoints during runtime but am running into a confusing snag.
I'm using the HuggingFace architecture for a transformer and their training module to control training. In the training args, I have output_dir set to somefolder. During runtime I can see that the checkpoint files are being saved to somefolder/checkpoint_** as expected.

The output_uri for ClearML is set to a location on the file server, and artefacts are indeed being saved there. This is the confusing part: artefacts are being saved in sometrainingrun/model/ on the file server, with no sign of the checkpoint folder. Some of the .pt files are in that folder along with the training args, but other expected contents, including the model.safetensors file and the JSONs, are missing. A final point that isn't clear is whether ClearML is saving artefacts from the final checkpoint or from the best checkpoint.

thanks in advance

  
  
Posted 3 months ago

Answers 17


It would seem they are related, but I can't see the further details of this bug. Either doing a manual artefact upload with the task or turning TensorBoard tracking off in the Hugging Face trainer seemed to enable JSON tracking within the checkpoints. But I would have thought the TensorBoard behavior wasn't desired.
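For anyone trying the manual upload route, here is a minimal sketch of what it could look like. It assumes `task` is the object returned by `Task.init()` and relies on `upload_artifact()`, which zips a folder path before uploading it. The helper name `upload_checkpoint` is hypothetical and not part of ClearML.

```python
from pathlib import Path


def upload_checkpoint(task, checkpoint_dir, name=None):
    """Upload a whole checkpoint folder as a single ClearML artifact.

    `task` is assumed to be the object returned by clearml's Task.init().
    The artifact is named after the folder unless `name` is given.
    """
    checkpoint_dir = Path(checkpoint_dir)
    if not checkpoint_dir.is_dir():
        raise FileNotFoundError(f"No checkpoint folder at {checkpoint_dir}")
    artifact_name = name or checkpoint_dir.name
    # A folder path passed as artifact_object is zipped and uploaded.
    task.upload_artifact(name=artifact_name, artifact_object=str(checkpoint_dir))
    return artifact_name
```

Calling `upload_checkpoint(task, trainer.state.best_model_checkpoint)` after `trainer.train()` would then store the best checkpoint folder, JSON files included, regardless of what the automatic tracking picks up.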

  
  
Posted 3 months ago

called with:


task = Task.init(
    project_name=project_name, task_name=task_name, output_uri="fileserver_address"
)


task.connect(config)


checkpoint = config.get("model_path")

image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    num_labels=config.get("class_number"),
)

best_model = training(checkpoint, image_processor)
  
  
Posted 3 months ago

You can see here it's missing some files.

  
  
Posted 3 months ago

I need to work out whether I have to reconfigure something and retrain, or whether my files (the actual model tensors) are recoverable.

  
  
Posted 3 months ago

default_outputdir in the conf is set to the file server address, the same one as above.

  
  
Posted 3 months ago

training function:

def training(checkpoint, image_processor):

    data_test_train, labels, label_to_id, id_to_label = pre_process()

    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id_to_label,
        label2id=label_to_id,
    )

    def metrics(eval_pred):
        metric_val = config.get("eval_metric")
        metric = evaluate.load(metric_val)
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        if metric_val == "accuracy":
            return metric.compute(predictions=predictions, references=labels)
        else:
            return metric.compute(
                predictions=predictions, references=labels, average="weighted"
            )

    data_collator = DefaultDataCollator()

    training_args = TrainingArguments(
        output_dir="somefolder",
        remove_unused_columns=False,
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=config.get("learning_rate"),
        per_device_train_batch_size=config.get("train_batch_size"),
        gradient_accumulation_steps=config.get("gradient_steps"),
        per_device_eval_batch_size=config.get("eval_batch_size"),
        num_train_epochs=config.get("epochs"),
        warmup_ratio=config.get("warmup_ratio"),
        logging_steps=config.get("logging_steps"),
        save_total_limit=config.get("save_total"),
        load_best_model_at_end=True,
        report_to="tensorboard",
        metric_for_best_model=config.get("eval_metric"),
        push_to_hub=False,
    )
    trainer = Trainer(
        model=model.to(device),
        args=training_args,
        data_collator=data_collator,
        train_dataset=data_test_train["train"],
        eval_dataset=data_test_train["test"],
        tokenizer=image_processor,
        compute_metrics=metrics,
    )
    print("training")
    trainer.train()
    print("save_model")
    trainer.save_model()
    best_model = trainer.state.best_model_checkpoint
    print(best_model)
    classifification_report(data_test_train["test"], model, best_model)
    return best_model
  
  
Posted 3 months ago

Try save_safetensors=False in TrainingArguments. Not sure if ClearML supports safetensors.

  
  
Posted 3 months ago

Further context: it saves optimizer.pt, scheduler.pt, rng_state.pth, and training_args.bin, but I can't locate model.safetensors or the meta JSONs.
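A quick way to see exactly which files are missing from a checkpoint folder, either the local one or a downloaded copy from the file server, is a small check like the sketch below. The file list is an assumption about what a Hugging Face Trainer checkpoint for an image classifier usually contains (config.json, preprocessor_config.json and trainer_state.json are not named in this thread), and `missing_checkpoint_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed contents of a Trainer checkpoint for an image classifier;
# adjust this list to match what your setup actually writes.
EXPECTED_FILES = (
    "config.json",
    "preprocessor_config.json",
    "model.safetensors",
    "optimizer.pt",
    "scheduler.pt",
    "rng_state.pth",
    "trainer_state.json",
    "training_args.bin",
)


def missing_checkpoint_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from checkpoint_dir."""
    present = {p.name for p in Path(checkpoint_dir).iterdir() if p.is_file()}
    return [name for name in expected if name not in present]
```

Running `missing_checkpoint_files("somefolder/checkpoint-198")` on the local folder and on the uploaded copy should show whether the files were never written or were written but not uploaded.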

  
  
Posted 3 months ago

Hi @<1730033904972206080:profile|FantasticSeaurchin8>
Is this only related to this
https://github.com/coqui-ai/Trainer/issues/7
Or is it a clearml sdk issue?

  
  
Posted 3 months ago

Hi @<1730033904972206080:profile|FantasticSeaurchin8> , can you provide a sample script that reproduces this behaviour?

  
  
Posted 3 months ago

So we have managed to get whole checkpoint folders to save by removing save_total_limit from the training args; this seems to save checkpoint folders with all files in them. However, we now have a ballooning server.

did discover this None
and wondering if there's some nuance in autotracking that needs to be circumvented
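Until the interaction with save_total_limit is understood, one stopgap for the growing storage is to prune old checkpoint folders yourself. The sketch below only covers a plain directory of checkpoint-<step> folders (for example the local output_dir); it does not touch anything registered on the ClearML server. The helper `checkpoints_to_prune` is hypothetical and only lists candidates, leaving the actual deletion to the caller.

```python
import re
from pathlib import Path

_STEP = re.compile(r"^checkpoint-(\d+)$")


def checkpoints_to_prune(output_dir, keep_last=2, best=None):
    """List checkpoint-<step> folders that could be deleted, oldest first.

    Keeps the `keep_last` highest-step folders and, if given, the folder
    named by `best` (e.g. trainer.state.best_model_checkpoint).
    """
    found = []
    for path in Path(output_dir).iterdir():
        match = _STEP.match(path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    found.sort()  # ascending by step number
    keep = {path.name for _, path in found[-keep_last:]} if keep_last > 0 else set()
    if best is not None:
        keep.add(Path(best).name)
    return [path for _, path in found if path.name not in keep]
```

Each returned path can then be removed with `shutil.rmtree(path)` once you have confirmed the kept checkpoints are complete.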

  
  
Posted 3 months ago

But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you

  
  
Posted 3 months ago

None ?

  
  
Posted 3 months ago

console output:

clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
  
  
Posted 3 months ago

hmm i can probably provide snippets

  
  
Posted 3 months ago

queued with:

task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
    packages=package_list,
    docker="docker_gpu_image",
    docker_args=["--network=host"],
)
task.output_uri = "filer_server"
Task.enqueue(task, queue_name="training")
  
  
Posted 3 months ago

So turning report_to="tensorboard" off seemed to solve the issue, as in the training run now saves checkpoints as you would expect. That doesn't seem like desired behavior.
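For reference, one way to switch the reporting off is to pass report_to="none", which transformers documents as disabling all logging integrations. The thread does not say which value was actually used, so treat this as an assumption. The fragment below shows only the relevant keyword arguments as a plain dict.

```python
# Keyword arguments for transformers.TrainingArguments. Only report_to
# differs from the training function posted earlier in this thread.
training_kwargs = dict(
    output_dir="somefolder",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",  # assumption: "none" turns off TensorBoard and other integrations
)

# Usage (requires transformers): TrainingArguments(**training_kwargs)
```

Note that this also removes the TensorBoard scalars from the run, so it works around the problem rather than fixing it.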

  
  
Posted 3 months ago