it would seem they are related but i cant see the further details of this bug. Either doing a manual artefact upload with task or turning tensor board tracking off in the hugging face trainer both seemed to enable json tracking within the checkpoints. But I would have thought the tensorboard behavior wasnt desired.
called with:
task = Task.init(
project_name=project_name, task_name=task_name, output_uri="fileserver_address"
)
task.connect(config)
checkpoint = config.get("model_path")
image_processor = AutoImageProcessor.from_pretrained(
checkpoint,
num_labels=config.get("class_number"),
)
best_model = training(checkpoint, image_processor)
you can see here its missing some files
need to work out if i need to reconfigure something and re-train or are my files (actual model tensor) recoverable
default_outputdir in the conf is set to the filerserver address same as above pointing
training function:
def training(checkpoint, image_processor):
data_test_train, labels, label_to_id, id_to_label = pre_process()
model = AutoModelForImageClassification.from_pretrained(
checkpoint,
num_labels=len(labels),
id2label=id_to_label,
label2id=label_to_id,
),
)
def metrics(eval_pred):
metric_val = config.get("eval_metric")
metric = evaluate.load(metric_val)
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=1)
if metric_val == "accuracy":
return metric.compute(predictions=predictions, references=labels)
else:
return metric.compute(
predictions=predictions, references=labels, average="weighted"
)
data_collator = DefaultDataCollator()
training_args = TrainingArguments(
output_dir="somefolder",
remove_unused_columns=False,
eval_strategy="epoch",
save_strategy="epoch",
learning_rate=config.get("learning_rate"),
per_device_train_batch_size=config.get("train_batch_size"),
gradient_accumulation_steps=config.get("gradient_steps"),
per_device_eval_batch_size=config.get("eval_batch_size"),
num_train_epochs=config.get("epochs"),
warmup_ratio=config.get("warmup_ratio"),
logging_steps=config.get("logging_steps"),
save_total_limit=config.get("save_total"),
load_best_model_at_end=True,
report_to="tensorboard",
metric_for_best_model=config.get("eval_metric"),
push_to_hub=False,
)
trainer = Trainer(
model=model.to(device),
args=training_args,
data_collator=data_collator,
train_dataset=data_test_train["train"],
eval_dataset=data_test_train["test"],
tokenizer=image_processor,
compute_metrics=metrics,
)
print("training")
trainer.train()
print("save_model")
trainer.save_model()
best_model = trainer.state.best_model_checkpoint
print(best_model)
classifification_report(data_test_train["test"], model, best_model)
return best_model
Try save_safetensors=False
in TrainingArguments
. Not sure if clearML supports safetensors
further context. it saves the optimizer.pt , scheduler.pt , rng_state_path, and training_args.bin. but i cant locate the model.safetensors or meta jsons
Hi @<1730033904972206080:profile|FantasticSeaurchin8>
Is this only relates to this
https://github.com/coqui-ai/Trainer/issues/7
Or is it a clearml sdk issue?
Hi @<1730033904972206080:profile|FantasticSeaurchin8> , can you provide a sample script that reproduces this behaviour?
So we have managed to get whole checkpoint files to save by removing the save_total_limit
from training, this seems to save checkpoint folders with all files in it. however now we have a ballooning server.
did discover this None
and wondering if there's some nuance in autotracking that needs to be circumvented
But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you
console output:
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
hmm i can probably provide snippets
queued with:
task = Task.create(
project_name="name",
task_name="training",
repo="repo",
branch="branch",
script="training_script",
packages=package_list,
docker="docker_gpu_image",
docker_args=["--network=host"],
)
task.output_uri = "filer_server"
task.enqueue(task, "training")
so turning report_to="tensorboard",
off seemed to solve the issue...as in the training run saves checkpoints as you would expect. That doesnt seem like desired behavior..