It would seem they are related, but I can't see any further details of that bug. Doing a manual artefact upload with the task, or turning TensorBoard tracking off in the Hugging Face trainer, both seemed to enable JSON tracking within the checkpoints. But I would have thought the TensorBoard behaviour wasn't desired.
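By manual artefact upload I mean something along these lines (a rough sketch; the checkpoint folder path is just an example):

from clearml import Task

task = Task.current_task()
# register the checkpoint folder explicitly so the JSON/config files travel with it
task.upload_artifact(
    name="checkpoint-198",
    artifact_object="somemodel/checkpoint-198",  # local checkpoint folder
)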
Hi @<1730033904972206080:profile|FantasticSeaurchin8>
Is this only related to this
https://github.com/coqui-ai/Trainer/issues/7
Or is it a ClearML SDK issue?
So turning report_to="tensorboard" off seemed to solve the issue, as in the training run saves checkpoints as you would expect. That doesn't seem like desired behaviour...
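i.e. something like this (a sketch of what I mean; "none" is the usual way to disable the reporting integrations in TrainingArguments):

training_args = TrainingArguments(
    output_dir="somefolder",
    save_strategy="epoch",
    report_to="none",  # was "tensorboard"; with this the checkpoint JSONs get written as expected
    # ...rest of the arguments unchanged
)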
So we have managed to get whole checkpoints to save by removing save_total_limit from the training arguments; this saves checkpoint folders with all files in them. However, now we have a ballooning server.
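For clarity, the only change was this one argument (a sketch of the diff; keeping a limit is what normally prunes old checkpoint folders, which is presumably why removing it makes storage grow every epoch):

training_args = TrainingArguments(
    output_dir="somefolder",
    save_strategy="epoch",
    # save_total_limit=config.get("save_total"),  # removed: every epoch's checkpoint folder is now kept
    # ...rest unchanged
)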
I did discover this None and am wondering if there's some nuance in the auto-tracking that needs to be circumvented.
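e.g. something like this in the Task.init call (just a sketch of the idea, not tested; it should stop ClearML auto-uploading every torch.save'd checkpoint file, at the cost of having to register the wanted models manually):

task = Task.init(
    project_name=project_name,
    task_name=task_name,
    output_uri="fileserver_address",
    auto_connect_frameworks={"pytorch": False},  # disable the PyTorch save/load auto-binding
)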
But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you
Try save_safetensors=False in TrainingArguments. Not sure if ClearML supports safetensors.
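i.e. a sketch, assuming your transformers version has this flag:

training_args = TrainingArguments(
    output_dir="somefolder",
    save_strategy="epoch",
    save_safetensors=False,  # write pytorch_model.bin instead of model.safetensors
    # ...rest unchanged
)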
You can see here it's missing some files.
console output:
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
default_outputdir in the conf is set to the fileserver address, pointing to the same place as above.
queued with:
task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
    packages=package_list,
    docker="docker_gpu_image",
    docker_args=["--network=host"],
)
task.output_uri = "filer_server"
task.enqueue(task, "training")
called with:
task = Task.init(
    project_name=project_name, task_name=task_name, output_uri="fileserver_address"
)
task.connect(config)
checkpoint = config.get("model_path")
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    num_labels=config.get("class_number"),
)
best_model = training(checkpoint, image_processor)
training function:
def training(checkpoint, image_processor):
    data_test_train, labels, label_to_id, id_to_label = pre_process()
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id_to_label,
        label2id=label_to_id,
    )

    def metrics(eval_pred):
        metric_val = config.get("eval_metric")
        metric = evaluate.load(metric_val)
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        if metric_val == "accuracy":
            return metric.compute(predictions=predictions, references=labels)
        else:
            return metric.compute(
                predictions=predictions, references=labels, average="weighted"
            )

    data_collator = DefaultDataCollator()
    training_args = TrainingArguments(
        output_dir="somefolder",
        remove_unused_columns=False,
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=config.get("learning_rate"),
        per_device_train_batch_size=config.get("train_batch_size"),
        gradient_accumulation_steps=config.get("gradient_steps"),
        per_device_eval_batch_size=config.get("eval_batch_size"),
        num_train_epochs=config.get("epochs"),
        warmup_ratio=config.get("warmup_ratio"),
        logging_steps=config.get("logging_steps"),
        save_total_limit=config.get("save_total"),
        load_best_model_at_end=True,
        report_to="tensorboard",
        metric_for_best_model=config.get("eval_metric"),
        push_to_hub=False,
    )
    trainer = Trainer(
        model=model.to(device),
        args=training_args,
        data_collator=data_collator,
        train_dataset=data_test_train["train"],
        eval_dataset=data_test_train["test"],
        tokenizer=image_processor,
        compute_metrics=metrics,
    )
    print("training")
    trainer.train()
    print("save_model")
    trainer.save_model()
    best_model = trainer.state.best_model_checkpoint
    print(best_model)
    classification_report(data_test_train["test"], model, best_model)
    return best_model
Hmm, I can probably provide snippets.
Hi @<1730033904972206080:profile|FantasticSeaurchin8> , can you provide a sample script that reproduces this behaviour?
Further context: it saves the optimizer.pt, scheduler.pt, rng_state.pth, and training_args.bin, but I can't locate the model.safetensors or the meta JSONs.
I need to work out whether I have to reconfigure something and re-train, or whether my files (the actual model tensors) are recoverable.
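One way to check what actually made it to the server (a sketch using the SDK; the task ID is a placeholder):

from clearml import Task

task = Task.get_task(task_id="...")  # placeholder: the training task's ID
for model in task.models["output"]:
    print(model.name, model.url)  # if model.safetensors was uploaded it should show up here with its URL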