called with:
task = Task.init(
    project_name=project_name, task_name=task_name, output_uri="fileserver_address"
)
task.connect(config)
checkpoint = config.get("model_path")
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    num_labels=config.get("class_number"),
)
best_model = training(checkpoint, image_processor)
default_outputdir in the conf is set to the same fileserver address as above.
I need to work out whether I have to reconfigure something and re-train, or whether my files (the actual model tensors) are recoverable.
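For reference, if that conf setting refers to sdk.development.default_output_uri, the relevant part of clearml.conf would look roughly like this (a sketch only; the URL is a placeholder):
sdk {
    development {
        # assumption: the same fileserver address passed to Task.init(output_uri=...)
        default_output_uri: "http://<fileserver_address>:8081"
    }
}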
So we have managed to get whole checkpoint folders to save by removing save_total_limit from the training arguments; this saves checkpoint folders with all the files in them. However, now we have a ballooning server.
I did discover this None
and I'm wondering if there's some nuance in autotracking that needs to be circumvented.
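If it is the automatic framework binding, one way to circumvent it (a minimal sketch, assuming the standard Task.init signature; the project/task names and output URI are the placeholders from the snippet above) is to narrow what ClearML auto-tracks:
from clearml import Task

# assumption: disable the PyTorch save/load hook so checkpoint files are written
# only by the Hugging Face Trainer, while keeping TensorBoard scalar logging
task = Task.init(
    project_name=project_name,
    task_name=task_name,
    output_uri="fileserver_address",
    auto_connect_frameworks={"pytorch": False, "tensorboard": True},
)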
Further context: it saves optimizer.pt, scheduler.pt, rng_state.pth, and training_args.bin, but I can't locate the model.safetensors or the metadata JSONs.
So turning report_to="tensorboard" off seemed to solve the issue, as in the training run saves checkpoints as you would expect. That doesn't seem like desired behavior, though.
But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you
console output:
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
Hmm, I can probably provide snippets.
queued with:
task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
    packages=package_list,
    docker="docker_gpu_image",
    docker_args=["--network=host"],
)
task.output_uri = "filer_server"
task.enqueue(task, "training")
You can see here it's missing some files.
Hi @<1730033904972206080:profile|FantasticSeaurchin8>, can you provide a sample script that reproduces this behaviour?
It would seem they are related, but I can't see further details of this bug. Either doing a manual artefact upload with the task, or turning TensorBoard tracking off in the Hugging Face Trainer, seemed to enable JSON tracking within the checkpoints. But I would have thought the TensorBoard behavior wasn't desired.
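For reference, the manual artefact upload mentioned above was along these lines (a minimal sketch; the artifact name is made up and the checkpoint path is taken from the console output above):
# assumption: upload the whole checkpoint folder as one artifact so that
# model.safetensors and the config/preprocessor JSONs are kept together
task.upload_artifact(
    name="best_checkpoint",
    artifact_object="somemodel/checkpoint-198",
)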
Try save_safetensors=False in TrainingArguments. Not sure if ClearML supports safetensors.
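That change would look something like this (a sketch only, assuming a transformers version that exposes the save_safetensors flag; other arguments omitted):
# assumption: fall back to torch .bin checkpoints instead of safetensors, in case
# ClearML's auto-tracking only picks up the .bin format
training_args = TrainingArguments(
    output_dir="somefolder",
    save_safetensors=False,
)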
training function:
def training(checkpoint, image_processor):
    data_test_train, labels, label_to_id, id_to_label = pre_process()
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id_to_label,
        label2id=label_to_id,
    )

    def metrics(eval_pred):
        # compute the evaluation metric named in the config
        metric_val = config.get("eval_metric")
        metric = evaluate.load(metric_val)
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        if metric_val == "accuracy":
            return metric.compute(predictions=predictions, references=labels)
        else:
            return metric.compute(
                predictions=predictions, references=labels, average="weighted"
            )

    data_collator = DefaultDataCollator()
    # evaluate and checkpoint once per epoch; report_to="tensorboard" is the
    # setting discussed above
    training_args = TrainingArguments(
        output_dir="somefolder",
        remove_unused_columns=False,
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=config.get("learning_rate"),
        per_device_train_batch_size=config.get("train_batch_size"),
        gradient_accumulation_steps=config.get("gradient_steps"),
        per_device_eval_batch_size=config.get("eval_batch_size"),
        num_train_epochs=config.get("epochs"),
        warmup_ratio=config.get("warmup_ratio"),
        logging_steps=config.get("logging_steps"),
        save_total_limit=config.get("save_total"),
        load_best_model_at_end=True,
        report_to="tensorboard",
        metric_for_best_model=config.get("eval_metric"),
        push_to_hub=False,
    )
    trainer = Trainer(
        model=model.to(device),
        args=training_args,
        data_collator=data_collator,
        train_dataset=data_test_train["train"],
        eval_dataset=data_test_train["test"],
        tokenizer=image_processor,
        compute_metrics=metrics,
    )
    print("training")
    trainer.train()
    print("save_model")
    # writes the final model and image processor to output_dir
    trainer.save_model()
    best_model = trainer.state.best_model_checkpoint
    print(best_model)
    classifification_report(data_test_train["test"], model, best_model)
    return best_model
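If the weights file still isn't captured, one possible workaround (a sketch only, not part of the original script; assumes the ClearML task is reachable via Task.current_task() and that best_model is the checkpoint path returned above) is to register the weights manually:
from clearml import OutputModel, Task

# assumption: explicitly register the best checkpoint's weights so they are
# uploaded to the fileserver even if auto-tracking misses them
task = Task.current_task()
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename=f"{best_model}/model.safetensors")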
Hi @<1730033904972206080:profile|FantasticSeaurchin8>
Is this only related to this:
https://github.com/coqui-ai/Trainer/issues/7
or is it a ClearML SDK issue?