When you are running the base-task, are you providing any arguments to it?
Can you share the "Execution" tab and the Args tab of the base-task?
Could you maybe point me to an example of HPO that uses transformers? I can't find anything online; maybe I can compare my version. Many thanks
Thanks!
fyi: This section is not necessary if you have a clearml.conf file in ~/ :
` Task.set_credentials(
    api_host="",
    web_host="",
    files_host="",
    key='********************',
    secret='***********************'
) `
Let me check the code for a min
AgitatedDove14
` import os
os.environ['LC_ALL'] = "C.UTF-8"
os.environ['LANG'] = "C.UTF-8"
from clearml import Task
CLEARML_PROJECT = 'Vodafone Sentiment full'
CLEARML_TASK = 'HPO_BASE_TASK'
os.environ["CLEARML_PROJECT"] = CLEARML_PROJECT
os.environ["CLEARML_TASK"] = CLEARML_TASK
os.environ['MPLBACKEND'] = "TkAgg"
Task.set_credentials(
    api_host=" ",
    web_host=" ",
    files_host=" ",
    key='******************',
    secret='*********************'
)
task = Task.init(project_name=CLEARML_PROJECT, task_name=CLEARML_TASK)
import pandas as pd
import numpy as np
import re
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, get_scheduler
from datasets import load_metric, load_from_disk
from clearml import Dataset as ds
# fetch the preprocessed train/eval splits registered as ClearML Datasets
artifact_dir = ds.get(dataset_name="vodafone_dataset_preprocessed_train_dataloader", dataset_project="vodafone dataset_full").get_local_copy()
print(os.listdir(artifact_dir))
train_dataset = load_from_disk(os.path.join(artifact_dir, 'data'))
artifact_dir = ds.get(dataset_name="vodafone_dataset_preprocessed_test_dataloader", dataset_project="vodafone dataset_full").get_local_copy()
print(os.listdir(artifact_dir))
eval_dataset = load_from_disk(os.path.join(artifact_dir, 'data'))
# print(input_train_data, input_test_data)  # undefined in this script; these names only exist as injected argparse args
print(train_dataset.shape)
print(eval_dataset.shape)
example_configuration = {
    'epochs': 5,
    'lr': 0.00001,
    'optimizer': 'adam'
}
task.connect(example_configuration)
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment")
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment")
metric = load_metric('f1')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels, average='micro')
torch.cuda.empty_cache()
training_args = TrainingArguments(
    num_train_epochs=example_configuration['epochs'],
    logging_steps=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="epoch",
    learning_rate=example_configuration['lr'],
    save_steps=250,
    save_total_limit=2,
    output_dir='output/',
    report_to='clearml',
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train() `
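btw, if you want this base task to run on an agent rather than locally, one option (a sketch, assuming the same 'default' queue the optimizer uses below) is to enqueue it right after Task.init():
` # after task = Task.init(...) above:
task.execute_remotely(queue_name='default', exit_process=True)  # stops the local run and enqueues the task for an agent `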
sorry, what do you mean: "manually edit it back to your code"?
I mean you can run it with kubeflow, but it kind of ruins the auto detection there
You can, however, clone it and manually edit it back to your code; that would work
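(Programmatically, that clone step could look roughly like this; the task id is a placeholder, and the script/repo fields would still be edited in the UI before enqueuing:)
` from clearml import Task

cloned = Task.clone(source_task='<BASE_TASK_ID>', name='HPO_BASE_TASK (fixed script)')
Task.enqueue(cloned, queue_name='default') `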
AgitatedDove14 I followed the above format but it still does not work. I'm getting increasingly sure that this is related to huggingface's Trainer API. Can you share an example of using huggingface's Trainer API if possible? TIA
Okay great, so we do have the Args section there.
What do you have in the "Execution" tab?
The base task does have 'Task.init' and both have clearml installed
yes, that's the base task which ran successfully. It does have both
YummyFish22 can you point to the huggingface example you are using?
Could it be that the Args section of the task it clones does not have the "input_train_data" argument?
Now that I have shared this with you... I finally saw that kubeflow is injecting this argparse stuff
Notice the args will be set on the connect call, so the check on whether they are empty should come after
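(i.e., a minimal sketch, with hypothetical argument names, assuming an initialized task as in the base-task script above:)
` config = {'input_train_data': '', 'input_test_data': ''}
task.connect(config)  # clone-time overrides are applied here
# only after connect() does it make sense to check for empty values
if not config['input_train_data']:
    print('input_train_data is empty') `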
AgitatedDove14 even the base task does not have any Arg named "input_train_data". The base task is self-contained, i.e. it downloads the training/eval data directly and has direct access to it.
The base task is self-contained, i.e. it downloads the training/eval data directly and has direct access to it
I think this is the main issue. How come it does not catch it? Are you using argparse?
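(For reference, a minimal sketch of the auto-detection being discussed, with hypothetical project/argument names: clearml captures argparse arguments when Task.init() runs before parse_args():)
` import argparse
from clearml import Task

task = Task.init(project_name='examples', task_name='argparse capture')
parser = argparse.ArgumentParser()
parser.add_argument('--input_train_data', default='')
args = parser.parse_args()  # values (and clone-time overrides) are applied here `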
The one it is trying to execute, i.e. the one the Task shows as its Script Path
Hi YummyFish22
Looks like the task does not have a "Task.init" call in the main script (or an import of clearml)? Could that be the case?
Thanks for your answer. By main script, do you mean the base task or the agent?
also the HPO controller:
` import os
from clearml import Task
os.environ['MPLBACKEND'] = "TkAgg"
CLEARML_PROJECT = "Vodafone Sentiment full"
CLEARML_TASK = "HPO optimizer Controller"
os.environ["CLEARML_PROJECT"] = CLEARML_PROJECT
os.environ["CLEARML_TASK"] = CLEARML_TASK
Task.set_credentials(
    api_host=" ",
    web_host=" ",
    files_host=" ",
    key='88888888888',
    secret='888888888888888'
)
from clearml.automation import UniformParameterRange, UniformIntegerParameterRange, DiscreteParameterRange
from clearml.automation import HyperParameterOptimizer
from clearml.automation import GridSearch

task = Task.init(project_name=CLEARML_PROJECT,
                 task_name=CLEARML_TASK,
                 task_type=Task.TaskTypes.optimizer,
                 reuse_last_task_id=False)
# the task to be optimized must already be in the system so it can be cloned
# (assumed here: resolve it by the project/task names used for the base task)
base_task = Task.get_task(project_name=CLEARML_PROJECT, task_name='HPO_BASE_TASK').id

optimizer = HyperParameterOptimizer(
    # specifying the task to be optimized
    base_task_id=base_task,
    # setting the hyper-parameters to optimize
    hyper_parameters=[
        UniformIntegerParameterRange('General/epochs', min_value=2, max_value=12, step_size=5),
        UniformParameterRange('General/lr', min_value=0.000001, max_value=0.0001, step_size=0.002),
    ],
    # setting the objective metric we want to maximize/minimize
    objective_metric_title='f1',
    objective_metric_series='eval',
    objective_metric_sign='max',
    # setting the optimizer
    optimizer_class=GridSearch,
    # configuring optimization parameters
    execution_queue='default',
    max_number_of_concurrent_tasks=4,
    optimization_time_limit=60.,
    compute_time_limit=120,
    total_max_jobs=20,
    min_iteration_per_job=0,
    max_iteration_per_job=15,
)
optimizer.set_report_period(1)

# start the optimization process; this function returns immediately
optimizer.start()
# set the time limit for the optimization process (2 hours)
optimizer.set_time_limit(in_minutes=120.0)
# wait until the process is done (notice we are controlling the optimization process in the background)
optimizer.wait()
# optimization is completed, print the ids of the top performing experiments
top_exp = optimizer.get_top_experiments(top_k=3)
print([t.id for t in top_exp])
# make sure background optimization stopped
optimizer.stop() `
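(One thing worth double-checking, a sketch rather than a confirmed fix: objective_metric_title/series must match a scalar the base task actually reports in its Scalars tab. With report_to='clearml' the Trainer logs its eval metrics, but you can also report the objective explicitly from the base task:)
` from clearml import Logger

# hypothetical values - title/series here must match objective_metric_title/series above
Logger.current_logger().report_scalar(title='f1', series='eval', value=0.0, iteration=0) `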
This part is odd: SCRIPT PATH: tmp.7dSvBcyI7m
How did you end up with this random filename? How are you running this code?