sorry, what do you mean by "manually edit it back to your code"?
Okay great, so we do have the Args section there.
What do you have in the "Execution" tab?
Thanks!
fyi: This section is not necessary if you have a clearml.conf file in ~/ :
` Task.set_credentials(
api_host=" ",
web_host=" ",
files_host=" ",
key='********************',
secret='***********************'
) `
Let me check the code for a min
AgitatedDove14
` import os
os.environ['LC_ALL'] = "C.UTF-8"
os.environ['LANG'] = "C.UTF-8"
from clearml import Task
CLEARML_PROJECT = 'Vodafone Sentiment full'
CLEARML_TASK = 'HPO_BASE_TASK'
os.environ["CLEARML_PROJECT"] = CLEARML_PROJECT
os.environ["CLEARML_TASK"] = CLEARML_TASK
os.environ['MPLBACKEND'] = "TkAg"
Task.set_credentials(
api_host=" ",
web_host=" ",
files_host=" ",
key='******************',
secret='*********************'
)
task = Task.init(project_name=CLEARML_PROJECT, task_name=CLEARML_TASK)
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import TrainingArguments, Trainer,get_scheduler
from datasets import load_metric, load_dataset, Value, Dataset, load_from_disk
import re
import torch.nn as nn
from clearml import Dataset as ds
from datasets import load_from_disk
import os
artifact_dir = ds.get(dataset_name="vodafone_dataset_preprocessed_train_dataloader", dataset_project="vodafone dataset_full").get_local_copy()
print(os.listdir(artifact_dir))
train_dataset = load_from_disk(os.path.join(artifact_dir, 'data'))
artifact_dir = ds.get(dataset_name="vodafone_dataset_preprocessed_test_dataloader", dataset_project="vodafone dataset_full").get_local_copy()
print(os.listdir(artifact_dir))
eval_dataset = load_from_disk(os.path.join(artifact_dir, 'data'))
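# note: input_train_data / input_test_data are not defined anywhere above in this snippet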
print(input_train_data, input_test_data)
print(train_dataset.shape)
print(eval_dataset.shape)
example_configuration = {
'epochs': 5,
'lr': 0.00001,
'optimizer': 'adam'
}
task.connect(example_configuration)
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment")
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment")
metric = load_metric('f1')
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=1)
return metric.compute(predictions=predictions, references=labels, average='micro')
torch.cuda.empty_cache()
training_args = TrainingArguments(
num_train_epochs=example_configuration['epochs'],
logging_steps=5,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
evaluation_strategy="epoch",
learning_rate=example_configuration['lr'],
save_steps = 250,
save_total_limit = 2,
output_dir = 'output/',
report_to = 'clearml',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
compute_metrics=compute_metrics
)
trainer.train() `
The base task does have 'Task.init' and both have clearml installed
Could it be the Args section of the task it clones does not have the "input_train_data" argument?
I mean you can run it with kubeflow, but it kind of ruins the auto detection there
You can however clone and manually edit it back to your code, that would work
This part is odd: SCRIPT PATH: tmp.7dSvBcyI7m
How did you end with this random filename? how are you running this code?
When you are running the base-task, are you providing any arguments to it?
Can you share the "execution" Tab? and the Args tab of the base-task ?
yes, that's the base task which ran successfully. It does have both
Hi YummyFish22
Looks like the task does not have "Task.init" call on the main script (or an import of clearml)? could that be the case?
also hpo controller:
` import os
from clearml import Task
os.environ['MPLBACKEND'] = "TkAg"
CLEARML_PROJECT = "Vodafone Sentiment full"
CLEARML_TASK = "HPO optimizer Controller"
os.environ["CLEARML_PROJECT"] = CLEARML_PROJECT
os.environ["CLEARML_TASK"] = CLEARML_TASK
Task.set_credentials(
api_host=" ",
web_host=" ",
files_host=" ",
key='88888888888',
secret='888888888888888'
)
from clearml.automation import UniformParameterRange, UniformIntegerParameterRange, DiscreteParameterRange
from clearml.automation import HyperParameterOptimizer
from clearml.automation import GridSearch
from clearml import Task
task = Task.init(project_name=CLEARML_PROJECT,
task_name=CLEARML_TASK,
task_type=Task.TaskTypes.optimizer,
reuse_last_task_id=False)
optimizer = HyperParameterOptimizer(
# specifying the task to be optimized, task must be in the system already so it can be cloned
base_task_id=base_task,
# setting the hyper-parameters to optimize
hyper_parameters=[
UniformIntegerParameterRange('General/epochs', min_value=2, max_value=12, step_size=5),
UniformParameterRange('General/lr', min_value=0.000001, max_value=0.0001, step_size=0.002),
],
# setting the objective metric we want to maximize/minimize
objective_metric_title='f1',
objective_metric_series='eval',
objective_metric_sign='max',
# setting optimizer
optimizer_class=GridSearch,
# configuring optimization parameters
execution_queue='default',
max_number_of_concurrent_tasks=4,
optimization_time_limit=60.,
compute_time_limit=120,
total_max_jobs=20,
min_iteration_per_job=0,
max_iteration_per_job=15,
)
optimizer.set_report_period(1)
# start the optimization process
# this function returns immediately
optimizer.start()
# set the time limit for the optimization process (2 hours)
optimizer.set_time_limit(in_minutes=120.0)
# wait until process is done (notice we are controlling the optimization process in the background)
optimizer.wait()
# optimization is completed, print the top performing experiment ids
top_exp = optimizer.get_top_experiments(top_k=3)
print([t.id for t in top_exp])
# make sure background optimization stopped
optimizer.stop() `
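Note: `base_task` is not defined in the controller snippet above; it needs to be the ID of the already-registered base task. A sketch of one way to look it up (project/task names taken from the base task above, adjust as needed):
` base_task = Task.get_task(project_name='Vodafone Sentiment full', task_name='HPO_BASE_TASK').id `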
AgitatedDove14 even the base task does not have any Arg named "input_train_data". The base task is self-contained, i.e. it downloads the training/eval data directly and has direct access to it.
Notice the args will be set on the `connect` call, so the check on whether they are empty should come after it
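Roughly like this (a sketch; the parameter name is just the one from this discussion):
` config = {'input_train_data': ''}
task.connect(config)  # values coming from the UI / HPO override the dict here
# only check emptiness after connect(), otherwise you always see the local default
if not config['input_train_data']:
    print('no input_train_data passed in') `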
Now that I have shared this with you... I finally saw that kubeflow is injecting this argparse stuff
AgitatedDove14 I followed the above format but it still does not work. I'm getting increasingly sure that this is related to HuggingFace's Trainer API. Can you share an example of using HuggingFace's Trainer API if possible? TIA
Thanks for your answer. By main script, you mean the base task or the agent?
YummyFish22 can you point to the huggingface example you are using?
The base task is self-contained, i.e. it downloads the training/eval data directly and has direct access to it
I think this is the main issue, how come it does not catch it? Are you using argparse?
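For reference, ClearML picks argparse arguments up automatically and lists them under the task's Args section, as long as Task.init is called before parse_args, roughly like this (the argument name here is just the one from this thread):
` from clearml import Task
import argparse

task = Task.init(project_name='Vodafone Sentiment full', task_name='HPO_BASE_TASK')

parser = argparse.ArgumentParser()
parser.add_argument('--input_train_data', default='')
args = parser.parse_args()  # logged under the Args section, and overridable by the HPO clones `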
The one it is trying to execute, i.e. what the Task shows as the Script Path
Could you maybe point me to an example of HPO that uses transformers? Can't find anything online; maybe I can compare it with my version. Many thanks