So we have managed to get whole checkpoint files to save by removing save_total_limit from training; this seems to save checkpoint folders with all the files in them. However, now we have a ballooning server.
I did discover this bug report and am wondering if there's some nuance in the auto-tracking that needs to be circumvented.
It would seem they are related, but I can't see the further details of that bug. Either doing a manual artefact upload with the task, or turning TensorBoard tracking off in the Hugging Face Trainer, seemed to enable JSON tracking within the checkpoints. But I would have thought the TensorBoard behavior wasn't the desired one.
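For context, the manual artefact upload route was along these lines (a minimal sketch; the artifact name and checkpoint path are placeholders):
from clearml import Task

task = Task.current_task()  # the task handle from Task.init below

# Upload the whole checkpoint folder as a task artifact
# (ClearML packages a folder path into an archive before uploading).
task.upload_artifact(
    name="checkpoint-198",
    artifact_object="somemodel/checkpoint-198",
)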
The task is initialized with:
from clearml import Task
from transformers import AutoImageProcessor

task = Task.init(
    project_name=project_name, task_name=task_name, output_uri="fileserver_address"
)
task.connect(config)
checkpoint = config.get("model_path")
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    num_labels=config.get("class_number"),
)
best_model = training(checkpoint, image_processor)
In the UI, in the dashboard. I know I could create my own custom plot and track it, but it seems odd not to have epoch as a configurable option.
So turning report_to="tensorboard" off seemed to solve the issue, as in the training run saves checkpoints as you would expect. That doesn't seem like desired behavior...
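In other words, something along these lines in the TrainingArguments (a sketch; output_dir mirrors the folder from the console output, and save_strategy is an assumption):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="somemodel",   # checkpoints land in somemodel/checkpoint-*
    save_strategy="epoch",    # assumption
    # save_total_limit=2,     # removing this let full checkpoint folders be written
    report_to="none",         # instead of report_to="tensorboard"
)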
I need to work out whether I have to reconfigure something and re-train, or whether my files (the actual model tensors) are recoverable.
console output:
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
This issue was solved by adding task.output_uri = "fileserver" in the scheduling script, but for some reason this does not work when set in the Task.create call in the same script; it has to be set afterwards. It also doesn't work when set in the training script, so there must be some overriding going on that I'm not aware of.
queued with:
from clearml import Task

task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
    packages=package_list,
    docker="docker_gpu_image",
    docker_args=["--network=host"],
)
task.output_uri = "fileserver_address"
Task.enqueue(task, queue_name="training")
default_output_uri in the conf is set to the same fileserver address as above.
training function:
def training(checkpoint, image_processor):
    data_test_train, labels, label_to_id, id_to_label = pre_process()
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id_to_label,
        label2id=label_to_id,
    )
    def metrics(eval_pred):
        metric_val = config.get("eval_metric")
        metric = evaluate.load(metric_val)
        predictions, labels = eval_pred
        ...
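For completeness, a sketch of how training() presumably continues after the metrics helper (the TrainingArguments values and dataset split names are assumptions; the rest follows the snippets above):
    # local imports only to keep this sketch self-contained
    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="somemodel",   # assumption: matches the folder in the console output
        save_strategy="epoch",
        report_to="none",         # per the workaround discussed above
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=data_test_train["train"],   # assumption: split names
        eval_dataset=data_test_train["test"],
        processing_class=image_processor,         # 'tokenizer=' on older transformers versions
        compute_metrics=metrics,
    )
    trainer.train()
    return trainer.model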
It is reachable for getting training data, as the image data is saved on the file server.
Further context: it saves optimizer.pt, scheduler.pt, rng_state.pth, and training_args.bin, but I can't locate the model.safetensors or the meta JSONs.
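One way to check which model files actually got registered on the task (a sketch; the task id is a placeholder):
from clearml import Task

t = Task.get_task(task_id="<training task id>")
for m in t.models["output"]:
    print(m.name, "->", m.url)   # shows where each auto-uploaded file ended up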
I ran into trouble with this. I found that for saving data you need to have it specified in the conf, even though, as far as I'm aware, setting it as part of a task is supposed to override that.
Further, I found that the server wasn't able to resolve itself as a destination without providing an alias for the server name in the server-side docker setup.
Finally, when it comes to saving artifacts, it seems this had to be set via task.output_uri and not in the create or init call :man-shrugging:
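So the pattern that ended up working for me looks roughly like this (a sketch; names and the file server address are placeholders):
from clearml import Task

task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
)
# the destination only took effect when set on the task object *after* create
task.output_uri = "http://fileserver_address:8081"
Task.enqueue(task, queue_name="training")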
Can you attach the log of the task where the ClearML SDK fails to resolve the output_uri?
The console output? Or something else?
This is how I'm attempting to access it with an id:
from clearml import InputModel

cl_model_id = config.get("model_id")
# model = Model(model_id=cl_model_id)
model = InputModel(model_id=cl_model_id)
checkpoint = model.get_local_copy()
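In case the id itself is the problem, models can also be looked up by project/name and pulled the same way (a sketch; the names are placeholders):
from clearml import Model

# query the model registry instead of using a hard-coded id
candidates = Model.query_models(
    project_name="multiclass-classifier",
    model_name="somemodel",
)
if candidates:
    checkpoint = candidates[0].get_local_copy()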
which logs are helpful? console output, fileserver, or api?
My main issue is that I can see the model artefacts are here: file:///root/.clearml/venvs-builds/3.11/task_repository/my_awesome_facility_model/checkpoint-33/scheduler.pt
which I believe is not persistent/retrievable with an artefact id.
This is how I'm initializing before calling my training function; this is inside my training_script:
task = Task.init(project_name=project_name, task_name=task_name, output_uri=True)
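# note: output_uri=True means "upload models to the default files_server from clearml.conf"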
sure!
So this is how I'm queuing the job:
task = Task.create(
    project_name="multiclass-classifier",
    task_name="training",
    repo="reponame_url",
    branch="branch_name",
    script="training_script_name",
    packages=package_list,
    docker="python:3.11",
    docker_args="--privileged",
)
Task.enqueue(task, queue_name="services")  # the services queue is the one with a remote worker
docker compose on server:
apiserver:
  command:
    - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  privileged: true
  restart: unless-stopped
  volumes:
    - ${LOGS_DIR}:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - ${FILESERVER_DATA_DIR}:/mnt/fileserver
  depends_on:
    - redis
    - mongo
    - elasticsearch
    - fileserver
  environment:
    CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
    C...
This task is currently running; I obfuscated some personal info.
local conf:
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp,
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "somekey", "secret_key": "somekey"}
}
# Default Task output_uri. if output_uri is not provided to Task.init, default_outp...
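The truncated comment at the end is the relevant setting: default_output_uri only kicks in when output_uri is not passed explicitly, e.g. (a sketch; the address is a placeholder):
from clearml import Task

# an explicit output_uri here overrides sdk.development.default_output_uri from clearml.conf;
# if it is omitted, the conf value is used instead
task = Task.init(
    project_name="multiclass-classifier",
    task_name="training",
    output_uri="http://fileserver_address:8081",
)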
I was away for a week. Was anyone able to come up with any solutions for this?