So we have managed to get whole checkpoint files to save by removing save_total_limit from training; this seems to save checkpoint folders with all the files in them. However, now we have a ballooning server.
I did discover this bug report and am wondering if there's some nuance in the auto-tracking that needs to be circumvented.
It would seem they are related, but I can't see the further details of that bug. Either doing a manual artefact upload with the task, or turning TensorBoard tracking off in the Hugging Face Trainer, seemed to enable JSON tracking within the checkpoints. But I would have thought the TensorBoard behavior wasn't the desired one.
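For context, the manual artefact upload route was along these lines (a minimal sketch; the artifact name and checkpoint path are placeholders):
from clearml import Task

task = Task.current_task()  # the task handle from Task.init below

# Upload the whole checkpoint folder as a task artifact
# (ClearML packages a folder path into an archive before uploading).
task.upload_artifact(
    name="checkpoint-198",
    artifact_object="somemodel/checkpoint-198",
)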
The task is initialized with:
from clearml import Task
from transformers import AutoImageProcessor

task = Task.init(
    project_name=project_name, task_name=task_name, output_uri="fileserver_address"
)
task.connect(config)
checkpoint = config.get("model_path")
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    num_labels=config.get("class_number"),
)
best_model = training(checkpoint, image_processor)
In the UI, in the dashboard. I know I could create my own custom plot and track it, but it seems odd not to have epoch as a configurable option.
So turning report_to="tensorboard" off seemed to solve the issue, as in the training run saves checkpoints as you would expect. That doesn't seem like desired behavior...
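In other words, something along these lines in the TrainingArguments (a sketch; output_dir mirrors the folder from the console output, and save_strategy is an assumption):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="somemodel",   # checkpoints land in somemodel/checkpoint-*
    save_strategy="epoch",    # assumption
    # save_total_limit=2,     # removing this let full checkpoint folders be written
    report_to="none",         # instead of report_to="tensorboard"
)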
I need to work out whether I have to reconfigure something and re-train, or whether my files (the actual model tensors) are recoverable.
console output:
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/optimizer.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/scheduler.pt
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/rng_state.pth
save_model
somemodel/checkpoint-198
clearml.Task - INFO - Completed model upload to file_server/training.e5f99149b9b/models/training_args.bin
This issue was solved by adding task.output_uri = "fileserver" in the scheduling script, but for some reason this does not work when set in the Task.create call in the same script; it has to be set afterwards. It also doesn't work when set in the training script, so there must be some overriding going on that I'm not aware of.
queued with:
from clearml import Task

task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
    packages=package_list,
    docker="docker_gpu_image",
    docker_args=["--network=host"],
)
task.output_uri = "fileserver_address"
Task.enqueue(task, queue_name="training")
default_output_uri in the conf is set to the same fileserver address as above.
training function:
def training(checkpoint, image_processor):
    data_test_train, labels, label_to_id, id_to_label = pre_process()
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id_to_label,
        label2id=label_to_id,
    )
    def metrics(eval_pred):
        metric_val = config.get("eval_metric")
        metric = evaluate.load(metric_val)
        predictions, labels = eval_pred
        ...
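For completeness, a sketch of how training() presumably continues after the metrics helper (the TrainingArguments values and dataset split names are assumptions; the rest follows the snippets above):
    # local imports only to keep this sketch self-contained
    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="somemodel",   # assumption: matches the folder in the console output
        save_strategy="epoch",
        report_to="none",         # per the workaround discussed above
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=data_test_train["train"],   # assumption: split names
        eval_dataset=data_test_train["test"],
        processing_class=image_processor,         # 'tokenizer=' on older transformers versions
        compute_metrics=metrics,
    )
    trainer.train()
    return trainer.model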
It is reachable for getting training data, as the image data is saved on the file server.
Further context: it saves optimizer.pt, scheduler.pt, rng_state.pth, and training_args.bin, but I can't locate the model.safetensors or the meta JSONs.
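One way to check which model files actually got registered on the task (a sketch; the task id is a placeholder):
from clearml import Task

t = Task.get_task(task_id="<training task id>")
for m in t.models["output"]:
    print(m.name, "->", m.url)   # shows where each auto-uploaded file ended up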
I ran into trouble with this. I found that for saving data you need to have it specified in the conf, even though, as far as I'm aware, setting it as part of a task is supposed to override that.
Further, I found that the server wasn't able to resolve itself as a destination without providing an alias for the server name in the server-side docker setup.
Finally, when it comes to saving artifacts, it seems this had to be set via task.output_uri and not in the create or init call :man-shrugging:
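So the pattern that ended up working for me looks roughly like this (a sketch; names and the file server address are placeholders):
from clearml import Task

task = Task.create(
    project_name="name",
    task_name="training",
    repo="repo",
    branch="branch",
    script="training_script",
)
# the destination only took effect when set on the task object *after* create
task.output_uri = "http://fileserver_address:8081"
Task.enqueue(task, queue_name="training")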
Can you attach the log of the task where the ClearML SDK fails to resolve the output_uri?
The console output? Or something else?
This is how I'm attempting to access it with an id:
from clearml import InputModel

cl_model_id = config.get("model_id")
# model = Model(model_id=cl_model_id)
model = InputModel(model_id=cl_model_id)
checkpoint = model.get_local_copy()
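In case the id itself is the problem, models can also be looked up by project/name and pulled the same way (a sketch; the names are placeholders):
from clearml import Model

# query the model registry instead of using a hard-coded id
candidates = Model.query_models(
    project_name="multiclass-classifier",
    model_name="somemodel",
)
if candidates:
    checkpoint = candidates[0].get_local_copy()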
which logs are helpful? console output, fileserver, or api?
My main issue is that I can see the model artefacts are here: file:///root/.clearml/venvs-builds/3.11/task_repository/my_awesome_facility_model/checkpoint-33/scheduler.pt
which I believe is not persistent/retrievable with an artefact id.
This is how I'm initializing before calling my training function; this is inside my training_script:
task = Task.init(project_name=project_name, task_name=task_name, output_uri=True)
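# note: output_uri=True means "upload models to the default files_server from clearml.conf"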
sure!
So this is how I'm queuing the job:
task = Task.create(
    project_name="multiclass-classifier",
    task_name="training",
    repo="reponame_url",
    branch="branch_name",
    script="training_script_name",
    packages=package_list,
    docker="python:3.11",
    docker_args="--privileged",
)
Task.enqueue(task, queue_name="services")  # the services queue is the one with a remote worker
docker compose on server:
apiserver:
  command:
    - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  privileged: true
  restart: unless-stopped
  volumes:
    - ${LOGS_DIR}:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - ${FILESERVER_DATA_DIR}:/mnt/fileserver
  depends_on:
    - redis
    - mongo
    - elasticsearch
    - fileserver
  environment:
    CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
    C...
This task is currently running; I obfuscated some personal info.
local conf:
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp,
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "somekey", "secret_key": "somekey"}
}
# Default Task output_uri. if output_uri is not provided to Task.init, default_outp...
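The truncated comment at the end is the relevant setting: default_output_uri only kicks in when output_uri is not passed explicitly, e.g. (a sketch; the address is a placeholder):
from clearml import Task

# an explicit output_uri here overrides sdk.development.default_output_uri from clearml.conf;
# if it is omitted, the conf value is used instead
task = Task.init(
    project_name="multiclass-classifier",
    task_name="training",
    output_uri="http://fileserver_address:8081",
)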
I was away for a week. Was anyone able to come up with any solutions for this?