Hi Everyone


SuccessfulKoala55 No, I have a local machine for development and a remote server for training. On that remote server I have 2 GPUs and installed clearml-agent. I prepared a simple example:

import torch
import torch.nn as nn
import torch.optim as optim
from accelerate import Accelerator
from clearml import Task

def main():
    task = Task.init(project_name="test", task_name="accelerate_basic_ex_locallaunch_acc_simple")
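    # NOTE: as I understand it, execute_remotely() enqueues this task and then
    # terminates the local process; under a multi-process launch every rank
    # reaches this call.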
    task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
    accelerator = Accelerator(log_with="clearml")  # For the ClearML tracker only
    accelerator.init_trackers(
        project_name="test",
        init_kwargs={"clearml": {"auto_connect_frameworks": False}})

    # Print diagnostics.
    print(f"[Process {accelerator.process_index}] Accelerator device: {accelerator.device}")
    print(f"[Process {accelerator.process_index}] (torch.cuda.device_count() = {torch.cuda.device_count()})")
    model = nn.Sequential(
        nn.Linear(10, 10),
        nn.ReLU(),
        nn.Linear(10, 1)
    )
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()
    x = torch.randn(128, 10)
    y = torch.randn(128, 1)
    
    # Manually move raw tensors to the proper device.
    x = x.to(accelerator.device)
    y = y.to(accelerator.device)
    
    # Use accelerator.prepare() to wrap the model, optimizer, and data.
    model, optimizer, x, y = accelerator.prepare(model, optimizer, x, y)
    
    # Simple training loop.
    epochs = 100
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        accelerator.backward(loss)
        optimizer.step()
        
        print(f"Epoch {epoch+1}/{epochs} Loss: {loss.item():.4f}")
        task.get_logger().report_scalar(title="Loss", series="train", value=loss.item(), iteration=epoch)
    
    task.close()

if __name__ == "__main__":
    main()

When I launch that code via accelerate launch --multi_gpu acc_simple.py
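
As far as I understand, accelerate launch --multi_gpu is a thin wrapper around torchrun (the traceback below goes through torch/distributed/run.py), so on my 2-GPU machine it should be roughly equivalent to:

torchrun --nproc_per_node 2 acc_simple.py

i.e. the whole script, including Task.init() and task.execute_remotely(), runs once per GPU.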

I received this stack trace on my machine:


ClearML Task: created new task id=b723094ef7424b73a90ce6a7bd40ea34
2025-02-04 09:36:48,082 - clearml.Task - INFO - No repository found, storing script code instead
2025-02-04 09:36:48,084 - clearml.Task - WARNING - Torch Distributed execution detected: Failed Detecting launch arguments, skipping
ClearML results page: 

Torch Distributed Local Rank 1 Task ID b723094ef7424b73a90ce6a7bd40ea34 detected
2025-02-04 09:36:48,286 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2025-02-04 09:36:48,351 - clearml.Task - INFO - Finished repository detection and package analysis
CLEARML-SERVER new package available: UPGRADE to v2.0.0 is recommended!
Release Notes:
### Breaking Changes

MongoDB major version was upgraded from v5.x to 6.x.
Please note that if your current ClearML Server version is smaller than v1.17 (where MongoDB v5.x was first used), you'll need to first upgrade to ClearML Server v1.17.
#### Upgrading to ClearML Server v1.17 from a previous version
- If using docker-compose, use the following docker-compose files:
  * [docker-compose file]()
  * [docker-compose file for Windows]()

### New Features

- New look and feel: Full light/dark themes ([clearml #1297]())

- New UI task creation options
  - Support bash as well as python scripts
  - Support file upload
- New UI setting for configuring cloud storage credentials with which ClearML can clean up cloud storage artifacts on task deletion. 
- Add UI scalar plots presentation of plots in sections grouped by metrics.
- Add UI Batch export plot embed codes for all metric plots in a single click.
- Add UI pipeline presentation of steps grouped into stages

### Bug Fixes
- Fix UI Model Endpoint's Number of Requests plot sometimes displays incorrect data
- Fix UI datasets page does not filter according to project when dataset is running 
- Fix UI task scalar legend does not change colors when smoothing is enabled 
- Fix queue list in UI Workers and Queues page does not alphabetically sort by queue display name 
- Fix queue display name is not searchable in UI Task Creation modal's queue field

Switching to remote execution, output log page 

ClearML Terminating local execution process - continuing execution remotely
Traceback (most recent call last):
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/accelerate_playground/acc_simple.py", line 56, in <module>
    main()
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/accelerate_playground/acc_simple.py", line 12, in main
    task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 3324, in execute_remotely
    Task.enqueue(task, queue_name=queue_name)
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 1698, in enqueue
    raise exception
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 1689, in enqueue
    res = cls._send(session=session, req=req)
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/backend_interface/base.py", line 107, in _send
    raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <400/706: tasks.enqueue/v1.0 (Failed adding task to queue since task is already queued: task=b723094ef7424b73a90ce6a7bd40ea34)> (queue=ac521419abfe467d94a50a3a6475f04e, task=b723094ef7424b73a90ce6a7bd40ea34, verify_watched_queue=False)
E0204 09:37:00.022000 1515676 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1515750) of binary: /home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/bin/python
Traceback (most recent call last):
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
    multi_gpu_launcher(args)
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
acc_simple.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-04_09:37:00
  host      : pc
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1515750)
  error_file: <N/A>
  traceback : To enable traceback see: 

============================================================

Then the remote training started, but only on one GPU.
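
If I read the trace correctly, every rank runs the script, so both processes call task.execute_remotely() and try to enqueue the same task id; rank 0 wins and rank 1 fails with the 400/706 "task is already queued" error. Below is a minimal sketch of the guard I'm considering. It assumes LOCAL_RANK is set by the launcher and that non-zero ranks simply attach to the task rank 0 created (the "Torch Distributed Local Rank 1 Task ID ... detected" log line suggests they do); I haven't confirmed this is the intended ClearML pattern.

import os

from clearml import Task

def init_task():
    # Every rank attaches to (or creates) the shared task.
    task = Task.init(project_name="test",
                     task_name="accelerate_basic_ex_locallaunch_acc_simple")
    # Assumption: only the main process should enqueue for remote execution,
    # since execute_remotely() also terminates the calling process.
    if os.environ.get("LOCAL_RANK", "0") == "0":
        task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
    return task

Even with that guard I'm not sure about the second symptom: if the agent restarts the task with plain python instead of accelerate launch, that would explain why the remote run only used one GPU, but that part is just my guess.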

  
  