SuccessfulKoala55 No, I have a local machine for development and a remote server for training. On that remote server I have 2 GPUs and clearml-agent installed. I prepared a simple example:
import torch
import torch.nn as nn
import torch.optim as optim
from accelerate import Accelerator
from clearml import Task

def main():
    task = Task.init(project_name="test", task_name="accelerate_basic_ex_locallaunch_acc_simple")
    task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
    accelerator = Accelerator(log_with="clearml")  # For the ClearML tracker only
    accelerator.init_trackers(
        project_name="test",
        init_kwargs={"clearml": {"auto_connect_frameworks": False}})
    # Print diagnostics.
    print(f"[Process {accelerator.process_index}] Accelerator device: {accelerator.device}")
    print(f"[Process {accelerator.process_index}] (torch.cuda.device_count() = {torch.cuda.device_count()})")
    model = nn.Sequential(
        nn.Linear(10, 10),
        nn.ReLU(),
        nn.Linear(10, 1)
    )
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()
    x = torch.randn(128, 10)
    y = torch.randn(128, 1)
    # Manually move raw tensors to the proper device.
    x = x.to(accelerator.device)
    y = y.to(accelerator.device)
    # Use accelerator.prepare() to wrap the model, optimizer, and data.
    model, optimizer, x, y = accelerator.prepare(model, optimizer, x, y)
    # Simple training loop.
    epochs = 100
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        accelerator.backward(loss)
        optimizer.step()
        print(f"Epoch {epoch+1}/{epochs} Loss: {loss.item():.4f}")
        task.get_logger().report_scalar(title="Loss", series="train", value=loss.item(), iteration=epoch)
    task.close()

if __name__ == "__main__":
    main()
When I launch that code via accelerate launch --multi_gpu acc_simple.py
I receive this stack trace on my machine:
ClearML Task: created new task id=b723094ef7424b73a90ce6a7bd40ea34
2025-02-04 09:36:48,082 - clearml.Task - INFO - No repository found, storing script code instead
2025-02-04 09:36:48,084 - clearml.Task - WARNING - Torch Distributed execution detected: Failed Detecting launch arguments, skipping
ClearML results page:
Torch Distributed Local Rank 1 Task ID b723094ef7424b73a90ce6a7bd40ea34 detected
2025-02-04 09:36:48,286 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2025-02-04 09:36:48,351 - clearml.Task - INFO - Finished repository detection and package analysis
CLEARML-SERVER new package available: UPGRADE to v2.0.0 is recommended!
Release Notes: [server release notes omitted]
Switching to remote execution, output log page
ClearML Terminating local execution process - continuing execution remotely
Traceback (most recent call last):
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/accelerate_playground/acc_simple.py", line 56, in <module>
main()
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/accelerate_playground/acc_simple.py", line 12, in main
task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 3324, in execute_remotely
Task.enqueue(task, queue_name=queue_name)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 1698, in enqueue
raise exception
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 1689, in enqueue
res = cls._send(session=session, req=req)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/backend_interface/base.py", line 107, in _send
raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <400/706: tasks.enqueue/v1.0 (Failed adding task to queue since task is already queued: task=b723094ef7424b73a90ce6a7bd40ea34)> (queue=ac521419abfe467d94a50a3a6475f04e, task=b723094ef7424b73a90ce6a7bd40ea34, verify_watched_queue=False)
E0204 09:37:00.022000 1515676 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1515750) of binary: /home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/bin/python
Traceback (most recent call last):
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
multi_gpu_launcher(args)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
distrib_run.run(args)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
acc_simple.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-04_09:37:00
host : pc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1515750)
error_file: <N/A>
traceback : To enable traceback see:
============================================================
Then remote training started, but only on one GPU.
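A minimal sketch of one possible guard, assuming the 400/706 "already queued" error comes from accelerate launch spawning two local processes that both call Task.init() and execute_remotely() on the same auto-attached task (LOCAL_RANK is set by torchrun / accelerate launch; the helper name maybe_init_clearml_task is only for illustration, not a verified fix):

import os
from clearml import Task

def maybe_init_clearml_task():
    # Only the local-rank-0 process creates/enqueues the ClearML task, so the
    # second worker never triggers tasks.enqueue on an already-queued task.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank != 0:
        return None
    task = Task.init(project_name="test",
                     task_name="accelerate_basic_ex_locallaunch_acc_simple")
    task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
    return task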
PanickyDolphin50 when you say the agent loads the accelerate conf from your local machine, what do you mean? Is that where the agent is running?
SuccessfulKoala55 Can you give me any advice or a workaround to run accelerate on a remote agent, please? 🥺
I see that there are two processes, but I don't understand how to properly log them and send both of them to the remote server.
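A small sketch of one way to keep the reporting on a single process, assuming the standard accelerator.is_main_process flag from Accelerate (illustrative only, not from the thread):

# Inside the training loop: only the main process reports to ClearML,
# so the two workers do not write duplicate scalars.
if accelerator.is_main_process:
    Task.current_task().get_logger().report_scalar(
        title="Loss", series="train", value=loss.item(), iteration=epoch)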
Hi PanickyDolphin50, can you please elaborate? What is this accelerate functionality?
I just want to use multi-GPU training for my model, and I use the HF Accelerate framework for that. It makes this very simple: just a few imports and model training is distributed across all GPUs. The problem, I think, is in the standard way of using that framework: you launch it via the CLI and pass your Python script as a parameter. When you launch it on the machine everything is fine, but when I try to launch a remote task on a ClearML agent, multi-GPU training simply doesn't work.
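For reference, a minimal sketch of one possible workaround (untested; the wrapper file name launch_remote.py and the assumption that accelerate is installed in the agent's environment are mine, not from the thread): enqueue the task from a single plain Python process, and only once it is running on the agent start accelerate launch as a subprocess, so the spawned training workers never call execute_remotely() themselves.

# launch_remote.py -- hypothetical wrapper, a sketch only.
import subprocess
from clearml import Task

task = Task.init(project_name="test",
                 task_name="accelerate_basic_ex_locallaunch_acc_simple")
# Locally this enqueues the task and terminates the process;
# on the agent the call is a no-op and execution continues below.
task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")

# Running on the remote agent now: start the real training on both GPUs.
# Assumes acc_simple.py no longer calls execute_remotely() itself.
subprocess.check_call(
    ["accelerate", "launch", "--multi_gpu", "--num_processes", "2", "acc_simple.py"]
)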