SuccessfulKoala55 No, I have a local machine for development and a remote server for training. On that remote server I have 2 GPUs and clearml-agent installed. I prepared a simple example:
import torch
import torch.nn as nn
import torch.optim as optim
from accelerate import Accelerator
from clearml import Task

def main():
    task = Task.init(project_name="test", task_name="accelerate_basic_ex_locallaunch_acc_simple")
    task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
    accelerator = Accelerator(log_with="clearml")  # For the ClearML tracker only
    accelerator.init_trackers(
        project_name="test",
        init_kwargs={"clearml": {"auto_connect_frameworks": False}})
    # Print diagnostics.
    print(f"[Process {accelerator.process_index}] Accelerator device: {accelerator.device}")
    print(f"[Process {accelerator.process_index}] (torch.cuda.device_count() = {torch.cuda.device_count()})")
    model = nn.Sequential(
        nn.Linear(10, 10),
        nn.ReLU(),
        nn.Linear(10, 1)
    )
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()
    x = torch.randn(128, 10)
    y = torch.randn(128, 1)
    # Manually move raw tensors to the proper device.
    x = x.to(accelerator.device)
    y = y.to(accelerator.device)
    # Use accelerator.prepare() to wrap the model, optimizer, and data.
    model, optimizer, x, y = accelerator.prepare(model, optimizer, x, y)
    # Simple training loop.
    epochs = 100
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        accelerator.backward(loss)
        optimizer.step()
        print(f"Epoch {epoch+1}/{epochs} Loss: {loss.item():.4f}")
        task.get_logger().report_scalar(title="Loss", series="train", value=loss.item(), iteration=epoch)
    task.close()

if __name__ == "__main__":
    main()
When I launch that code via accelerate launch --multi_gpu acc_simple.py
I receive this stack trace on my machine:
ClearML Task: created new task id=b723094ef7424b73a90ce6a7bd40ea34
2025-02-04 09:36:48,082 - clearml.Task - INFO - No repository found, storing script code instead
2025-02-04 09:36:48,084 - clearml.Task - WARNING - Torch Distributed execution detected: Failed Detecting launch arguments, skipping
ClearML results page:
Torch Distributed Local Rank 1 Task ID b723094ef7424b73a90ce6a7bd40ea34 detected
2025-02-04 09:36:48,286 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2025-02-04 09:36:48,351 - clearml.Task - INFO - Finished repository detection and package analysis
CLEARML-SERVER new package available: UPGRADE to v2.0.0 is recommended!
Release Notes: [server release notes omitted]
Switching to remote execution, output log page
ClearML Terminating local execution process - continuing execution remotely
Traceback (most recent call last):
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/accelerate_playground/acc_simple.py", line 56, in <module>
main()
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/accelerate_playground/acc_simple.py", line 12, in main
task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 3324, in execute_remotely
Task.enqueue(task, queue_name=queue_name)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 1698, in enqueue
raise exception
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/task.py", line 1689, in enqueue
res = cls._send(session=session, req=req)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/clearml/backend_interface/base.py", line 107, in _send
raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <400/706: tasks.enqueue/v1.0 (Failed adding task to queue since task is already queued: task=b723094ef7424b73a90ce6a7bd40ea34)> (queue=ac521419abfe467d94a50a3a6475f04e, task=b723094ef7424b73a90ce6a7bd40ea34, verify_watched_queue=False)
E0204 09:37:00.022000 1515676 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1515750) of binary: /home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/bin/python
Traceback (most recent call last):
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
multi_gpu_launcher(args)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
distrib_run.run(args)
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/apocrf/code/clearml_accelearte_playground/accelerate_playground/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
acc_simple.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-04_09:37:00
host : pc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1515750)
error_file: <N/A>
traceback : To enable traceback see:
============================================================
Then remote training started, but only on one GPU.
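A minimal sketch of one possible guard, assuming the 400/706 "already queued" error comes from accelerate launch spawning two local processes that both call Task.init() and execute_remotely() on the same auto-attached task (LOCAL_RANK is set by torchrun / accelerate launch; the helper name maybe_init_clearml_task is only for illustration, not a verified fix):

import os
from clearml import Task

def maybe_init_clearml_task():
    # Only the local-rank-0 process creates/enqueues the ClearML task, so the
    # second worker never triggers tasks.enqueue on an already-queued task.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank != 0:
        return None
    task = Task.init(project_name="test",
                     task_name="accelerate_basic_ex_locallaunch_acc_simple")
    task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")
    return task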
PanickyDolphin50 when you say the agent loads the accelerate conf from your local machine, what do you mean? Is that where the agent is running?
SuccessfulKoala55 Can you give me any advice or a workaround to run accelerate on a remote agent, please? 🥺
I see that there are two processes, but I don't understand how to properly log them and send both of them to the remote server.
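A small sketch of one way to keep the reporting on a single process, assuming the standard accelerator.is_main_process flag from Accelerate (illustrative only, not from the thread):

# Inside the training loop: only the main process reports to ClearML,
# so the two workers do not write duplicate scalars.
if accelerator.is_main_process:
    Task.current_task().get_logger().report_scalar(
        title="Loss", series="train", value=loss.item(), iteration=epoch)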
Hi PanickyDolphin50, can you please elaborate? What is this accelerate functionality?
I just want to use multi-GPU training for my model, and I use the HF Accelerate framework for that. It makes this very simple: just a few imports and model training is distributed across all GPUs. The problem, I think, is in the standard way of using that framework: you launch it via the CLI and pass your Python script as a parameter. When you launch it on the machine everything is fine, but when I try to launch a remote task on a ClearML agent, multi-GPU training simply doesn't work.
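For reference, a minimal sketch of one possible workaround (untested; the wrapper file name launch_remote.py and the assumption that accelerate is installed in the agent's environment are mine, not from the thread): enqueue the task from a single plain Python process, and only once it is running on the agent start accelerate launch as a subprocess, so the spawned training workers never call execute_remotely() themselves.

# launch_remote.py -- hypothetical wrapper, a sketch only.
import subprocess
from clearml import Task

task = Task.init(project_name="test",
                 task_name="accelerate_basic_ex_locallaunch_acc_simple")
# Locally this enqueues the task and terminates the process;
# on the agent the call is a no-op and execution continues below.
task.execute_remotely(queue_name="mls-3d-sr003-x2-h100")

# Running on the remote agent now: start the real training on both GPUs.
# Assumes acc_simple.py no longer calls execute_remotely() itself.
subprocess.check_call(
    ["accelerate", "launch", "--multi_gpu", "--num_processes", "2", "acc_simple.py"]
)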