Answered
Hello! Are You Able To Help Me Debug This Message?

Hello! Are you able to help me debug this message?

RuntimeError: unable to write to file </torch_622_543991917_4>: No space left on device (28)
2024-09-09 14:29:50,124 - clearml.reporter - WARNING - Exception encountered cleaning up the reporter: DataLoader worker (pid 670) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
2024-09-09 15:30:00
Process failed, exit code 1
  
  
Posted one month ago

Answers 13


Code to enqueue

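from clearml import Task

# Create a task that runs script.py inside the Ultralytics image.
task = Task.create(
    script="script.py",
    docker="ultralytics/ultralytics:latest",
    # docker run's shared-memory flag is --shm-size (with a hyphen).
    docker_args=["--network=host", "--ipc=host", "--shm-size=55G"],
)

# Task.enqueue is a class method: pass the task and the queue name.
Task.enqueue(task, queue_name="default")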
  
  
Posted one month ago

Running on K8s on AWS

  
  
Posted one month ago

I think you're right, but it looks like an infrastructure issue related to YOLO.

  
  
Posted one month ago

It did work on ClearML on-prem with docker_args=["--network=host", "--ipc=host"]

  
  
Posted one month ago

@CostlyOstrich36 I don't think it's related to disk, I think it's related to shm
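
One quick way to tell the two apart from inside the container is to compare the free space on /dev/shm with the free space on the root filesystem. A minimal sketch, assuming a Linux container where POSIX shared memory is mounted at /dev/shm; the helper name free_gib is made up for illustration:

```python
import os
import shutil


def free_gib(path):
    """Return free space at `path` in GiB, or None if the path does not exist."""
    if not os.path.exists(path):
        return None
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # On Linux, shared-memory files such as torch_<pid>_... live under /dev/shm.
    for path in ("/dev/shm", "/"):
        free = free_gib(path)
        label = "missing" if free is None else f"{free:.2f} GiB free"
        print(f"{path}: {label}")
```

If /dev/shm shows only a few tens of MiB while / has plenty of room, the "No space left on device" error is most likely about shared memory rather than disk.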

  
  
Posted one month ago

But I could be wrong

  
  
Posted one month ago

None

  
  
Posted one month ago

On prem is not K8s
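
That difference probably matters: on K8s the container is not started with a plain docker run, so flags such as --ipc=host may never reach it. The usual Kubernetes way to get a larger /dev/shm is a memory-backed emptyDir volume mounted at /dev/shm. Below is a sketch of that pod-spec fragment, built as a plain Python dict so it can be dumped to JSON or YAML. The container name, the 8Gi limit, and the way the fragment would get into the pod template used by the agent are all assumptions, not something confirmed in this thread:

```python
import json


def shm_pod_fragment(container_name="training", size_limit="8Gi"):
    """Build a pod-spec fragment that mounts a memory-backed volume at /dev/shm.

    container_name and size_limit are placeholder values for illustration.
    """
    return {
        "spec": {
            "volumes": [
                {
                    "name": "dshm",
                    # medium "Memory" makes Kubernetes back the volume with tmpfs.
                    "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
                }
            ],
            "containers": [
                {
                    "name": container_name,
                    "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
                }
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(shm_pod_fragment(), indent=2))
```

Keep in mind that whatever is written to a memory-backed volume counts against the container's memory limit, so the size limit should fit inside it.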

  
  
Posted one month ago

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/script.py", line 36, in <module>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
    fd, size = storage._share_fd_cpu_()
  File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 304, in wrapper
    return fn(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 374, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
RuntimeError: unable to write to file </torch_670_1874874997_0>: No space left on device (28)
    results = model.train(
  File "/ultralytics/ultralytics/engine/model.py", line 815, in train
    self.trainer.train()
  File "/ultralytics/ultralytics/engine/trainer.py", line 208, in train
    self._do_train(world_size)
  File "/ultralytics/ultralytics/engine/trainer.py", line 328, in _do_train
    self._setup_train(world_size)
  File "/ultralytics/ultralytics/engine/trainer.py", line 295, in _setup_train
    self.test_loader = self.get_dataloader(
  File "/ultralytics/ultralytics/models/yolo/detect/train.py", line 55, in get_dataloader
    return build_dataloader(dataset, batch_size, workers, shuffle, rank)  # return dataloader
  File "/ultralytics/ultralytics/data/build.py", line 135, in build_dataloader
    return InfiniteDataLoader(
  File "/ultralytics/ultralytics/data/build.py", line 39, in __init__
    self.iterator = super().__iter__()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1022, in __init__
    index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
  File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 103, in Queue
    return Queue(maxsize, ctx=self.get_context())
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 43, in __init__
    self._rlock = ctx.Lock()
  File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 68, in Lock
    return Lock(ctx=self.get_context())
  File "/opt/conda/lib/python3.10/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/opt/conda/lib/python3.10/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 28] No space left on device
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
    fd, size = storage._share_fd_cpu_()
  File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 304, in wrapper
    return fn(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 374, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
RuntimeError: unable to write to file </torch_630_2165375255_1>: No space left on device (28)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  
  
Posted one month ago

Setting ultralytics workers=0 seems to work as per the thread above!

  
  
Posted one month ago

Although that's not ideal as it turns off CPU parallelisation
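
A possible middle ground is to keep a few workers when /dev/shm has room and fall back to workers=0 only when it does not. A minimal sketch; the helper name pick_workers, the 1 GiB-per-worker budget, and the cap of 8 are arbitrary assumptions, not Ultralytics guidance:

```python
import os
import shutil


def pick_workers(shm_path="/dev/shm", gib_per_worker=1.0, max_workers=8):
    """Pick a DataLoader worker count from free shared memory.

    Returns 0 (load data in the main process) when shm_path is missing
    or has less than gib_per_worker GiB free.
    """
    if not os.path.exists(shm_path):
        return 0
    free_gib = shutil.disk_usage(shm_path).free / 2**30
    cpu_count = os.cpu_count() or 1
    return max(0, min(max_workers, cpu_count, int(free_gib // gib_per_worker)))


if __name__ == "__main__":
    workers = pick_workers()
    print(f"workers={workers}")
    # With Ultralytics installed, the value would be passed to train, e.g.
    # (model and dataset names are placeholders):
    #   from ultralytics import YOLO
    #   YOLO("yolov8n.pt").train(data="coco8.yaml", workers=workers)
```

This only chooses a number; it does not enlarge shared memory, so raising the /dev/shm size on the pod is still the real fix if full parallelism is needed.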

  
  
Posted one month ago

Hi @RattyBluewhale45, from the error it looks like there is no space left on the pod. Are you able to run this code manually?

  
  
Posted one month ago

Is on-prem also K8s? The question is whether you still get the same issue if you run the code on EKS without ClearML.

  
  
Posted one month ago