Code to enqueue:
from clearml import Task

task = Task.create(
    script="script.py",
    docker="ultralytics/ultralytics:latest",
    docker_args=["--network=host", "--ipc=host", "--shm_size=55G"],
)
task.enqueue(task, "default")
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/script.py", line 36, in <module>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protoco...
Thanks @<1523701205467926528:profile|AgitatedDove14> , will take a look
Setting agent.venvs_cache path back to ~/.clearml/venvs-cache seems to have done the trick!
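For reference, that corresponds to the venvs_cache section of the agent's clearml.conf (a minimal sketch, assuming the default config layout):
agent {
    venvs_cache: {
        # point venv caching back at the default location
        path: ~/.clearml/venvs-cache
    }
}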
Our current setup is one clearml agent per GPU on the same machine
Seems to work!
Trying this:
import os
import shutil

from clearml import Dataset

# config (a dict-like object) and model (an ultralytics YOLO instance) are defined earlier in the script
clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"), alias=config.get("dataset_alias")
)
dataset_dir = clearml_dataset.get_local_copy()
destination_dir = os.path.join("/datasets", os.path.basename(dataset_dir))
shutil.copytree(dataset_dir, destination_dir)
results = model.train(
    data=destination_dir + "/data.yaml", epochs=config.get("epochs"), device=0
)
How to replicate on ClearML:
task = Task.create(
    script="myscript.py",
    packages=["opencv-python==4.6.*", "ultralytics"],
    docker="nvcr.io/nvidia/pytorch:22.12-py3",
)
Contents of myscript.py:
from ultralytics import YOLO
Although that's not ideal as it turns off CPU parallelisation
But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you
Try save_safetensors=False in TrainingArguments. Not sure if ClearML supports safetensors
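Something like this (a minimal sketch, assuming a Hugging Face transformers Trainer setup; output_dir and the other values are placeholders):
from transformers import TrainingArguments

# save plain PyTorch checkpoints instead of safetensors files
training_args = TrainingArguments(
    output_dir="outputs",       # placeholder output directory
    save_safetensors=False,
)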
Setting ultralytics workers=0 seems to work as per the thread above!
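i.e. roughly like this (a minimal sketch; the checkpoint and data.yaml paths are placeholders):
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint
results = model.train(
    data="/datasets/my_dataset/data.yaml",  # placeholder dataset config
    epochs=10,
    device=0,
    workers=0,  # no dataloader worker processes, avoids the shm error
)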
Thank you so much for your help @<1523701205467926528:profile|AgitatedDove14> !
@<1523701070390366208:profile|CostlyOstrich36> I don't think it's related to disk, I think it's related to shm
On local I am able to import ultralytics in this docker image:
docker run --gpus 1 -it nvcr.io/nvidia/pytorch:22.12-py3
# pip install opencv-python==4.6.* ultralytics
# python
>>> from ultralytics import YOLO
>>>
What does ClearML do differently that leads to a failure here?
To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using Task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
pip install --pre torchvision --force-reinstall --index-url None
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
Full log for the failed clone
Hey, yes I can see machine statistics on the experiments themselves
I am trying task.create like so:
task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)