We are getting the dataset like this:
from clearml import Dataset

clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"), alias=config.get("dataset_alias")
)
dataset_dir = clearml_dataset.get_local_copy()
I think it might be related to the new run overwriting the dataset copy in this location
Our current setup is one clearml agent per GPU on the same machine
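For reference, that setup is roughly the following (a sketch, assuming the agents run in Docker mode and pull from a queue named default):

# one clearml-agent daemon per GPU, all on the same machine
clearml-agent daemon --queue default --gpus 0 --docker --detached
clearml-agent daemon --queue default --gpus 1 --docker --detached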
Seems to work!
Code to enqueue
from clearml import Task

task = Task.create(
    script="script.py",
    docker="ultralytics/ultralytics:latest",
    docker_args=["--network=host", "--ipc=host", "--shm-size=55G"],
)
Task.enqueue(task, queue_name="default")
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/script.py", line 36, in <module>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protoco...
Setting ultralytics workers=0 seems to work as per the thread above!
Although that's not ideal, as it turns off CPU parallelisation for data loading
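For context, this is roughly where the flag goes (a minimal sketch; the weights file and data config are placeholders):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder weights
# workers=0 keeps data loading in the main process, so no shm-backed
# multiprocessing queues are used, at the cost of CPU-parallel loading
model.train(data="data.yaml", epochs=10, workers=0)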
@<1717350332247314432:profile|WittySeal70> what's strange is I can import the package in the docker container when I run it outside of clearML
Final answer was
docker="ultralytics/ultralytics:latest",
docker_args=["--network=host", "--ipc=host"],
pip install ultralytics --no-deps
would also work. Is there a way to pass this to clearML?
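One way that might work for passing that extra install step through ClearML is the bash setup-script hook on the task (a sketch, assuming the same image as above; the script name is a placeholder):

from clearml import Task

task = Task.create(
    script="script.py",  # placeholder
    docker="ultralytics/ultralytics:latest",
    docker_args=["--network=host", "--ipc=host"],
    # runs inside the container before the task starts
    docker_bash_setup_script="pip install ultralytics --no-deps",
)
Task.enqueue(task, queue_name="default")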
@<1523701070390366208:profile|CostlyOstrich36> I don't think it's related to disk, I think it's related to shm
It did work on clearml on prem with docker_args=["--network=host", "--ipc=host"]
The original run completes successfully, it's only the runs cloned from the GUI which fail
WARNING:clearml_agent.helper.package.requirements:Local file not found [torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/py/dist/torch_tensorrt-1.3.0a0-cp38-cp38-linux_x86_64.whl], references removed
Container nvcr.io/nvidia/pytorch:22.12-py3
agent.package_manager.pip_version=""
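For reference, that override lives in the agent section of clearml.conf on the agent machine (a minimal sketch of just the relevant block):

agent {
    package_manager {
        # empty string so the agent does not pin/downgrade pip inside the container
        pip_version: ""
    }
}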
Thank you so much for your help @<1523701205467926528:profile|AgitatedDove14> !
As I get a bunch of these warnings in both of the clones that failed
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
Thank you for getting back to me
DEBUG Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      Traceback (most recent call last):
        File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_i...
Maybe it's related to this section?
WARNING:clearml_agent.helper.package.requirements:Local file not found [anaconda-anon-usage @ file:///croot/anaconda-anon-usage_1710965072196/work], references removed
It was pointing to a network drive before to avoid the local directory filling up
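If it matters, the directory that get_local_copy() writes to is governed by the SDK cache settings in clearml.conf (a sketch of just that block; the network-drive path is a placeholder):

sdk {
    storage {
        cache {
            # where dataset local copies are materialised
            default_base_dir: "/mnt/network_drive/clearml_cache"  # placeholder path
        }
    }
}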
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
This one seems to be compatible: nvcr.io/nvidia/pytorch:22.04-py3
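If it helps, pointing a task at that image from code looks roughly like this (a sketch; the script name and docker arguments are carried over from above as assumptions):

from clearml import Task

task = Task.create(script="script.py")  # placeholder
# run the task inside the NVIDIA PyTorch container that proved compatible
task.set_base_docker(
    docker_image="nvcr.io/nvidia/pytorch:22.04-py3",
    docker_arguments=["--network=host", "--ipc=host"],
)
Task.enqueue(task, queue_name="default")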