I can install the correct torch version with this command:
pip install --pre torchvision --force-reinstall --index-url
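(The index URL isn't shown above; as an illustration only, the command takes a PyTorch wheel index matching your CUDA version. The CUDA 11.8 index is assumed below, swap in whichever index, stable or nightly, you actually need:)
pip install --pre torchvision --force-reinstall --index-url https://download.pytorch.org/whl/cu118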
It did work on ClearML on-prem with docker_args=["--network=host", "--ipc=host"]
Code to enqueue:
from clearml import Task

# Create the task against the Ultralytics image and pass the docker flags
# the agent should use when it starts the container.
task = Task.create(
    script="script.py",
    docker="ultralytics/ultralytics:latest",
    docker_args=["--network=host", "--ipc=host", "--shm-size=55G"],
)
# Task.enqueue is a classmethod; pass the task object and the queue name.
Task.enqueue(task, queue_name="default")
@<1523701070390366208:profile|CostlyOstrich36> I don't think it's related to disk; I think it's related to shm
Setting Ultralytics workers=0 seems to work, as per the thread above!
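For context, here is a minimal sketch of what that looks like with the Ultralytics Python API (the model weights and dataset YAML are placeholders, not our actual setup):

from ultralytics import YOLO

# workers=0 disables the DataLoader worker processes, which is what removes
# the shared-memory (/dev/shm) pressure, at the cost of CPU parallelism.
model = YOLO("yolov8n.pt")
model.train(data="coco128.yaml", epochs=10, workers=0)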
It seems to find a CUDA 11 index, then it installs CUDA 12:
Torch CUDA 111 index page found, adding it
PyTorch: Adding index and installing `torch ==2.4.0.*`
Looking in indexes: ...
Collecting torch==2.4.0.*
  Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
  Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
  Using cached ...
...
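A quick way to confirm which CUDA build of torch actually got installed (an illustrative snippet, not the exact script from this thread):

import torch

print(torch.__version__)          # e.g. "2.4.0+cu121" indicates a CUDA 12 build
print(torch.version.cuda)         # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())  # False when the driver is too old for that build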
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using Task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
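In other words, the agent daemon is launched in docker mode, roughly like this (queue name and GPU index are placeholders for whatever your setup uses):

# --docker puts the agent in docker mode; the image set via Task.create(docker=...)
# is the one actually used for the task.
clearml-agent daemon --queue default --docker --gpus 0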
@<1523701070390366208:profile|CostlyOstrich36> same error now 😞
Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL:
Alternatively, go to:
to install a PyTo...
Isn't the problem that CUDA 12 is being installed?
I have set agent { cuda_version: 11.2 }
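i.e. in the agent's clearml.conf, something along these lines:

agent {
    # make the agent resolve packages for CUDA 11.2 instead of the auto-detected version
    cuda_version: 11.2
}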
Thank you for getting back to me
I am trying Task.create like so:
from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
We are getting the dataset like this:
from clearml import Dataset

# config is our own settings object holding the dataset id/alias
clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"),
    alias=config.get("dataset_alias"),
)
dataset_dir = clearml_dataset.get_local_copy()
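As I understand it, get_local_copy() hands back a folder inside the shared ClearML cache, so two runs on the same machine can end up pointing at the same path. A possible variant (just a sketch, not what we currently run) would be a mutable per-run copy:

import os
from clearml import Dataset

clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"), alias=config.get("dataset_alias")
)
# Hypothetical: copy the dataset into a folder keyed by the dataset id
# instead of reusing the shared cache location.
dataset_dir = clearml_dataset.get_mutable_local_copy(
    target_folder=os.path.join("datasets", clearml_dataset.id)
)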
Our current setup is one ClearML agent per GPU on the same machine.
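i.e. something like this on the box (the queue name is just an example; one daemon pinned to each GPU):

clearml-agent daemon --queue default --gpus 0 --detached
clearml-agent daemon --queue default --gpus 1 --detached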
Seems to work!
I have set agent.package_manager.pip_version="", which resolved that message.
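For reference, that's this bit of clearml.conf (an empty value drops the agent's default pin, which is where the "pip<20.2" below was coming from):

agent {
    package_manager {
        # empty string -> don't pin pip
        pip_version: ""
    }
}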
It's hanging at:
Installing collected packages: zipp, importlib-resources, rpds-py, pkgutil-resolve-name, attrs, referencing, jsonschema-specifications, jsonschema, certifi, urllib3, idna, charset-normalizer, requests, pyparsing, PyYAML, six, pathlib2, orderedmultidict, furl, pyjwt, psutil, python-dateutil, platformdirs, distlib, filelock, virtualenv, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.7.4 charset-normalizer-3.3.2 clearml-agent-1.8.1 distlib-0.3....
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
Collecting pip<20.2
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.0.2
    Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'pip'. No files were found to uninstall.
If I run nvidia-smi, it returns valid output and says the CUDA version is 11.2.
I think it might be related to the new run overwriting the data in this location.
Although that's not ideal, as it turns off CPU parallelisation.