If I run nvidia-smi it returns valid output and it says the CUDA version is 11.2
Thank you for getting back to me
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
It's hanging at
Installing collected packages: zipp, importlib-resources, rpds-py, pkgutil-resolve-name, attrs, referencing, jsonschema-specifications, jsonschema, certifi, urllib3, idna, charset-normalizer, requests, pyparsing, PyYAML, six, pathlib2, orderedmultidict, furl, pyjwt, psutil, python-dateutil, platformdirs, distlib, filelock, virtualenv, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.7.4 charset-normalizer-3.3.2 clearml-agent-1.8.1 distlib-0.3....
I am running the agent with clearml-agent daemon --queue training
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
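For anyone hitting the same mismatch: the quickest check is to query the host driver directly and compare it against the minimum driver the NGC image documents (530.30 in the error above). A minimal sketch, assuming nvidia-smi is on the PATH:

    import subprocess

    # Ask the NVIDIA driver for its version string; NGC images state the
    # minimum driver they were built for and refuse to start (or fall back
    # to compatibility mode) on older drivers.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("host driver:", out.stdout.strip())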
It seems to find CUDA 11, then it installs CUDA 12
Torch CUDA 111 index page found, adding ` `
PyTorch: Adding index ` ` and installing `torch ==2.4.0.*`
Looking in indexes: , ,
Collecting torch==2.4.0.*
Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
Using cached
...
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04") to achieve running both the agent and the deployment on the same machine; adding --network=host to the run arguments solved it!
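As a rough sketch of that setup (project and task names below are made up; docker_args is assumed to be passed through to docker run; the queue name matches the daemon command above):

    from clearml import Task

    # Create a task that the agent (running with --docker) will execute
    # inside the CUDA 11 runtime image; --network=host lets the container
    # reach services bound on the host machine.
    task = Task.create(
        project_name="examples",      # hypothetical
        task_name="train",            # hypothetical
        docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
        docker_args="--network=host",
    )
    Task.enqueue(task, queue_name="training")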
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
I can install on the server with this command
I have set agent { cuda_version: 11.2 }
I have set agent.package_manager.pip_version="", which resolved that message
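For reference, those two settings would look roughly like this in the agent section of clearml.conf (a sketch based only on the values above, not a full config):

    agent {
        cuda_version: 11.2
        package_manager {
            # empty string = do not pin a specific pip version
            pip_version: ""
        }
    }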
@<1523701070390366208:profile|CostlyOstrich36> same error now 😞
Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL:
Alternatively, go to: to install a PyTo...
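A quick way to see what the task actually ends up with at runtime, using plain torch calls:

    import torch

    # torch.version.cuda is the CUDA toolkit the installed wheel was built
    # against; cuda.is_available() returns False (with the warning above)
    # when the host driver is too old for that toolkit.
    print("torch:", torch.__version__)
    print("built for CUDA:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())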
How are you getting:
beautifulsoup4 @ file:///croot/beautifulsoup4-split_1681493039619/work
This comes with the docker image ultralytics/ultralytics:latest
As I get a bunch of these warnings in both of the clones that failed
Resetting and enqueuing task which has built successfully also fails 😞
agent.package_manager.pip_version=""
DEBUG Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [21 lines of output]
    Traceback (most recent call last):
      File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_i...
Setting agent.venvs_cache.path back to ~/.clearml/venvs-cache seems to have done the trick!
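In clearml.conf terms that is just the following (a sketch; other venvs_cache keys left at their defaults):

    agent {
        venvs_cache {
            # restoring this path re-enables the venv cache between runs
            path: ~/.clearml/venvs-cache
        }
    }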
The original run completes successfully, it's only the runs cloned from the GUI which fail
In a cloned run with the new container ultralytics/ultralytics:latest, I get this error:
clearml_agent: ERROR: Could not install task requirements!
Command '['/root/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqs7171xfem.txt', '--extra-index-url', ' ', '--extra-index-url', ' ' returned non-zero exit status 1.
"Original PIP" is empty as for this task we can rely on the docker image to provide the python packages
Container nvcr.io/nvidia/pytorch:22.12-py3
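A hedged sketch of expressing that from code, assuming Task.set_packages accepts an empty list the way recent clearml versions do, so the agent falls back to the environment baked into the container (project/task names are made up):

    from clearml import Task

    task = Task.create(
        project_name="examples",                  # hypothetical
        task_name="train-in-ngc-container",       # hypothetical
        docker="nvcr.io/nvidia/pytorch:22.12-py3",
    )
    # With no listed requirements, the agent skips pip installs and relies
    # on the Python packages already shipped inside the container image.
    task.set_packages([])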