Eureka! This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
I have set `agent.package_manager.pip_version=""`, which resolved that message
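For reference, that override lives in `clearml.conf` on the agent machine; a minimal sketch (section names follow the agent config layout, and the comment reflects my understanding that an empty string disables the pip pin):

```
agent {
    package_manager {
        # Empty string = don't pin/downgrade pip, use whatever is available
        pip_version: ""
    }
}
```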
Container nvcr.io/nvidia/pytorch:22.12-py3
Final answer was:

```
docker="ultralytics/ultralytics:latest",
docker_args=["--network=host", "--ipc=host"],
```
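For context, those keyword arguments match the `docker`/`docker_args` parameters of `Task.create()`; a minimal sketch of where they go (project, task, and script names are placeholders, and this needs a live ClearML server, so it is illustrative only, not runnable as-is):

```
from clearml import Task

# Illustrative sketch: create a task the agent will run inside the
# ultralytics container, with host networking and shared IPC.
task = Task.create(
    project_name="examples",   # placeholder
    task_name="yolo-train",    # placeholder
    script="train.py",         # placeholder
    docker="ultralytics/ultralytics:latest",
    docker_args=["--network=host", "--ipc=host"],
)
```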
It seems to find CUDA 11, then it installs CUDA 12:
```
Torch CUDA 111 index page found, adding ``
PyTorch: Adding index `` and installing `torch ==2.4.0.*`
Looking in indexes: , ,
Collecting torch==2.4.0.*
  Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
  Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
  Using cached
...
```
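As a rough mental model of what the agent is doing in that log: it detects a CUDA version on the machine and derives a PyTorch wheel index tag from it (e.g. 11.8 → `cu118`). A toy sketch of that mapping (this is not the agent's actual code; only the URL pattern is the public PyTorch wheel index convention):

```python
def torch_index_for_cuda(cuda_version: str) -> str:
    """Illustrative: map a CUDA version like "11.8" to a PyTorch
    extra wheel index URL such as .../whl/cu118."""
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://download.pytorch.org/whl/{tag}"

print(torch_index_for_cuda("11.8"))  # → https://download.pytorch.org/whl/cu118
```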
What I don't understand is how to tell ClearML to install this version of PyTorch and torchvision, with cu118
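One approach (an assumption on my part, not something confirmed in this thread) is to point the agent at the cu118 wheel index in `clearml.conf` and pin the `+cu118` builds in the task's "Installed Packages" (version numbers below are illustrative):

```
# clearml.conf on the agent machine
agent {
    package_manager {
        # Extra index holding the +cu118 wheels
        extra_index_url: ["https://download.pytorch.org/whl/cu118"]
    }
}
```

Then in the task's Installed Packages, pin e.g. `torch==2.4.0+cu118` and the matching `torchvision` build.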
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
Thank you for getting back to me
This one seems to be compatible: [nvcr.io/nvidia/pytorch:22.04-py3](http://nvcr.io/nvidia/pytorch:22.04-py3)
But the process is still hanging and never proceeds to actually running the ClearML task
Is this what you had on the original manual execution?
Yes this installed packages list is what succeeded via manual submission to agent
I am running the agent with `clearml-agent daemon --queue training`
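Since the tasks specify a container image, the agent presumably needs to run in docker mode; a hedged sketch of the launch command (queue name from above, the image after `--docker` is just a fallback default and is my assumption):

```
clearml-agent daemon --queue training --docker nvcr.io/nvidia/pytorch:22.12-py3
```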
I can install on the server with this command
```
WARNING:clearml_agent.helper.package.requirements:Local file not found [torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/py/dist/torch_tensorrt-1.3.0a0-cp38-cp38-linux_x86_64.whl], references removed
```
@<1523701070390366208:profile|CostlyOstrich36> thank you for your help in advance
As I get a bunch of these warnings in both of the clones that failed
Thank you for your help @<1523701205467926528:profile|AgitatedDove14>
```
DEBUG Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      Traceback (most recent call last):
        File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_i...
```
If I run `nvidia-smi` it returns valid output, and it says the CUDA version is 11.2
```
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
       version 460.32.03 was detected and compatibility mode is UNAVAILABLE.

[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
```
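At bottom that mismatch is a numeric comparison of dotted driver versions (the container's minimum vs. the host's installed driver); a small illustrative check using the versions from the error above:

```python
def driver_satisfies(installed: str, required: str) -> bool:
    """Compare dotted NVIDIA driver versions numerically, field by field."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

# Host driver 460.32.03 vs. the container's minimum of 530.30:
print(driver_satisfies("460.32.03", "530.30"))  # → False
```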
It was pointing to a network drive before to avoid the local directory filling up
The original run completes successfully, it's only the runs cloned from the GUI which fail
Resetting and enqueuing a task which has built successfully also fails 😞
Hi @<1523701205467926528:profile|AgitatedDove14>
ClearML Agent 1.9.0
I think it might be related to the new run overwriting in this location
@<1523701070390366208:profile|CostlyOstrich36> same error now 😞
```
Environment setup completed successfully

Starting Task Execution:

/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL:
Alternatively, go to:
to install a PyTo...
```