
It seems to find CUDA 11, then it installs CUDA 12
Torch CUDA 111 index page found, adding ` `
PyTorch: Adding index ` ` and installing `torch ==2.4.0.*`
Looking in indexes: , ,
Collecting torch==2.4.0.*
Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
Using cached
...
@<1717350332247314432:profile|WittySeal70> What's strange is that I can import the package in the Docker container when I run it outside of ClearML
Full log for the failed clone
I can install the correct torch version with this command: `pip install --pre torchvision --force-reinstall --index-url None`
[2024-08-13 16:56:36,447] [9] [INFO] [clearml.service_repo] Returned 200 for workers.get_activity_report in 342ms
[2024-08-13 16:56:36,462] [9] [INFO] [clearml.service_repo] Returned 200 for workers.get_activity_report in 261ms
Hi @<1523701205467926528:profile|AgitatedDove14>
ClearML Agent 1.9.0
It was pointing to a network drive before to avoid the local directory filling up
The original run completes successfully, it's only the runs cloned from the GUI which fail
Maybe it's related to this section?
WARNING:clearml_agent.helper.package.requirements:Local file not found [anaconda-anon-usage @ file:///croot/anaconda-anon-usage_1710965072196/work], references removed
What I don't understand is how to tell ClearML to install this version of PyTorch and torchvision, with cu118
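One possible approach (a sketch only, assuming the standard ClearML SDK `Task.add_requirements()` API and that the agent can reach the public PyTorch cu118 wheel index, e.g. via `agent.package_manager.extra_index_url`) is to pin the CUDA-specific wheels on the task before `Task.init()`:

```python
from clearml import Task

# Sketch: pin cu118 builds explicitly so the agent installs them instead of
# auto-detecting whatever torch build is present locally. The version numbers
# below are examples, not taken from this thread.
Task.add_requirements("torch", "==2.4.0+cu118")
Task.add_requirements("torchvision", "==0.19.0+cu118")

# add_requirements() must be called before Task.init() to take effect
task = Task.init(project_name="example", task_name="train-cu118")
```

The agent still needs the cu118 index (https://download.pytorch.org/whl/cu118) configured as an extra index so pip can actually resolve those wheels.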
Seems to work!
I have set `agent.package_manager.pip_version=""`, which resolved that message
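For reference, a rough clearml.conf sketch of where that setting lives (only the `pip_version` key is from this thread; the `extra_index_url` entry is an assumption about how to expose the cu118 wheel index to the agent):

```
agent {
  package_manager {
    # empty string = don't pin pip, use whatever version is installed
    pip_version: ""
    # assumption: extra indexes the agent passes to pip, e.g. the cu118 wheel index
    extra_index_url: ["https://download.pytorch.org/whl/cu118"]
  }
}
```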
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
pip install --pre torchvision --force-reinstall --index-url None
But the process is still hanging, and not proceeding to actually running the clearml task
Although that's not ideal as it turns off CPU parallelisation
Thanks @<1523701205467926528:profile|AgitatedDove14> , will take a look
In a cloned run with the new container ultralytics/ultralytics:latest
I get this error:
clearml_agent: ERROR: Could not install task requirements!
Command '['/root/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqs7171xfem.txt', '--extra-index-url', '
', '--extra-index-url', '
returned non-zero exit status 1.
Is this what you had on the original manual execution?
Yes, this installed-packages list is what succeeded via manual submission to the agent
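If the clone keeps re-resolving requirements differently from the manual run, one option (a sketch; `Task.set_packages()`, the task ID, package versions and queue name here are assumptions, not taken from this thread) is to pin the clone to an explicit, known-good package list:

```python
from clearml import Task

# Sketch: clone the task that worked and overwrite its installed-packages
# section before enqueuing, so the agent installs exactly these versions.
source = Task.get_task(task_id="<manual-run-id>")   # placeholder ID
clone = Task.clone(source_task=source, name="pinned clone")
clone.set_packages([
    "torch==2.4.0",         # example versions, not copied from the failed log
    "torchvision==0.19.0",
    "clearml==1.16.3",
])
Task.enqueue(clone, queue_name="default")            # queue name is an assumption
```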
I can install on the server with this command
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
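For completeness, a minimal sketch of creating such a task with that container (project, repo, script and queue names are placeholders, not from this thread):

```python
from clearml import Task

# Sketch: create a task the agent will run inside the given CUDA container.
task = Task.create(
    project_name="example",
    task_name="remote run",
    repo="https://github.com/example/repo.git",   # placeholder repo
    script="train.py",
    docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
)
Task.enqueue(task, queue_name="default")          # placeholder queue name
```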
DEBUG Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      Traceback (most recent call last):
        File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_i...