This one seems to be compatible: [nvcr.io/nvidia/pytorch:22.04-py3](http://nvcr.io/nvidia/pytorch:22.04-py3)
If I run `nvidia-smi` it returns valid output and says the CUDA version is 11.2
I have set `agent.package_manager.pip_version=""`, which resolved that message
Solved that by setting docker_args=["--privileged", "--network=host"]
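For reference, a minimal sketch of passing those arguments when creating the task from Python, assuming they belong in the same Task.create call shown later in this thread; the project name, task name, and base image are placeholders (the image is the one mentioned above as compatible):

```python
from clearml import Task

# Sketch only: project/task names are placeholders; the base image is the
# nvcr.io one mentioned earlier in this thread as compatible.
task = Task.create(
    project_name="gpu-tests",
    task_name="test_gpu privileged run",
    script="test_gpu.py",
    packages=["torch"],
    docker="nvcr.io/nvidia/pytorch:22.04-py3",
    docker_args=["--privileged", "--network=host"],
)
```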
It seems to find CUDA 11, then it installs CUDA 12:
Torch CUDA 111 index page found, adding ``
PyTorch: Adding index `` and installing `torch ==2.4.0.*`
Looking in indexes: , ,
Collecting torch==2.4.0.*
Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
Using cached
...
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
I am trying Task.create like so:
from clearml import Task

# create a task from a standalone script, with torch as the only explicit requirement
task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
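For context, the script referenced here is not shown in the thread; a minimal, assumed sketch of what a GPU check like test_gpu.py could contain:

```python
# test_gpu.py -- hypothetical GPU sanity check (the actual script is not shown in this thread)
import torch

if __name__ == "__main__":
    print("torch version:", torch.__version__)
    print("built against CUDA:", torch.version.cuda)   # reveals a CUDA 11 vs CUDA 12 wheel
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
```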
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
I can install the correct torch version with this command: `pip install --pre torchvision --force-reinstall --index-url`
Isn't the problem that CUDA 12 is being installed?
Hey, yes I can see machine statistics on the experiments themselves
Seems to work!
@<1523701205467926528:profile|AgitatedDove14> if we go with the ultralytics case:
INSTALLED PACKAGES for working manual execution
absl-py==2.1.0
albucore==0.0.13
albumentations==1.4.14
anaconda-anon-usage @ file:///croot/anaconda-anon-usage_1710965072196/work
annotated-types==0.7.0
anyio==4.4.0
archspec @ file:///croot/archspec_1709217642129/work
astor==0.8.1
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
astunparse==1.6.3
attrs @ file:///croot/attrs_169571782329...
In a cloned run with the new container `ultralytics/ultralytics:latest` I get this error:
clearml_agent: ERROR: Could not install task requirements!
Command '['/root/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqs7171xfem.txt', '--extra-index-url', '', '--extra-index-url', '']' returned non-zero exit status 1.
The original run completes successfully, it's only the runs cloned from the GUI which fail
Resetting and enqueuing a task which has built successfully also fails 😞
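For what it's worth, the same reset-and-enqueue step can also be reproduced from Python; the task ID and queue name below are placeholders:

```python
from clearml import Task

# Placeholders: substitute the real task ID and agent queue name.
task = Task.get_task(task_id="<task-id>")
task.reset()  # may need force=True if the task already completed
Task.enqueue(task, queue_name="default")
```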
Thank you for your help @<1523701205467926528:profile|AgitatedDove14>
Hi @<1523701205467926528:profile|AgitatedDove14>
ClearML Agent 1.9.0
Setting `agent.venvs_cache.path` back to `~/.clearml/venvs-cache` seems to have done the trick!
It was pointing to a network drive before, to avoid the local directory filling up
Thank you so much for your help @<1523701205467926528:profile|AgitatedDove14> !
Thanks @<1523701205467926528:profile|AgitatedDove14> , will take a look
But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you
Try `save_safetensors=False` in `TrainingArguments`. Not sure if ClearML supports safetensors
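A minimal sketch of where that flag goes, assuming the Hugging Face Trainer is in use; output_dir is a placeholder:

```python
from transformers import TrainingArguments

# With save_safetensors=False the Trainer writes classic PyTorch .bin
# checkpoints instead of .safetensors files. output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="outputs",
    save_safetensors=False,
)
```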