To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
pip install --pre torchvision --force-reinstall --index-url
None
Isn't the problem that CUDA 12 is being installed?
I can install the correct torch version with this command: `pip install --pre torchvision --force-reinstall --index-url None`
OK, then just try the docker image I suggested 🙂
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
Just to make sure, run the code on the machine itself to verify that Python can actually detect the driver.
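A minimal sketch of such a check, assuming PyTorch is the framework in use. The helper name `gpu_report` is made up for illustration; it only reports what this Python environment can see and is not part of ClearML.

```python
import shutil


def gpu_report():
    """Collect basic facts about GPU visibility from this Python environment."""
    report = {
        # nvidia-smi ships with the NVIDIA driver; missing from PATH is a hint, not proof
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": False,
        "torch_version": None,
        "torch_built_for_cuda": None,
        "cuda_available": False,
    }
    try:
        import torch
    except ImportError:
        # torch is not installed here, so nothing more can be checked
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version the wheel was built against; None for CPU-only builds
    report["torch_built_for_cuda"] = torch.version.cuda
    # True only if torch can actually talk to a driver and a device
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `torch_built_for_cuda` shows a CUDA version but `cuda_available` is False, the installed wheel probably does not match the driver on that machine.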
I am running the agent with clearml-agent daemon --queue training
I am trying Task.create like so:
from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
@CostlyOstrich36 do you have any ideas?
I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via clearml.conf with agent.cuda_version.