I understand the idea, it makes sense. But it does not seem to work as intended. Why does it try to install a different pytorch? And why does the install fail in the agent when it works if I do it manually? The env that’s executing the task has the same pytorch.
Anyway, what should I do? So far my workers have not executed a single task; it always breaks with these env errors.
I think I understand what the issue is: you have installed the agent on your Python 3.8, but it is running and trying to install on Python 3.10.
To verify,
pip uninstall clearml-agent
python3.10 -m pip install clearml-agent
python3.10 -m clearml_agent daemon...
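A quick way to check the mismatch described above is to ask a given interpreter where it lives and whether it can see the agent package. This is only a sketch: the helper name `describe_interpreter` is made up for illustration, and the module name `clearml_agent` is taken from the paths in the traceback further down. Run it once with `python3.8` and once with `python3.10` and compare.

```python
import importlib.util
import sys


def describe_interpreter(module_name="clearml_agent"):
    """Report which interpreter is running and whether it can import module_name."""
    # find_spec returns None when a top-level module is not installed
    # for the interpreter that is executing this script.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If `module_found` is true under 3.8 but false under 3.10, the agent is installed for a different interpreter than the one the task environment is built with.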
AdventurousButterfly15
Despite having manually installed this torch version, during task execution agent still tries to install it somehow and fails:
Are you running the agent in venv mode or docker mode?
Notice that in docker mode it inherits the Python packages from the container, and adds/reinstalls missing packages. In venv mode it creates a new, clean venv (there is no way to inherit a venv; a venv can only inherit from system-wide installed packages).
The idea is that you cannot expect all jobs to use the exact same setup, so the agent takes care of it. Does that make sense? Which of the two setups (venv/docker) is more suitable for you?
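To illustrate the point about venv inheritance, here is a small sketch using the standard-library `venv` module as a stand-in (this is an assumption for illustration, not how the agent itself builds its environment). The helper name `create_venv` is hypothetical. A venv is either clean or allowed to see the base interpreter's system-wide packages; there is no option to inherit from another venv.

```python
import tempfile
import venv
from pathlib import Path


def create_venv(target_dir, inherit_system_packages):
    """Create a virtual environment and return its pyvenv.cfg settings as a dict."""
    # with_pip=False keeps the example fast; a real environment would want pip.
    builder = venv.EnvBuilder(
        system_site_packages=inherit_system_packages, with_pip=False
    )
    builder.create(target_dir)

    settings = {}
    for line in (Path(target_dir) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        clean = create_venv(Path(tmp) / "clean", inherit_system_packages=False)
        inherited = create_venv(Path(tmp) / "inherited", inherit_system_packages=True)
        print("clean venv sees system packages:", clean["include-system-site-packages"])
        print("inheriting venv sees system packages:", inherited["include-system-site-packages"])
```

Packages installed by hand into some other environment (for example a conda `base` env) are not visible to a clean venv, which would explain why a manually installed torch does not stop the agent from installing its own.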
Can you try to manually install it and see what you are getting?
python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
but it fails during env setup due to trying to install an obscure version of pytorch. Been trying to solve this for three days!
AdventurousButterfly15, it tries to resolve the correct pytorch version based on the CUDA version inside the container.
ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
It seems like it is trying to install pytorch for Python 3.10 with CUDA 11.6 support. This seems reasonable, no?
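The wheel filename itself says which interpreter it targets, so the "not a supported wheel" error can be sanity-checked by reading the tags. Below is a rough sketch with hypothetical helper names (`parse_wheel_tags`, `python_tag_matches`). It only compares the CPython tag against an interpreter version, which is a simplification of what pip really does (pip checks the full python/abi/platform tag triple).

```python
import sys


def parse_wheel_tags(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # Expected layout: name-version[-build]-python-abi-platform
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def python_tag_matches(filename, version_info=sys.version_info):
    """Return True if the wheel's python tag names this CPython major.minor."""
    expected = "cp{}{}".format(version_info[0], version_info[1])
    # A tag can be a dot-separated set such as "py2.py3".
    return expected in parse_wheel_tags(filename)["python"].split(".")


if __name__ == "__main__":
    wheel = "torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl"
    print(parse_wheel_tags(wheel))
    print("matches running interpreter:", python_tag_matches(wheel))
```

A `cp310` wheel is only installable by a CPython 3.10 pip, so if the pip that rejects it is really running under another Python version, the error is expected.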
Despite having manually installed this torch version, during task execution the agent still tries to install it somehow and fails:
INFO:clearml_agent.commands.worker:Downloading "" to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading "" to pip cache
Collecting torchvision==0.13.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torchvision-0.13.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
INFO:clearml_agent.commands.worker:Traceback (most recent call last):
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/commands/worker.py", line 2893, in install_requirements_for_package_api
    package_api.load_requirements(cached_requirements)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/venv.py", line 41, in load_requirements
    super(VirtualenvPip, self).load_requirements(requirements)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 63, in load_requirements
    self.install_from_file(path)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 37, in install_from_file
    self.run_with_env(('install', '-r', path) + self.install_flags(), cwd=self.cwd)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 90, in run_with_env
    return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/process.py", line 193, in check_call
    return self.call_subprocess(subprocess.check_call, *args, **kwargs)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/process.py", line 240, in call_subprocess
    return func(list(self), *args, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/boris/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqslrbfmwej.txt']' returned non-zero exit status 1.
AgitatedDove14 Is there a way to debug what it is doing during env setup?
I resolved the issues by making my own docker image and making all envs the same: the env that runs clearml-agent, the docker env that tasks run in, and the env that requests task execution (my client).
Also, manually installing this torch version succeeds:
(base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1
    Uninstalling torch-1.12.1:
      Successfully uninstalled torch-1.12.1
Successfully installed torch-1.12.1+cu116
But fails when agent tries to set up the env for task execution
I don’t understand. The current cuda version is 11.7. Installed pytorch version is 1.12.1. Torch can access GPUs, all is fine.
Why does it try to install a different torch version?
` (base) boris@adamastor:~$ nvidia-smi
Fri Oct 7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10 Off | 00000000:31:00.0 Off | Off |
| 0% 40C P8 23W / 150W | 4MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10 Off | 00000000:4B:00.0 Off | 0 |
| 0% 41C P8 23W / 150W | 4MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10 Off | 00000000:B1:00.0 Off | Off |
| 0% 44C P8 24W / 150W | 4MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10 Off | 00000000:CA:00.0 Off | Off |
| 0% 38C P8 21W / 150W | 4MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1875 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
(base) boris@adamastor:~$ python
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.1'
>>> torch.cuda.is_available()
True `
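The check above can be extended a little to also show which CUDA version the installed torch build was compiled against, which is a separate number from the "CUDA Version" that `nvidia-smi` reports for the driver. A minimal sketch, with a hypothetical helper name `torch_build_report`; it returns `None` when torch is not installed for the interpreter running it.

```python
import importlib.util


def torch_build_report():
    """Return version details of the installed torch build, or None if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch_version": torch.__version__,
        # CUDA version the wheel was built with; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    report = torch_build_report()
    if report is None:
        print("torch is not installed for this interpreter")
    else:
        for key, value in report.items():
            print(f"{key}: {value}")
```

Comparing `built_for_cuda` with the suffix in the wheel the agent picks (`+cu116` in the log) shows whether the manually installed torch and the agent-resolved torch are actually the same build.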