I Am Trying To Do A Remote Execution Of A Test Task, But It Fails During Env Setup Due To Trying To Install An Obscure Version Of Pytorch. Been Trying To Solve This For Three Days! The Script:

Answered

I am trying to do a remote execution of a test task, but it fails during env setup due to trying to install an obscure version of pytorch. Been trying to solve this for three days!

The script:
```
import clearml
from clearml import Task
task = Task.init(project_name='Adhoc', task_name='Task test')
task.execute_remotely(queue_name="gpu")

import torch

print('CUDA available', torch.cuda.is_available())
if torch.cuda.is_available():
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)

    # .cuda() returns a GPU copy and does not move the tensor in place
    a = a.cuda()
    b = b.cuda()
    print(a + b)

else:
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)
    print(a + b)

from clearml import Task, Dataset
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')

dataset_path = Dataset.get(
    dataset_name=f'{DATASET_NAME}_sample',
    dataset_project=CLEARML_PROJECT,
).get_local_copy()#.get_mutable_local_copy(DATASET_NAME)

print('Dataset path', dataset_path)
```
I am running clearml-agent and executing the script on the same machine.

Clearml agent command:
(base) boris@adamastor:~/clearml_config$ CLEARML_AGENT_DISABLE_SSH_MOUNT=1 clearml-agent daemon --queue gpu --gpus 0 --foreground
Executing task command:
(base) boris@adamastor:~/plant_age$ python test_task_execution.py
ClearML Task: created new task id=7837c696dc9d402582f9950801a36ef0
ClearML results page:
2022-10-04 17:53:59,661 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2022-10-04 17:54:01,082 - clearml.Task - INFO - Finished repository detection and package analysis
2022-10-04 17:54:10,661 - clearml - WARNING - Switching to remote execution, output log page
2022-10-04 17:54:10,661 - clearml - WARNING - Terminating local execution process
The environment has pytorch and cuda configured.
But execution fails with this:
```
Torch CUDA 116 download page found
Found PyTorch version torch==1.12.1 matching CUDA version 116
Torch CUDA 116 download page found
Found PyTorch version torchvision==0.13.1 matching CUDA version 116
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
Collecting torchvision==0.13.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torchvision-0.13.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.

clearml_agent: ERROR: Could not install task requirements!
Command '['/home/boris/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqs0zc6ymzj.txt']' returned non-zero exit status 1.
```
Can it just not mess with my torch installation? It’s the same environment!

Please help me, I am going crazy with this error :)

Attaching full execution log too

  				
Posted 2 years ago by AdventurousButterfly15

Answers 11

AdventurousButterfly15

Despite having manually installed this torch version, during task execution agent still tries to install it somehow and fails:

Are you running the agent in venv mode or docker mode?
Note that in docker mode the agent inherits the Python packages from the container and adds/reinstalls missing packages. In venv mode it creates a new, clean venv (there is no way to inherit a venv; a venv can only inherit from system-wide installed packages).

The idea is that you cannot expect all jobs to use the exact same setup, so the agent takes care of it. Does that make sense? Which of the two setups (venv/docker) is more suitable for you?
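
To illustrate the venv point, here is a minimal sketch using Python's standard `venv` module. It is not the agent's own code, and the helper name `create_venv` is made up for this example. It only shows that a fresh venv starts clean, and that the one thing it can optionally inherit is the system-wide site-packages of the interpreter that created it:

```python
import tempfile
import venv
from pathlib import Path


def create_venv(target_dir, inherit_system_packages):
    """Create a venv and return the settings recorded in its pyvenv.cfg."""
    builder = venv.EnvBuilder(
        system_site_packages=inherit_system_packages,
        with_pip=False,  # skip bootstrapping pip to keep the demo fast
    )
    builder.create(target_dir)

    settings = {}
    for line in (Path(target_dir) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


with tempfile.TemporaryDirectory() as tmp:
    clean = create_venv(Path(tmp) / "clean", inherit_system_packages=False)
    inheriting = create_venv(Path(tmp) / "inheriting", inherit_system_packages=True)
    print("clean venv:", clean["include-system-site-packages"])
    print("inheriting venv:", inheriting["include-system-site-packages"])
```

Packages installed into another venv or a conda env (such as `(base)` here) are not visible to either of these new environments, which is why a torch that works in the submitting shell does not carry over.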

  				
Posted 2 years ago by AgitatedDove14

Can you try to manually install it and see what you are getting?
python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl

  				
Posted 2 years ago by AgitatedDove14

In venv mode in this case

  				
Posted 2 years ago by AdventurousButterfly15

Despite having manually installed this torch version, during task execution agent still tries to install it somehow and fails:
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torchvision==0.13.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torchvision-0.13.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
INFO:clearml_agent.commands.worker:Traceback (most recent call last):
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/commands/worker.py", line 2893, in install_requirements_for_package_api
    package_api.load_requirements(cached_requirements)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/venv.py", line 41, in load_requirements
    super(VirtualenvPip, self).load_requirements(requirements)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 63, in load_requirements
    self.install_from_file(path)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 37, in install_from_file
    self.run_with_env(('install', '-r', path) + self.install_flags(), cwd=self.cwd)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 90, in run_with_env
    return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/process.py", line 193, in check_call
    return self.call_subprocess(subprocess.check_call, *args, **kwargs)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/process.py", line 240, in call_subprocess
    return func(list(self), *args, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/boris/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqslrbfmwej.txt']' returned non-zero exit status 1.

AgitatedDove14 Is there a way to debug what it is doing during env setup?

  				
Posted 2 years ago by AdventurousButterfly15

I will try soon

  				
Posted 2 years ago by AdventurousButterfly15

I resolved the issues by making my own docker image and making all envs the same:
- The env that runs clearml-agent
- The docker env for running tasks in
- The env that requests task execution (my client)

  				
Posted 2 years ago by AdventurousButterfly15

I think I understand what the issue is: you have installed the agent on your Python 3.8, but it is running and trying to install on Python 3.10.
To verify:
pip uninstall clearml-agent
python3.10 -m pip install clearml-agent
python3.10 -m clearml_agent daemon...
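
As a rough way to see why a wheel can be rejected as "not a supported wheel on this platform", the sketch below reads the tags embedded in the wheel filename and compares the Python tag with a given interpreter version. It is a deliberate simplification and only handles exact CPython tags such as `cp310`, whereas pip's real check also considers ABI and platform tags. The helper names are made up for this example:

```python
import sys


def wheel_tags(filename):
    """Return the (python, abi, platform) tags from a wheel filename."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # Layout: name-version[-build]-python-abi-platform; the last three
    # dash-separated fields are always the compatibility tags.
    return tuple(stem.split("-")[-3:])


def cpython_tag(version_info=None):
    """Build a tag such as 'cp310' for the given (major, minor) version."""
    major, minor = (version_info or sys.version_info)[:2]
    return f"cp{major}{minor}"


def matches_interpreter(filename, version_info=None):
    """True if the wheel's Python tag names this exact CPython version."""
    python_tag, _abi_tag, _platform_tag = wheel_tags(filename)
    # A wheel may carry several dot-separated Python tags, e.g. 'py2.py3'.
    return cpython_tag(version_info) in python_tag.split(".")


wheel = "torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl"
print(wheel_tags(wheel))
print("fits Python 3.8: ", matches_interpreter(wheel, (3, 8)))
print("fits Python 3.10:", matches_interpreter(wheel, (3, 10)))
print("fits this interpreter:", matches_interpreter(wheel))
```

A `cp310` wheel only installs into a Python 3.10 interpreter, so it is worth checking which interpreter pip is actually being run with at the point where the install fails.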

  				
Posted 2 years ago by AgitatedDove14

I understand the idea, it makes sense. But it does not seem to work as intended. Why does it try to install a different pytorch? And why does it fail when the same install works if I do it manually? The env that’s executing the task has the same pytorch.

Anyways, what should I do? So far my workers have not executed a single task, it always breaks with these env errors

  				
Posted 2 years ago by AdventurousButterfly15

but it fails during env setup due to trying to install an obscure version of pytorch. Been trying to solve this for three days!

AdventurousButterfly15 it tries to resolve the correct pytorch version based on the cuda inside the container

ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.

Seems like it is trying to install pytorch for Python 3.10 with CUDA 11.6 support, which seems reasonable, no?

  				
Posted 2 years ago by AgitatedDove14

```
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
(base) boris@adamastor:~$ python
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.1'
>>> torch.cuda.is_available()
True
```
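
Since several interpreters are involved in this thread (the `(base)` shell, the Python that runs clearml-agent, and the venv the agent builds), a small report like the one below can be run with each of them to compare what they actually see. This is only a diagnostic sketch: `describe_environment` is a made-up helper name, and the torch fields stay `None` where torch is not installed.

```python
import importlib.util
import platform
import sys


def describe_environment():
    """Collect the interpreter path, Python version and torch build info."""
    info = {
        "executable": sys.executable,
        "python": platform.python_version(),
        "torch": None,
        "torch_cuda": None,
    }
    # Only import torch if it is installed for this interpreter.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running it as, for example, `/home/boris/.clearml/venvs-builds/3.10/bin/python script.py` (the interpreter path from the error above, with `script.py` standing in for wherever the snippet is saved) shows which Python and which torch that environment really has.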

  				
Posted 2 years ago by AdventurousButterfly15

Also manually installing this torch version succeeds:
(base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1
    Uninstalling torch-1.12.1:
      Successfully uninstalled torch-1.12.1
Successfully installed torch-1.12.1+cu116
But it fails when the agent tries to set up the env for task execution.

  				
Posted 2 years ago by AdventurousButterfly15
