Remote execution of a test task fails during env setup while trying to install an obscure version of PyTorch

I am trying to do a remote execution of a test task, but it fails during env setup due to trying to install an obscure version of pytorch. Been trying to solve this for three days!

The script:
` import clearml
from clearml import Task
task = Task.init(project_name='Adhoc', task_name='Task test')
task.execute_remotely(queue_name="gpu")

import torch

print('CUDA available', torch.cuda.is_available())
if torch.cuda.is_available():
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)

    # .cuda() returns a GPU copy rather than moving the tensor in place,
    # so the result has to be reassigned
    a = a.cuda()
    b = b.cuda()
    print(a + b)

else:
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)
    print(a + b)

from clearml import Task, Dataset
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')

dataset_path = Dataset.get(
    dataset_name=f'{DATASET_NAME}_sample',
    dataset_project=CLEARML_PROJECT,
).get_local_copy()#.get_mutable_local_copy(DATASET_NAME)

print('Dataset path', dataset_path) `

I am running clearml-agent and executing the script on the same machine.

Clearml agent command:
(base) boris@adamastor:~/clearml_config$ CLEARML_AGENT_DISABLE_SSH_MOUNT=1 clearml-agent daemon --queue gpu --gpus 0 --foreground
Executing task command:
(base) boris@adamastor:~/plant_age$ python test_task_execution.py
ClearML Task: created new task id=7837c696dc9d402582f9950801a36ef0
ClearML results page:
2022-10-04 17:53:59,661 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2022-10-04 17:54:01,082 - clearml.Task - INFO - Finished repository detection and package analysis
2022-10-04 17:54:10,661 - clearml - WARNING - Switching to remote execution, output log page
2022-10-04 17:54:10,661 - clearml - WARNING - Terminating local execution process
The environment has pytorch and cuda configured.
But execution fails with this:
` Torch CUDA 116 download page found
Found PyTorch version torch==1.12.1 matching CUDA version 116
Torch CUDA 116 download page found
Found PyTorch version torchvision==0.13.1 matching CUDA version 116
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
Collecting torchvision==0.13.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torchvision-0.13.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.

clearml_agent: ERROR: Could not install task requirements!
Command '['/home/boris/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqs0zc6ymzj.txt']' returned non-zero exit status 1. `
Can it just not mess with my torch installation? It’s the same environment!

Please help me, I am going crazy with this error :)

Attaching full execution log too

  
  
Posted 2 years ago

Answers 11


I don’t understand. The current cuda version is 11.7. Installed pytorch version is 1.12.1. Torch can access GPUs, all is fine.
Why does it try to install a different torch version?
` (base) boris@adamastor:~$ nvidia-smi
Fri Oct 7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          Off  | 00000000:31:00.0 Off |                  Off |
|  0%   40C    P8    23W / 150W |      4MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          Off  | 00000000:4B:00.0 Off |                    0 |
|  0%   41C    P8    23W / 150W |      4MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10          Off  | 00000000:B1:00.0 Off |                  Off |
|  0%   44C    P8    24W / 150W |      4MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10          Off  | 00000000:CA:00.0 Off |                  Off |
|  0%   38C    P8    21W / 150W |      4MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1875      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
(base) boris@adamastor:~$ python
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.1'
>>> torch.cuda.is_available()
True `

  
  
Posted 2 years ago

AdventurousButterfly15

Despite having manually installed this torch version, during task execution agent still tries to install it somehow and fails:

Are you running the agent in venv mode or in docker mode?
Notice that in docker mode it inherits the Python packages from the container and adds or reinstalls missing packages. In venv mode it creates a new, clean venv (there is no way to inherit a venv; a venv can only inherit from system-wide installed packages).

The idea is that you cannot expect all jobs to use the exact same setup, so the agent takes care of it. Does that make sense? Which of the two setups (venv/docker) is more suitable for you?
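
To make the "clean venv" point concrete, here is a minimal sketch using only the standard-library `venv` module. It is not the agent's own code; it just shows that a freshly created venv records whether it may see the base interpreter's site-packages, and that the default is "no". The helper name `inherits_system_packages` is made up for this illustration. As far as I know, clearml-agent exposes a similar switch as `agent.package_manager.system_site_packages` in `clearml.conf`, but treat that as something to confirm in the docs.

```python
import tempfile
import venv
from pathlib import Path


def inherits_system_packages(system_site_packages: bool) -> bool:
    """Create a throwaway venv and report whether it is configured
    to see the base interpreter's site-packages."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        # with_pip=False keeps creation fast; only pyvenv.cfg is inspected.
        venv.create(
            env_dir,
            system_site_packages=system_site_packages,
            with_pip=False,
        )
        cfg = (env_dir / "pyvenv.cfg").read_text()

    for line in cfg.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "include-system-site-packages":
            return value.strip().lower() == "true"
    return False


if __name__ == "__main__":
    print("default venv inherits:", inherits_system_packages(False))
    print("opt-in venv inherits:", inherits_system_packages(True))
```

Note that "system-wide" here means the site-packages of the interpreter that created the venv, not whatever conda env or venv happened to be active in your shell.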

  
  
Posted 2 years ago

Also manually installing this torch version succeeds:
(base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1
    Uninstalling torch-1.12.1:
      Successfully uninstalled torch-1.12.1
Successfully installed torch-1.12.1+cu116
But it fails when the agent tries to set up the env for task execution.

  
  
Posted 2 years ago

I understand the idea, it makes sense. But it does not seem to work as intended. Why does it try to install a different pytorch? And why does it fail if it works if I do it manually? The env that’s executing the task has the same pytorch

Anyways, what should I do? So far my workers have not executed a single task, it always breaks with these env errors

  
  
Posted 2 years ago

but it fails during env setup due to trying to install an obscure version of pytorch. Been trying to solve this for three days!

AdventurousButterfly15 it tries to resolve the correct PyTorch version based on the CUDA inside the container

ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.

It seems like it is trying to install PyTorch for Python 3.10 with CUDA 11.6 support, which seems reasonable, no?
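
For reference, the Python and platform information is encoded in the wheel filename itself, so you can check by hand whether a given interpreter could accept it. The sketch below is a simplified illustration, not what pip actually runs: it only compares the CPython tag (`cp310`) against an interpreter version and ignores ABI and platform tags. The function names are invented for this example.

```python
import sys


def parse_wheel_tags(filename: str) -> dict:
    """Split a wheel filename into its components.

    Layout: name-version[-build]-python-abi-platform.whl
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def wheel_matches_interpreter(filename: str, version_info=sys.version_info) -> bool:
    """Simplified check: does the wheel's python tag name this CPython version?"""
    wanted = f"cp{version_info[0]}{version_info[1]}"
    # A python tag may list several alternatives separated by dots.
    return wanted in parse_wheel_tags(filename)["python"].split(".")


if __name__ == "__main__":
    wheel = "torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl"
    print(parse_wheel_tags(wheel))
    print("matches this interpreter:", wheel_matches_interpreter(wheel))
```

Running this with the interpreter that pip is actually invoked from tells you whether the `cp310` tag can match. A "not a supported wheel on this platform" error usually means one of the three tags does not.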

  
  
Posted 2 years ago

In venv mode in this case

  
  
Posted 2 years ago

Can you try to manually install it and see what you are getting?
python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl

  
  
Posted 2 years ago

I resolved the issues by making my own docker image and making all envs the same:
- The env that runs clearml-agent
- The docker env for running tasks in
- The env that requests task execution (my client)

  
  
Posted 2 years ago

Despite having manually installed this torch version, during task execution agent still tries to install it somehow and fails:
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torchvision==0.13.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torchvision-0.13.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
INFO:clearml_agent.commands.worker:Traceback (most recent call last):
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/commands/worker.py", line 2893, in install_requirements_for_package_api
    package_api.load_requirements(cached_requirements)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/venv.py", line 41, in load_requirements
    super(VirtualenvPip, self).load_requirements(requirements)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 63, in load_requirements
    self.install_from_file(path)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 37, in install_from_file
    self.run_with_env(('install', '-r', path) + self.install_flags(), cwd=self.cwd)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/package/pip_api/system.py", line 90, in run_with_env
    return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/process.py", line 193, in check_call
    return self.call_subprocess(subprocess.check_call, *args, **kwargs)
  File "/home/boris/.local/lib/python3.8/site-packages/clearml_agent/helper/process.py", line 240, in call_subprocess
    return func(list(self), *args, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/boris/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqslrbfmwej.txt']' returned non-zero exit status 1.

AgitatedDove14 Is there a way to debug what it is doing during env setup?

  
  
Posted 2 years ago

I will try soon

  
  
Posted 2 years ago

I think I understand what the issue is: you have installed the agent on your Python 3.8, but it is running and trying to install on Python 3.10.
To verify, reinstall the agent under Python 3.10 and launch it from there:
pip uninstall clearml-agent
python3.10 -m pip install clearml-agent
python3.10 -m clearml_agent daemon...
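
Before reinstalling, a quick way to see which interpreter actually has the agent is to run a small check with each Python you have. This is only a diagnostic sketch using the standard library; the helper name `describe_interpreter` is made up here, and `clearml_agent` is the import name that appears in the traceback paths above.

```python
import importlib.util
import sys


def describe_interpreter(module_name: str = "clearml_agent") -> dict:
    """Report which Python is running and whether it can import module_name."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "module_found": spec is not None,
        "module_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Save it under any name and run it once with `python3.8` and once with `python3.10`. If only the 3.8 run reports the module as found, the daemon is being launched from 3.8 even though the task venv is built for 3.10.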

  
  
Posted 2 years ago