Answered

When I Setup My Local Virtual Environment I Use A Combination Of Conda And Pip. I Use Conda As My Environment Manager, And Then Use Pip For Packages That Are Not In The Conda Repositories.

When I setup my local virtual environment I use a combination of Conda and pip. I use conda as my environment manager, and then use pip for packages that are not in the conda repositories.

I am having an issue getting a PyTorch model to train on a remote compute server using clearml and I am wondering if it is something to do with the virtual environment. I have tried all three methods pip venv, conda venv and docker. The closest have something working is using the default pip package venv method. The model runs for an iteration and then crashes with cudnn error. I can’t help thinking this is something to do with the pip installation of PyTorch as they recommended using the conda channel to install it.

I am running this on a bespoke azure vm image that I created, with 10.1, 10.2, 11.1 and 11.2 CUDA versions installed, and the correct CUDNN libraries supported by the latest nvidia driver. I have verified that the CUDA drivers are working as I have been able to train models in conda environments directly on the machine, but not using clearml. I use the same trainer classes that I have written using Ignite, that I have used with the clearml experiments.

I have made sure that the CUDA version is correctly specified for the version being used by the PyTorch installation, and I have been running the clearml-agent in its own conda environment, with the environment variables pointing towards the correct version CUDA, in this case 11.1. I have verified that the different versions of CUDA work properly by setting up conda environments with the different versions of PyTorch and successfully trained models using 10.2, 11.1 and 11.2, outside of clearml

However I run into the same cudnn error on the forward calculation of first iteration when I run it through clearml.
I still post some details of errors and environments later, but I just wanted to hear if people have had issues getting PyTorch models training with clearml.

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

Votes Newest

Answers 10

The following code is the training script that was used to setup the experiment. This code has been executed on the server in a separate conda environment and verified to run fine (minus the clearml code).

` from future import print_function, division
import os, pathlib

Clear ML experiment

from clearml import Task, StorageManager, Dataset

Local modules

from cub_tools.trainer import Ignite_Trainer
from cub_tools.args import get_parser
from cub_tools.config import get_cfg_defaults

Get the arguments from the command line, including configuration file and any overrides.

parser = get_parser()
parser.print_help()
args = parser.parse_args()

#print('[INFO] Optional Arguments from CLI:: {}'.format(args.opts))
#if args.opts == '[]':

args.opts = list()

print('[INFO] Setting empty CLI args to an explicit empty list')

CLEAR ML

Tmp config load for network name

cfg = get_cfg_defaults()
cfg.merge_from_file(args.config)

Connecting with the ClearML process

task = Task.init(project_name='Caltech Birds', task_name='Train PyTorch CNN on CUB200 using Ignite [Library: '+cfg.MODEL.MODEL_LIBRARY+', Network: '+cfg.MODEL.MODEL_NAME+']', task_type=Task.TaskTypes.training)

Add the local python package as a requirement

task.add_requirements('./cub_tools')
task.add_requirements('git+ ')

Setup ability to add configuration parameters control.

params = {'TRAIN.NUM_EPOCHS': 20, 'TRAIN.BATCH_SIZE': 32, 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9}
params = task.connect(params) # enabling configuration override by clearml
print(params) # printing actual configuration (after override in remote mode)

Convert Params dictionary into a set of key value pairs in a list

params_list = []
for key in params:
params_list.extend([key,params[key]])

Execute task remotely

task.execute_remotely()

Get the dataset from the clearml-server and cache locally.

print('[INFO] Getting a local copy of the CUB200 birds datasets')

Train

train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset')
#train_dataset.get_mutable_local_copy(target_folder='./data/images/train')
print('[INFO] Default location of training dataset:: {}'.format(train_dataset.get_default_storage()))
train_dataset_base = train_dataset.get_local_copy()
print('[INFO] Default location of training dataset:: {}'.format(train_dataset_base))

Test

test_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_test_dataset')
#train_dataset.get_mutable_local_copy(target_folder='./data/images/train')
print('[INFO] Default location of testing dataset:: {}'.format(test_dataset.get_default_storage()))
test_dataset_base = test_dataset.get_local_copy()
print('[INFO] Default location of testing dataset:: {}'.format(test_dataset_base))

Amend the input data directories and output directories for remote execution

Modify experiment root dir

params_list = params_list + ['DIRS.ROOT_DIR', '']

Add data root dir

params_list = params_list + ['DATA.DATA_DIR', str(pathlib.PurePath(train_dataset_base).parent)]

Add data train dir

params_list = params_list + ['DATA.TRAIN_DIR', str(pathlib.PurePath(train_dataset_base).name)]

Add data test dir

params_list = params_list + ['DATA.TEST_DIR', str(pathlib.PurePath(test_dataset_base).name)]

Add working dir

params_list = params_list + ['DIRS.WORKING_DIR', str(task.cache_dir)]
print('[INFO] Task output destination:: {}'.format(task.get_output_destination()))

print('[INFO] Final parameter list passed to Trainer object:: {}'.format(params_list))

Create the trainer object

trainer = Ignite_Trainer(config=args.config, cmd_args=params_list) # NOTE: disabled cmd line argument passing but using it to pass ClearML configs.

Setup the data transformers

print('[INFO] Creating data transforms...')
trainer.create_datatransforms()

Setup the dataloaders

print('[INFO] Creating data loaders...')
trainer.create_dataloaders()

Setup the model

print('[INFO] Creating the model...')
trainer.create_model()

Setup the optimizer

print('[INFO] Creating optimizer...')
trainer.create_optimizer()

Setup the scheduler

trainer.create_scheduler()

Train the model

trainer.run() `

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

I have also tried training a variety of network architectures from a number of libraries (Torchvision, pytorchcv, TIMM), as well as a simple VGG implementation from scratch, and come across the same issues.

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

I also think clearing out the venv directory before executing the change of package manager really helped.

Is there a helper function option at all that means you can flush the clearml-agent working space automatically, or by command?

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

Hi VivaciousPenguin66
Seems like a CUDA/CUDNN issue.
You argent is configured to work in venvmode, which mean it will pull the correct pytorch version based on the detected CUDA driver support. Speicifally you can see in the log "agent.cuda_version = 111" which means CUDA 11.1 and from the log it found the correct pytorch version:
Torch CUDA 111 download page found Found PyTorch version torch==1.8.1 matching CUDA version 111 Found PyTorch version torchvision==0.9.1 matching CUDA version 111 Collecting torch==1.8.1+cu111 File was already downloaded /home/edmorris/.clearml/pip-download-cache/cu111/torch-1.8.1+cu111-cp38-cp38-linux_x86_64.whl ...The error itself seems like pytorch/cuda compatibility issue, not directly connected with clearml , no?

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hi AgitatedDove14 ,

Thanks for your points.

I have updated and https://github.com/allegroai/clearml-agent/issues/66 relating to this issue.

For completeness, you were right it being an issue with PyTorch, there is a breaking issue with PyTorch 1.8.1 and Cuda 11.1 when installed via PIP.

It's recommended that PyTorch be installed with Conda to circumvent this issue.

https://github.com/pytorch/pytorch/issues/56747

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

The following was reported by the agent during the setup phase of the compute environment on the remote compute resource:
Log file is attached.

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

Thanks VivaciousPenguin66 !
BTW: if you are running the local code with conda, you can set the agent to use conda as well (notice that if you are running locally with pip, the agent's conda env will use pip to install the packages to avoid version mismatch)

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Is there a helper function option at all that means you can flush the clearml-agent working space automatically, or by command?

Every Task execution the agent clears the venv (packages are cached locally, but the actual venv is cleared). If you want you can turn on the venv cache, but there is no need to manually clear the agent's cache.

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

AgitatedDove14 yes that's great.
I finally got the clearml-agent working correctly using CONDA to create the environment, and indeed, it used PIP when it couldn't find the package on CondaCloud. This is analogous to how I create python environments manually.

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

The error after the first iteration as follows:

` [INFO] Executing model training...

1621437621593 ecm-clearml-compute-gpu-001:0 DEBUG Epoch: 0001 TrAcc: 0.296 ValAcc: 0.005 TrPrec: 0.393 ValPrec: 0.000 TrRec: 0.296 ValRec: 0.005 TrF1: 0.262 ValF1: 0.000 TrTopK: 0.613 ValTopK: 0.026 TrLoss: 3.506 ValLoss: 5.299
Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x564675040c30
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
output: TensorDescriptor 0x564674fa4210
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
weight: FilterDescriptor 0x564674fa1b60
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 128, 128, 3, 3,
Pointer addresses:
input: 0x7f151ec40000
output: 0x7f1518000000
weight: 0x7f154cd2e400
Forward algorithm: 7

Engine run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

Traceback (most recent call last):
File "train_clearml_pytorch_ignite_caltech_birds.py", line 104, in <module>
trainer.run()
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/cub_tools/trainer.py", line 640, in run
self.train_engine.run(self.train_loader, max_epochs=self.config.TRAIN.NUM_EPOCHS)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
return self._internal_run()
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
self._handle_exception(e)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 745, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
self._handle_exception(e)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/cub_tools/trainer.py", line 448, in train_step
y_pred = self.model(x)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 249, in forward
return self._forward_impl(x)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 238, in _forward_impl
x = self.layer2(x)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 74, in forward
out = self.conv2(out)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

1621437626467 ecm-clearml-compute-gpu-001:0 DEBUG Process failed, exit code 1 `

  				
Posted 
	3 years ago

					More  		
  Report
		
					VivaciousPenguin66
				
					0
					 × 1

Write your answer

2K Views

10 Answers

3 years ago

2 years ago