I also think clearing out the venv directory before changing the package manager really helped.
Is there a helper function or command that can flush the clearml-agent working space automatically, or on demand?
The following code is the training script that was used to set up the experiment. This code has been executed on the server in a separate conda environment and verified to run fine (minus the ClearML code).
` from __future__ import print_function, division
import os, pathlib

# ClearML experiment
from clearml import Task, StorageManager, Dataset

# Local modules
from cub_tools.trainer import Ignite_Trainer
from cub_tools.args import get_parser
from cub_tools.config import get_cfg_defaults

# Get the arguments from the command line, including configuration file and any overrides.
parser = get_parser()
parser.print_help()
args = parser.parse_args()
#print('[INFO] Optional Arguments from CLI:: {}'.format(args.opts))
#if args.opts == '[]':
args.opts = list()
print('[INFO] Setting empty CLI args to an explicit empty list')

## CLEAR ML
# Tmp config load for network name
cfg = get_cfg_defaults()
cfg.merge_from_file(args.config)

# Connecting with the ClearML process
task = Task.init(
    project_name='Caltech Birds',
    task_name='Train PyTorch CNN on CUB200 using Ignite [Library: '+cfg.MODEL.MODEL_LIBRARY+', Network: '+cfg.MODEL.MODEL_NAME+']',
    task_type=Task.TaskTypes.training)

# Add the local python package as a requirement
task.add_requirements('./cub_tools')
task.add_requirements('git+ ')

# Setup ability to add configuration parameter control.
params = {'TRAIN.NUM_EPOCHS': 20, 'TRAIN.BATCH_SIZE': 32, 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9}
params = task.connect(params)  # enabling configuration override by clearml
print(params)  # printing actual configuration (after override in remote mode)

# Convert params dictionary into a flat list of key-value pairs
params_list = []
for key in params:
    params_list.extend([key, params[key]])

# Execute task remotely
task.execute_remotely()

# Get the datasets from the clearml-server and cache locally.
print('[INFO] Getting a local copy of the CUB200 birds datasets')

# Train
train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset')
#train_dataset.get_mutable_local_copy(target_folder='./data/images/train')
print('[INFO] Default location of training dataset:: {}'.format(train_dataset.get_default_storage()))
train_dataset_base = train_dataset.get_local_copy()
print('[INFO] Default location of training dataset:: {}'.format(train_dataset_base))

# Test
test_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_test_dataset')
#train_dataset.get_mutable_local_copy(target_folder='./data/images/train')
print('[INFO] Default location of testing dataset:: {}'.format(test_dataset.get_default_storage()))
test_dataset_base = test_dataset.get_local_copy()
print('[INFO] Default location of testing dataset:: {}'.format(test_dataset_base))

# Amend the input data directories and output directories for remote execution
# Modify experiment root dir
params_list = params_list + ['DIRS.ROOT_DIR', '']
# Add data root dir
params_list = params_list + ['DATA.DATA_DIR', str(pathlib.PurePath(train_dataset_base).parent)]
# Add data train dir
params_list = params_list + ['DATA.TRAIN_DIR', str(pathlib.PurePath(train_dataset_base).name)]
# Add data test dir
params_list = params_list + ['DATA.TEST_DIR', str(pathlib.PurePath(test_dataset_base).name)]
# Add working dir
params_list = params_list + ['DIRS.WORKING_DIR', str(task.cache_dir)]
print('[INFO] Task output destination:: {}'.format(task.get_output_destination()))
print('[INFO] Final parameter list passed to Trainer object:: {}'.format(params_list))

# Create the trainer object
trainer = Ignite_Trainer(config=args.config, cmd_args=params_list)  # NOTE: disabled cmd line argument passing but using it to pass ClearML configs.

# Setup the data transformers
print('[INFO] Creating data transforms...')
trainer.create_datatransforms()

# Setup the dataloaders
print('[INFO] Creating data loaders...')
trainer.create_dataloaders()

# Setup the model
print('[INFO] Creating the model...')
trainer.create_model()

# Setup the optimizer
print('[INFO] Creating optimizer...')
trainer.create_optimizer()

# Setup the scheduler
trainer.create_scheduler()

# Train the model
trainer.run() `
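For context, on the remote machine the task is picked up by a standard clearml-agent daemon listening on a queue; the line below is only an illustrative sketch (the queue name "default" is my placeholder, not from the original setup):
` clearml-agent daemon --queue default --foreground `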
Hi AgitatedDove14 ,
Thanks for your points.
I have updated https://github.com/allegroai/clearml-agent/issues/66 relating to this issue.
For completeness, you were right that it was an issue with PyTorch: there is a breaking issue with PyTorch 1.8.1 and CUDA 11.1 when installed via pip.
It's recommended that PyTorch be installed with conda to circumvent this issue.
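For reference, the conda install for that combination, as I recall it from the PyTorch previous-versions page (double-check the channels for your own setup), is roughly:
` conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cudatoolkit=11.1 -c pytorch -c conda-forge `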
AgitatedDove14 yes that's great.
I finally got the clearml-agent working correctly using conda to create the environment, and indeed, it fell back to pip when it couldn't find a package on Anaconda Cloud. This is analogous to how I create Python environments manually.
I have also tried training a variety of network architectures from a number of libraries (torchvision, pytorchcv, timm), as well as a simple VGG implementation from scratch, and came across the same issues.
Is there a helper function or command that can flush the clearml-agent working space automatically, or on demand?
On every Task execution the agent clears the venv (packages are cached locally, but the venv itself is cleared). If you want, you can turn on venv caching, but there is no need to manually clear the agent's cache.
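(For reference, my understanding is that venv caching is switched on in the agent section of ~/clearml.conf by uncommenting the venvs_cache path; treat the keys below as a sketch of the default layout rather than a definitive reference:)
` agent {
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space (GB) to allow a new cache entry
        free_space_threshold_gb: 2.0
        # uncomment to enable venv caching
        path: ~/.clearml/venvs-cache
    }
} `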
Hi VivaciousPenguin66
Seems like a CUDA/cuDNN issue.
Your agent is configured to work in venv mode, which means it will pull the correct PyTorch version based on the detected CUDA driver support. Specifically, you can see "agent.cuda_version = 111" in the log, which means CUDA 11.1, and the log shows it found the matching PyTorch version:
` Torch CUDA 111 download page found
Found PyTorch version torch==1.8.1 matching CUDA version 111
Found PyTorch version torchvision==0.9.1 matching CUDA version 111
Collecting torch==1.8.1+cu111
File was already downloaded /home/edmorris/.clearml/pip-download-cache/cu111/torch-1.8.1+cu111-cp38-cp38-linux_x86_64.whl
... `
The error itself seems like a PyTorch/CUDA compatibility issue, not directly connected with ClearML, no?
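If it helps narrow it down, here is a minimal sketch (plain PyTorch calls, nothing ClearML-specific) to print what the remote environment actually resolved:
` # Quick diagnostic: report the torch / CUDA / cuDNN versions seen at runtime
import torch

print('torch version   :', torch.__version__)             # e.g. 1.8.1+cu111
print('built with CUDA :', torch.version.cuda)             # CUDA toolkit the wheel targets
print('cuDNN version   :', torch.backends.cudnn.version())
print('CUDA available  :', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU             :', torch.cuda.get_device_name(0)) `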
The following was reported by the agent during the setup phase of the compute environment on the remote compute resource:
Log file is attached.
Thanks VivaciousPenguin66 !
BTW: if you are running the local code with conda, you can set the agent to use conda as well (note that if you are running locally with pip, the agent's conda env will use pip to install the packages, to avoid version mismatches).
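(If useful, my understanding is that this is controlled by the agent.package_manager.type setting in ~/clearml.conf; a sketch of the relevant section:)
` agent {
    package_manager: {
        # one of: pip, conda, poetry
        type: conda
    }
} `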
The error after the first iteration is as follows:
` [INFO] Executing model training...
1621437621593 ecm-clearml-compute-gpu-001:0 DEBUG Epoch: 0001 TrAcc: 0.296 ValAcc: 0.005 TrPrec: 0.393 ValPrec: 0.000 TrRec: 0.296 ValRec: 0.005 TrF1: 0.262 ValF1: 0.000 TrTopK: 0.613 ValTopK: 0.026 TrLoss: 3.506 ValLoss: 5.299
Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x564675040c30
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
output: TensorDescriptor 0x564674fa4210
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
weight: FilterDescriptor 0x564674fa1b60
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 128, 128, 3, 3,
Pointer addresses:
input: 0x7f151ec40000
output: 0x7f1518000000
weight: 0x7f154cd2e400
Forward algorithm: 7
Engine run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x564675040c30
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
output: TensorDescriptor 0x564674fa4210
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
weight: FilterDescriptor 0x564674fa1b60
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 128, 128, 3, 3,
Pointer addresses:
input: 0x7f151ec40000
output: 0x7f1518000000
weight: 0x7f154cd2e400
Forward algorithm: 7
Traceback (most recent call last):
File "train_clearml_pytorch_ignite_caltech_birds.py", line 104, in <module>
trainer.run()
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/cub_tools/trainer.py", line 640, in run
self.train_engine.run(self.train_loader, max_epochs=self.config.TRAIN.NUM_EPOCHS)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
return self._internal_run()
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
self._handle_exception(e)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 745, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
self._handle_exception(e)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/cub_tools/trainer.py", line 448, in train_step
y_pred = self.model(x)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 249, in forward
return self._forward_impl(x)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 238, in _forward_impl
x = self.layer2(x)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torchvision/models/resnet.py", line 74, in forward
out = self.conv2(out)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x564675040c30
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
output: TensorDescriptor 0x564674fa4210
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
weight: FilterDescriptor 0x564674fa1b60
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 128, 128, 3, 3,
Pointer addresses:
input: 0x7f151ec40000
output: 0x7f1518000000
weight: 0x7f154cd2e400
Forward algorithm: 7
1621437626467 ecm-clearml-compute-gpu-001:0 DEBUG Process failed, exit code 1 `