I Have Been Successfully Deploying And Training A Pytorch Cnn On A

Answered

I have been successfully deploying and training a PyTorch CNN on a clearml-agent managed compute resource, and have been testing some of its capabilities, including scaling. I have fired up another compute node and started a clearml-agent listening on the same queue to double the available GPU resources. I have a number of jobs sitting in the queue waiting to be trained, which are basically different network architectures for the same problem.

When the first job runs on the new compute node, it has to cache a copy of the image dataset for the problem from the clearml-server before the training job can execute. This all happens as expected, and the job has started training. However, I keep getting the following error in the terminal log output:

2021-05-21 09:51:36,268 - clearml.Metrics - ERROR - Action failed <400/131: events.add_batch/v1.0 (Events not added: Invalid task id=1)>

It doesn't appear to be affecting the job or its reporting, as the training metrics appear to be collected as normal, as on the original compute node I have been testing on.
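To make the error above easier to inspect, here is a minimal sketch that pulls the individual fields out of the logged line. Reading `400/131` as an HTTP status followed by a server subcode is an assumption based on the message layout, and `parse_action_failed` is a hypothetical helper, not part of the ClearML SDK.

```python
import re

# Pattern for the "Action failed" line quoted above. The group names
# (status, subcode, endpoint, version, detail) are illustrative labels.
ACTION_FAILED = re.compile(
    r"Action failed <(?P<status>\d+)/(?P<subcode>\d+): "
    r"(?P<endpoint>[\w.]+)/v(?P<version>[\d.]+) \((?P<detail>.*)\)>"
)


def parse_action_failed(line):
    """Return the error fields as a dict, or None if the line does not match."""
    match = ACTION_FAILED.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Extract the task id the server rejected, if the detail mentions one.
    task = re.search(r"task id=(\w+)", fields["detail"])
    fields["task_id"] = task.group(1) if task else None
    return fields


line = (
    "2021-05-21 09:51:36,268 - clearml.Metrics - ERROR - Action failed "
    "<400/131: events.add_batch/v1.0 (Events not added: Invalid task id=1)>"
)
info = parse_action_failed(line)
print(info["endpoint"], info["status"], info["subcode"], info["task_id"])
```

Run on the line from this post, it shows that the rejected call is `events.add_batch` and that the task id sent with it is the literal string `1`.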

Q. Is there a setting, or a call related to the dataset download, that is needed to stop the job from logging until the model is actually being trained?

Q. Is this a timeout issue, since the job has to wait for the dataset and pre-trained model weights to download?
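One way to keep the download separate from the training loop is to fetch the datasets explicitly before any training code runs. The sketch below assumes the data is registered as a ClearML Dataset and relies on `Dataset.get(...).get_local_copy()`; `prefetch_datasets` and its `fetch` parameter are hypothetical names introduced for illustration. Whether doing this has any effect on the error above is not established in this thread.

```python
import time


def _clearml_fetch(dataset_id):
    # Imported lazily so this module can be loaded without clearml installed.
    from clearml import Dataset

    # get_local_copy() downloads the dataset on first use and returns the
    # path of the locally cached copy.
    return Dataset.get(dataset_id=dataset_id).get_local_copy()


def prefetch_datasets(dataset_ids, fetch=_clearml_fetch):
    """Fetch every dataset up front and report how long each one took.

    dataset_ids maps a label (e.g. "train") to a dataset id.
    fetch is injectable so the function can be exercised without a server.
    """
    paths = {}
    for name, dataset_id in dataset_ids.items():
        start = time.monotonic()
        paths[name] = fetch(dataset_id)
        elapsed = time.monotonic() - start
        print(f"[INFO] {name} dataset ready at {paths[name]} ({elapsed:.1f}s)")
    return paths
```

In a training script this would be called once, before the model and data loaders are created, for example `prefetch_datasets({"train": "<train-dataset-id>", "test": "<test-dataset-id>"})` with the real dataset ids filled in.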

  				
Posted 
	4 years ago

		
					VivaciousPenguin66
				


Answers 6

I've not seen this before.

  				
Posted 
	4 years ago

		
					VivaciousPenguin66
				

This job did download pre-trained weights, so the only difference between them is the local dataset cache.

  				
Posted 
	4 years ago

		
					VivaciousPenguin66
				

SuccessfulKoala55 A second queued job executed on the same node and hasn't had this issue. This time it didn't need to cache the dataset locally, as that had already been done by the previous experiment.

All that being said, apart from making the console output look messy, it doesn't appear to have affected the training, or indeed the metric collection, of the first experiment where it occurred.

  				
Posted 
	4 years ago

		
					VivaciousPenguin66
				

` Starting Task Execution:

usage: train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE]
[--opts ...]

optional arguments:
-h, --help show this help message and exit
--config FILE Path and name of configuration file for training. Should be a
.yaml file.
--opts ... Modify config options using the command-line 'KEY VALUE'
pairs
ClearML results page:
{'MODEL': {'MODEL_LIBRARY': 'timm', 'MODEL_NAME': 'res2net101_26w_4s', 'PRETRAINED': True, 'WITH_AMP': False, 'WITH_GRAD_SCALE': False}, 'TRAIN': {'BATCH_SIZE': 16, 'NUM_WORKERS': 4, 'NUM_EPOCHS': 40, 'LOSS': {'CRITERION': 'CrossEntropy'}, 'OPTIMIZER': {'TYPE': 'SGD', 'PARAMS': {'lr': 0.001, 'momentum': 0.9, 'nesterov': True}}, 'SCHEDULER': {'TYPE': 'StepLR', 'PARAMS': {'step_size': 7, 'gamma': 0.1}}}, 'EARLY_STOPPING_PATIENCE': 5, 'DIRS': {'ROOT_DIR': '/home/edmorris/projects/image_classification/caltech_birds', 'WORKING_DIR': 'models/classification', 'CLEAN_UP': True}, 'DATA': {'DATA_DIR': 'data/images', 'TRAIN_DIR': 'train', 'TEST_DIR': 'test', 'NUM_CLASSES': 200, 'TRANSFORMS': {'TYPE': 'default', 'PARAMS': {'DEFAULT': {'img_crop_size': 224, 'img_resize': 256}, 'AGGRESIVE': {'type': 'all', 'persp_distortion_scale': 0.25, 'rotation_range': (-10.0, 10.0)}}}}, 'SYSTEM': {'LOG_HISTORY': True}}
[INFO] Getting a local copy of the CUB200 birds datasets
[INFO] Default location of training dataset::
[INFO] Default location of training dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_0ccff21334e84b3d8e0618c5f1734cc8

1621593640195 ecm-clearml-compute-gpu-002:gpuall DEBUG [INFO] Default location of testing dataset::
[INFO] Default location of testing dataset:: /home/edmorris/.clearml/cache/storage_manager/datasets/ds_b435c4ffda374bca83d9a746137dc3ca
[INFO] Task output destination:: None
[INFO] Final parameter list passed to Trainer object:: ['MODEL', {'MODEL_LIBRARY': 'timm', 'MODEL_NAME': 'res2net101_26w_4s', 'PRETRAINED': True, 'WITH_AMP': False, 'WITH_GRAD_SCALE': False}, 'TRAIN', {'BATCH_SIZE': 16, 'NUM_WORKERS': 4, 'NUM_EPOCHS': 40, 'LOSS': {'CRITERION': 'CrossEntropy'}, 'OPTIMIZER': {'TYPE': 'SGD', 'PARAMS': {'lr': 0.001, 'momentum': 0.9, 'nesterov': True}}, 'SCHEDULER': {'TYPE': 'StepLR', 'PARAMS': {'step_size': 7, 'gamma': 0.1}}}, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS', {'ROOT_DIR': '/home/edmorris/projects/image_classification/caltech_birds', 'WORKING_DIR': 'models/classification', 'CLEAN_UP': True}, 'DATA', {'DATA_DIR': 'data/images', 'TRAIN_DIR': 'train', 'TEST_DIR': 'test', 'NUM_CLASSES': 200, 'TRANSFORMS': {'TYPE': 'default', 'PARAMS': {'DEFAULT': {'img_crop_size': 224, 'img_resize': 256}, 'AGGRESIVE': {'type': 'all', 'persp_distortion_scale': 0.25, 'rotation_range': (-10.0, 10.0)}}}}, 'SYSTEM', {'LOG_HISTORY': True}, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/7af65e74ebc144b4949c6ef5880b1dec']
[INFO] Parameters Override:: ['MODEL', {'MODEL_LIBRARY': 'timm', 'MODEL_NAME': 'res2net101_26w_4s', 'PRETRAINED': True, 'WITH_AMP': False, 'WITH_GRAD_SCALE': False}, 'TRAIN', {'BATCH_SIZE': 16, 'NUM_WORKERS': 4, 'NUM_EPOCHS': 40, 'LOSS': {'CRITERION': 'CrossEntropy'}, 'OPTIMIZER': {'TYPE': 'SGD', 'PARAMS': {'lr': 0.001, 'momentum': 0.9, 'nesterov': True}}, 'SCHEDULER': {'TYPE': 'StepLR', 'PARAMS': {'step_size': 7, 'gamma': 0.1}}}, 'EARLY_STOPPING_PATIENCE', 5, 'DIRS', {'ROOT_DIR': '/home/edmorris/projects/image_classification/caltech_birds', 'WORKING_DIR': 'models/classification', 'CLEAN_UP': True}, 'DATA', {'DATA_DIR': 'data/images', 'TRAIN_DIR': 'train', 'TEST_DIR': 'test', 'NUM_CLASSES': 200, 'TRANSFORMS': {'TYPE': 'default', 'PARAMS': {'DEFAULT': {'img_crop_size': 224, 'img_resize': 256}, 'AGGRESIVE': {'type': 'all', 'persp_distortion_scale': 0.25, 'rotation_range': (-10.0, 10.0)}}}}, 'SYSTEM', {'LOG_HISTORY': True}, 'DIRS.ROOT_DIR', '', 'DATA.DATA_DIR', '/home/edmorris/.clearml/cache/storage_manager/datasets', 'DATA.TRAIN_DIR', 'ds_0ccff21334e84b3d8e0618c5f1734cc8', 'DATA.TEST_DIR', 'ds_b435c4ffda374bca83d9a746137dc3ca', 'DIRS.WORKING_DIR', '/home/edmorris/.clearml/cache/7af65e74ebc144b4949c6ef5880b1dec']
DATA:
DATA_DIR: /home/edmorris/.clearml/cache/storage_manager/datasets
NUM_CLASSES: 200
TEST_DIR: ds_b435c4ffda374bca83d9a746137dc3ca
TRAIN_DIR: ds_0ccff21334e84b3d8e0618c5f1734cc8
TRANSFORMS:
PARAMS:
AGGRESIVE:
persp_distortion_scale: 0.25
rotation_range: (-10.0, 10.0)
type: all
DEFAULT:
img_crop_size: 224
img_resize: 256
TYPE: default
DIRS:
CLEAN_UP: True
ROOT_DIR:
WORKING_DIR: /home/edmorris/.clearml/cache/7af65e74ebc144b4949c6ef5880b1dec/ignite_res2net101_26w_4s
EARLY_STOPPING_PATIENCE: 5
MODEL:
MODEL_LIBRARY: timm
MODEL_NAME: res2net101_26w_4s
PRETRAINED: True
WITH_AMP: False
WITH_GRAD_SCALE: False
SYSTEM:
LOG_HISTORY: True
TRAIN:
BATCH_SIZE: 16
LOSS:
CRITERION: CrossEntropy
NUM_EPOCHS: 40
NUM_WORKERS: 4
OPTIMIZER:
PARAMS:
lr: 0.001
momentum: 0.9
nesterov: True
TYPE: SGD
SCHEDULER:
PARAMS:
gamma: 0.1
step_size: 7
TYPE: StepLR
[INFO] Creating data transforms...
[INFO] Creating data loaders...

** DATASET SUMMARY **

train size:: 5994 images
test size:: 5794 images
Number of classes:: 200

[INFO] Created data loaders.
[INFO] Creating the model...
Downloading: " " to /home/edmorris/.cache/torch/hub/checkpoints/res2net101_26w_4s-02a759a1.pth

1621593645197 ecm-clearml-compute-gpu-002:gpuall DEBUG [INFO] Successfully created model and pushed it to the device cuda:0
[INFO] Creating optimizer...
[INFO] Successfully created optimizer object.
[INFO] Successfully created learning rate scheduler object.
[INFO] Trainer pass OK for training.
Tensorboard Logging...done
[INFO] Creating callback functions for training loop...Early Stopping (5 epochs).../home/edmorris/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/ignite/contrib/handlers/clearml_logger.py:659: UserWarning:

ClearMLSaver created a temporary checkpoints directory: /tmp/ignite_checkpoints_2021_05_21_10_40_43_y89b0b1k

Model Checkpointing...Done
[INFO] Executing model training...

1621593881321 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0001 TrAcc: 0.210 ValAcc: 0.218 TrPrec: 0.250 ValPrec: 0.235 TrRec: 0.210 ValRec: 0.220 TrF1: 0.159 ValF1: 0.161 TrTopK: 0.519 ValTopK: 0.543 TrLoss: 3.949 ValLoss: 3.770

1621594122525 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0002 TrAcc: 0.353 ValAcc: 0.366 TrPrec: 0.435 ValPrec: 0.419 TrRec: 0.353 ValRec: 0.368 TrF1: 0.321 ValF1: 0.318 TrTopK: 0.722 ValTopK: 0.757 TrLoss: 2.832 ValLoss: 2.600

1621594363697 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0003 TrAcc: 0.485 ValAcc: 0.497 TrPrec: 0.542 ValPrec: 0.538 TrRec: 0.485 ValRec: 0.496 TrF1: 0.449 ValF1: 0.457 TrTopK: 0.817 ValTopK: 0.849 TrLoss: 2.142 ValLoss: 1.930

1621594609829 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0004 TrAcc: 0.555 ValAcc: 0.555 TrPrec: 0.610 ValPrec: 0.591 TrRec: 0.555 ValRec: 0.557 TrF1: 0.528 ValF1: 0.519 TrTopK: 0.865 ValTopK: 0.893 TrLoss: 1.773 ValLoss: 1.586

1621594851050 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0005 TrAcc: 0.589 ValAcc: 0.592 TrPrec: 0.646 ValPrec: 0.644 TrRec: 0.589 ValRec: 0.594 TrF1: 0.567 ValF1: 0.562 TrTopK: 0.878 ValTopK: 0.910 TrLoss: 1.532 ValLoss: 1.374

1621595097293 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0006 TrAcc: 0.661 ValAcc: 0.655 TrPrec: 0.705 ValPrec: 0.687 TrRec: 0.661 ValRec: 0.657 TrF1: 0.648 ValF1: 0.639 TrTopK: 0.899 ValTopK: 0.934 TrLoss: 1.310 ValLoss: 1.174

1621595338441 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0007 TrAcc: 0.693 ValAcc: 0.676 TrPrec: 0.732 ValPrec: 0.717 TrRec: 0.692 ValRec: 0.679 TrF1: 0.682 ValF1: 0.661 TrTopK: 0.902 ValTopK: 0.934 TrLoss: 1.199 ValLoss: 1.097

1621595579621 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0008 TrAcc: 0.755 ValAcc: 0.739 TrPrec: 0.763 ValPrec: 0.755 TrRec: 0.755 ValRec: 0.740 TrF1: 0.746 ValF1: 0.729 TrTopK: 0.925 ValTopK: 0.951 TrLoss: 1.007 ValLoss: 0.941

1621595825749 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0009 TrAcc: 0.768 ValAcc: 0.748 TrPrec: 0.781 ValPrec: 0.753 TrRec: 0.768 ValRec: 0.749 TrF1: 0.763 ValF1: 0.739 TrTopK: 0.925 ValTopK: 0.952 TrLoss: 0.978 ValLoss: 0.909

1621596066957 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0010 TrAcc: 0.772 ValAcc: 0.753 TrPrec: 0.782 ValPrec: 0.755 TrRec: 0.772 ValRec: 0.754 TrF1: 0.768 ValF1: 0.745 TrTopK: 0.924 ValTopK: 0.956 TrLoss: 0.958 ValLoss: 0.869

1621596313317 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0011 TrAcc: 0.795 ValAcc: 0.761 TrPrec: 0.811 ValPrec: 0.766 TrRec: 0.795 ValRec: 0.762 TrF1: 0.793 ValF1: 0.754 TrTopK: 0.932 ValTopK: 0.954 TrLoss: 0.916 ValLoss: 0.871 `
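The log above also gives a rough picture of training speed on the new node. As a small worked example, assuming the leading 13-digit number on each `DEBUG` line is a Unix timestamp in milliseconds (this is not stated anywhere in the thread), the time between consecutive epoch reports can be computed as follows. `epoch_durations` is a hypothetical helper, and the sample lines are shortened copies of the first three epoch lines.

```python
import re

# Matches "<millisecond timestamp> <worker> DEBUG Epoch: <n> ..."
EPOCH_LINE = re.compile(r"^(?P<ts>\d{13}) \S+ DEBUG Epoch: (?P<epoch>\d+)")


def epoch_durations(log_lines):
    """Return {epoch: seconds elapsed since the previous epoch report}."""
    stamps = []
    for line in log_lines:
        match = EPOCH_LINE.match(line)
        if match:
            stamps.append((int(match["epoch"]), int(match["ts"])))
    # Pair each report with the one before it and convert ms to seconds.
    return {
        epoch: (ts - prev_ts) / 1000.0
        for (_, prev_ts), (epoch, ts) in zip(stamps, stamps[1:])
    }


sample = [
    "1621593881321 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0001 TrAcc: 0.210",
    "1621594122525 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0002 TrAcc: 0.353",
    "1621594363697 ecm-clearml-compute-gpu-002:gpuall DEBUG Epoch: 0003 TrAcc: 0.485",
]
print(epoch_durations(sample))
```

For the three lines shown, this gives roughly 241 seconds per epoch, i.e. about four minutes.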

  				
Posted 
	4 years ago

		
					VivaciousPenguin66
				

SuccessfulKoala55 However, this was the first time an experiment with this dataset was executed on this compute node. I did a lot of trial and error with this setup to get the models training, so on my first compute node the data was downloaded locally quite early on. I therefore hadn't seen the script need to download a local dataset cache before, as it was already done.
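To tell in advance whether a node will need to download the data, one option is to check for the cache folder before the job starts. The sketch below assumes the cache location seen in the log above (`~/.clearml/cache/storage_manager/datasets/ds_...`); that layout is inferred from this thread rather than from documented behaviour, and `is_dataset_cached` is a hypothetical helper.

```python
import os
from pathlib import Path

# Cache root as it appears in the log above; adjust it if the agent is
# configured with a different cache location.
DEFAULT_CACHE_ROOT = Path(os.path.expanduser("~/.clearml/cache/storage_manager/datasets"))


def is_dataset_cached(folder_name, cache_root=DEFAULT_CACHE_ROOT):
    """Return True if the dataset folder exists and holds at least one entry."""
    folder = Path(cache_root) / folder_name
    return folder.is_dir() and any(folder.iterdir())


# Folder names copied from the log above.
for name in (
    "ds_0ccff21334e84b3d8e0618c5f1734cc8",
    "ds_b435c4ffda374bca83d9a746137dc3ca",
):
    state = "cached" if is_dataset_cached(name) else "not cached"
    print(f"{name}: {state}")
```

A folder that exists but is empty is treated as not cached, since an interrupted download could leave one behind.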

  				
Posted 
	4 years ago

		
					VivaciousPenguin66
				

Hi VivaciousPenguin66, this error indicates the SDK attempted to report metrics for a non-existent task ID (1), but I'm not sure why...
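A quick sanity check along these lines is to compare the ID in the error with the shape of the other IDs that appear in this thread's log, which are all 32-character lowercase hexadecimal strings. The sketch below assumes task IDs follow that same shape; `looks_like_clearml_id` is a hypothetical helper, not an SDK function.

```python
import re

# Assumption: valid IDs look like the ones in the log,
# e.g. "7af65e74ebc144b4949c6ef5880b1dec" (32 lowercase hex characters).
HEX_ID = re.compile(r"[0-9a-f]{32}")


def looks_like_clearml_id(value):
    """Return True if value has the 32-character hex shape seen in the log."""
    return HEX_ID.fullmatch(str(value)) is not None


print(looks_like_clearml_id("7af65e74ebc144b4949c6ef5880b1dec"))  # True
print(looks_like_clearml_id("1"))  # False
```

Inside the training script, the ID the SDK is actually attached to could be printed with `Task.current_task().id` and passed through the same check, to see whether the `1` comes from the task itself or from somewhere else in the reporting path.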

  				
Posted 
	4 years ago

		
					SuccessfulKoala55
				
