Answered

I am having an issue publishing a completed model training.
The model has been deployed on remote compute, using a docker image, and the datasets have been served from an Azure Blob Storage account.

The model trains successfully and completes after the PyTorch Ignite early-stopping callback detects that the model has not improved for a set number of iterations. When I go to publish the experiment, I get a nondescript UI error message.
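For context, the patience-based logic behind Ignite's EarlyStopping handler can be sketched standalone. This is a self-contained illustration of the idea only; the class and method names here are illustrative, not the Ignite API.

```python
# Minimal sketch of patience-based early stopping: stop when the monitored
# score fails to improve for `patience` consecutive evaluations.
# Illustrative only; not ignite.handlers.EarlyStopping itself.
class EarlyStopper:
    def __init__(self, patience: int):
        self.patience = patience
        self.best_score = None
        self.counter = 0

    def step(self, score: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if self.best_score is None or score > self.best_score:
            self.best_score = score
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```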

A couple of additional things I have been trying since coming across this problem:

Nested project structure - I have begun using different sub-projects for different purposes. The model was trained in "MASTER PROJECT/TRAINING" rather than just "MASTER PROJECT". I am also looking into model deployment with clearml-serving, so the trainer class I have written now executes a method after training that creates the config.pbtxt file for the Triton Inference Server.
The terminal output of the final stages of model training is as follows:

` 2021-06-07 13:03:19 Epoch: 0023 TrAcc: 0.815 ValAcc: 0.778 TrPrec: 0.825 ValPrec: 0.783 TrRec: 0.815 ValRec: 0.778 TrF1: 0.815 ValF1: 0.776 TrTopK: 0.927 ValTopK: 0.955 TrLoss: 0.842 ValLoss: 0.812
2021-06-07 13:04:30 Epoch: 0024 TrAcc: 0.813 ValAcc: 0.776 TrPrec: 0.825 ValPrec: 0.781 TrRec: 0.813 ValRec: 0.778 TrF1: 0.814 ValF1: 0.774 TrTopK: 0.930 ValTopK: 0.954 TrLoss: 0.832 ValLoss: 0.814
2021-06-07 13:05:40 Epoch: 0025 TrAcc: 0.825 ValAcc: 0.779 TrPrec: 0.833 ValPrec: 0.783 TrRec: 0.825 ValRec: 0.780 TrF1: 0.825 ValF1: 0.777 TrTopK: 0.935 ValTopK: 0.955 TrLoss: 0.794 ValLoss: 0.809
2021-06-07 13:06:55 2021-06-07 12:06:51,427 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
Epoch: 0026 TrAcc: 0.813 ValAcc: 0.779 TrPrec: 0.822 ValPrec: 0.784 TrRec: 0.813 ValRec: 0.780 TrF1: 0.813 ValF1: 0.777 TrTopK: 0.928 ValTopK: 0.955 TrLoss: 0.833 ValLoss: 0.814
[INFO] Model training is complete.
[INFO] Creating deployment configuration...
2021-06-07 12:06:55,361 - clearml - WARNING - Could not retrieve remote configuration named 'config.pbtxt' Using default configuration: config.pbtxt
2021-06-07 13:07:00 Process completed successfully `

There is a warning regarding config.pbtxt; however, when checking the experiment's configuration, it looks as though the class method I have written has created the configuration file correctly.

` platform: "pytorch_libtorch"
input [
  {
    name: "input_layer"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "fc"
    data_type: TYPE_FP32
    dims: [ 200 ]
  }
] `

  
  
Posted 2 years ago

Answers 8


Are you using a self-hosted server? If so, what's the version? I have a feeling you're running v1.0.1 or v1.0.0 (as the "newer version" message on the top indicates). This error looks exactly like what was fixed on v1.0.2... (see https://clear.ml/docs/latest/docs/release_notes/ver_1_0#clearml-server-102 )

  
  
Posted 2 years ago

I checked the apiserver.log file in /opt/clearml/logs and this appears to be the related error when I try to publish an experiment:

` [2021-06-07 13:43:40,239] [9] [ERROR] [clearml.service_repo] ValidationError (Task:8a4a13bad8334d8bb53d7edb61671ba9) (setup_shell_script.StringField only accepts string values: ['container'])
Traceback (most recent call last):
File "/opt/clearml/apiserver/bll/task/task_operations.py", line 325, in publish_task
raise ex
File "/opt/clearml/apiserver/bll/task/task_operations.py", line 301, in publish_task
task.save()
File "/usr/local/lib/python3.6/site-packages/mongoengine/document.py", line 392, in save
self.validate(clean=clean)
File "/usr/local/lib/python3.6/site-packages/mongoengine/base/document.py", line 450, in validate
raise ValidationError(message, errors=errors)
mongoengine.errors.ValidationError: ValidationError (Task:8a4a13bad8334d8bb53d7edb61671ba9) (setup_shell_script.StringField only accepts string values: ['container'])

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/clearml/apiserver/service_repo/service_repo.py", line 277, in handle_call
ret = endpoint.func(call, company, call.data_model)
File "/opt/clearml/apiserver/services/tasks.py", line 1131, in publish_many
ids=request.ids,
File "/opt/clearml/apiserver/bll/util.py", line 122, in run_batch_operation
results.append((_id, func(_id)))
File "/opt/clearml/apiserver/bll/task/task_operations.py", line 329, in publish_task
task.save()
File "/usr/local/lib/python3.6/site-packages/mongoengine/document.py", line 392, in save
self.validate(clean=clean)
File "/usr/local/lib/python3.6/site-packages/mongoengine/base/document.py", line 450, in validate
raise ValidationError(message, errors=errors)
mongoengine.errors.ValidationError: ValidationError (Task:8a4a13bad8334d8bb53d7edb61671ba9) (setup_shell_script.StringField only accepts string values: ['container']) `

  
  
Posted 2 years ago

Can you check the browser's "Developer Tools/Network" section and see the exact API call that's failing? (including the payload sent in the request)
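Judging by the traceback posted above, the failing endpoint appears to be tasks.publish_many. A hypothetical command-line reproduction might look like the following; the host, port, credential variables, and task id are all placeholders to adapt to your deployment.

```shell
# Hypothetical reproduction of the failing call (tasks.publish_many, per the
# traceback). Host, port, credentials and task id are placeholders.
curl -s \
  -u "$CLEARML_API_ACCESS_KEY:$CLEARML_API_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ids": ["8a4a13bad8334d8bb53d7edb61671ba9"]}' \
  "http://localhost:8008/tasks.publish_many"
```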

  
  
Posted 2 years ago

SuccessfulKoala55
Good news!
It looks like pulling the new clearml-server version has solved the problem.
I can happily publish models.

Interestingly, I was able to publish models before using this server, so I must have inadvertently updated something that has caused a conflict.

  
  
Posted 2 years ago

Well, I'm not sure, but this error is related to a null value sent as the task's container field (which should be perfectly legal, of course)
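The shape of that failure can be illustrated with a self-contained sketch (this is not mongoengine's actual code): a schema declares a field string-only, and validating a document whose field holds a non-string value, such as None, raises an error naming the offending fields, much like `setup_shell_script.StringField only accepts string values: ['container']`.

```python
# Self-contained illustration of string-only field validation; not
# mongoengine's implementation, just the shape of the check that fails.
class ValidationError(Exception):
    pass

def validate_string_fields(document: dict, string_fields: list) -> None:
    """Raise ValidationError listing any declared string fields that
    are present but hold a non-string value (e.g. None)."""
    bad = [f for f in string_fields
           if f in document and not isinstance(document[f], str)]
    if bad:
        raise ValidationError(f"StringField only accepts string values: {bad}")
```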

  
  
Posted 2 years ago

FYI, I am training the model again, this time in a project that is not nested, just to rule out any oddities related to nested projects.

  
  
Posted 2 years ago

Hi SuccessfulKoala55
Thanks for the input.
I was actually about to grab the new docker-compose.yml and pull the new images.
Weirdly, it was working before, so what's changed?
I don't believe I've updated the agents or the ClearML SDK on the experiment-submission VM either.
I will definitely update the server now and report back.
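For reference, the self-hosted upgrade flow is roughly the following; the paths assume the default /opt/clearml layout, and you should back up /opt/clearml/data before upgrading.

```shell
# Rough sketch of the clearml-server upgrade flow (default /opt/clearml
# layout assumed; back up /opt/clearml/data first).
docker-compose -f /opt/clearml/docker-compose.yml down
curl -o /opt/clearml/docker-compose.yml \
  https://raw.githubusercontent.com/allegroai/clearml-server/master/docker-compose.yml
docker-compose -f /opt/clearml/docker-compose.yml pull
docker-compose -f /opt/clearml/docker-compose.yml up -d
```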

  
  
Posted 2 years ago

Hi VivaciousPenguin66 , this looks like an internal error indeed...

  
  
Posted 2 years ago