Answered
Hello, I'm getting this weird error from time to time when running a pipeline. It adds my tasks as drafts but never launches them. When I checked the logs, I see the following:

Hello,
I'm getting this weird error from time to time when running a pipeline. It adds my tasks as drafts but never launches them. When I checked the logs, I see the following:
launch step one
2022-02-25 13:46:31,253 - clearml.Task - ERROR - Action failed <500/100: events.get_task_events/v1.0 (General data error (NotFoundError(404, 'index_not_found_exception', 'no such index [events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b]', events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b, index_or_alias)))> (task=945ff9ec87904964a0c7763467033e26, order=asc, batch_size=100, event_type=training_debug_image)
2022-02-25 13:46:31,253 - clearml.Task - ERROR - Task deletion failed: Action failed <500/100: events.get_task_events/v1.0 (General data error (NotFoundError(404, 'index_not_found_exception', 'no such index [events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b]', events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b, index_or_alias)))> (task=945ff9ec87904964a0c7763467033e26, order=asc, batch_size=100, event_type=training_debug_image)
launch step two
2022-02-25 13:46:31,417 - clearml.Task - ERROR - Action failed <500/100: events.get_task_events/v1.0 (General data error (NotFoundError(404, 'index_not_found_exception', 'no such index [events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b]', events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b, index_or_alias)))> (task=88be3bfc9e784a5d8cfb7836e22ed3f3, order=asc, batch_size=100, event_type=training_debug_image)
2022-02-25 13:46:31,417 - clearml.Task - ERROR - Task deletion failed: Action failed <500/100: events.get_task_events/v1.0 (General data error (NotFoundError(404, 'index_not_found_exception', 'no such index [events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b]', events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b, index_or_alias)))> (task=88be3bfc9e784a5d8cfb7836e22ed3f3, order=asc, batch_size=100, event_type=training_debug_image)
launch step three
2022-02-25 13:46:31,684 - clearml.Task - ERROR - Action failed <500/100: events.get_task_events/v1.0 (General data error (NotFoundError(404, 'index_not_found_exception', 'no such index [events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b]', events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b, index_or_alias)))> (task=aa026690cdbc46a9bef3c53764e2dda7, order=asc, batch_size=100, event_type=training_debug_image)
2022-02-25 13:46:31,684 - clearml.Task - ERROR - Task deletion failed: Action failed <500/100: events.get_task_events/v1.0 (General data error (NotFoundError(404, 'index_not_found_exception', 'no such index [events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b]', events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b, index_or_alias)))> (task=aa026690cdbc46a9bef3c53764e2dda7, order=asc, batch_size=100, event_type=training_debug_image)
2022-02-25 14:46:37 pipeline completed with model: <xgboost.core.Booster object at 0x7f9e85a45a90>
2022-02-25 13:46:32,061 - clearml.Task - INFO - Waiting to finish uploads
2022-02-25 14:46:42
2022-02-25 13:46:41,899 - clearml.Task - INFO - Finished uploading

  
  
Posted 2 years ago

Answers 14


Yep... something went wrong with the elastic container, I think it lost its indexes (or they got corrupted somehow).
Do you have a backup of the persistence volume attached to the container? Can you try restoring it?

I would restart the entire clearml-server (docker-compose). Then can you post the startup logs here? They should provide some info on what's wrong.

  
  
Posted 2 years ago

It's the same error I'm getting on the ClearML dashboard.

  
  
Posted 2 years ago

Actually, in the apiserver logs I see this:

  
  
Posted 2 years ago

How did you do that?

  
  
Posted 2 years ago

AgitatedDove14 in the logs I see nothing out of the ordinary, and I tried redeploying the container and removing the persistence volume attached to it, but I still got the same error.

  
  
Posted 2 years ago

BulkyTiger31 could it be that there is some issue with the elastic container?
Can you see any experiments' metrics?
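For instance, a quick way to check from the SDK whether metric events can be read back at all (a minimal sketch, assuming `clearml` is configured against your server; the task ID below is just a placeholder):

```python
from clearml import Task

# Placeholder ID - use the ID of any experiment that reported scalars
task = Task.get_task(task_id="<some_task_id>")

# Reads scalar events back from the server (the same Elasticsearch-backed
# events store the pipeline error points at); an empty result or an error
# here would also point to missing/broken elastic indices
print(task.get_reported_scalars())
```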

  
  
Posted 2 years ago

CostlyOstrich36 on the pipeline decorator there is a cache parameter; I disabled it.
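For reference, roughly what I mean (a minimal sketch; the step name and body are placeholders):

```python
from clearml.automation.controller import PipelineDecorator

# cache=True would reuse a previously executed copy of this step instead of
# launching it again; setting cache=False is how I disabled the caching
@PipelineDecorator.component(return_values=["data"], cache=False)
def prepare_data():
    # placeholder step body
    return [1, 2, 3]
```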

  
  
Posted 2 years ago

Can you gain access to the apiserver logs?

  
  
Posted 2 years ago

I have ClearML running on a k8s cluster

  
  
Posted 2 years ago

Can you give an example of a pipeline to play with?
Are you running a self-deployed server?

  
  
Posted 2 years ago

and I'm executing the pipeline script locally
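Roughly this shape (a stripped-down sketch, not my actual pipeline; names and step bodies are placeholders):

```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["data"], cache=True)
def step_one():
    # placeholder: load / prepare some data
    return [1, 2, 3]

@PipelineDecorator.component(return_values=["model"], cache=True)
def step_two(data):
    # placeholder: "train" something on the data
    return sum(data)

@PipelineDecorator.pipeline(name="demo pipeline", project="demo", version="0.0.1")
def run_pipeline():
    data = step_one()
    model = step_two(data)
    print(f"pipeline completed with model: {model}")

if __name__ == "__main__":
    # run the steps from the local machine instead of enqueuing them to agents
    PipelineDecorator.run_locally()
    run_pipeline()
```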

  
  
Posted 2 years ago

all the messages are like that

  
  
Posted 2 years ago

I think the issue is coming from task caching, because once I deactivated it, things started working again.

  
  
Posted 2 years ago
662 Views
14 Answers
2 years ago
one year ago