Hi @<1541592204353474560:profile|GhastlySeaurchin98> , how are you running the experiments - which type of machines - local or cloud? Are you running your own server or using the community?
We are running it locally on a self-hosted server with a single 3080 Ti. The ClearML server and worker are on the same machine.
We are using the community version, I believe, since we are not paying.
Is that what you meant by "own server or community"?
It looks like you're on a self-hosted server; the community server is app.clear.ml, where you can just sign up and don't have to maintain your own server 🙂
Hi @<1541592204353474560:profile|GhastlySeaurchin98>
During our first large hyperparameter run, we have noticed that some tasks get aborted with the following console log:
This looks like the HPO algorithm doing early stopping. Which algo are you using?
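For context, in a ClearML HPO run it is the optimizer (the controller task) that decides to abort under-performing trials, not your training script. A minimal sketch of such a controller, assuming the Optuna backend; the base task ID, parameter names, metric names and queue below are placeholders:

# Sketch of a ClearML HPO controller using the Optuna backend (placeholder values).
from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer,
    UniformParameterRange,
    DiscreteParameterRange,
)
from clearml.automation.optuna import OptimizerOptuna

# Controller task that orchestrates the search
task = Task.init(
    project_name="HPO",
    task_name="hpo controller",
    task_type=Task.TaskTypes.optimizer,
)

optimizer = HyperParameterOptimizer(
    base_task_id="<base-training-task-id>",  # placeholder: the task to clone per trial
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-2),
        DiscreteParameterRange("General/batch_size", values=[32, 64, 128]),
    ],
    objective_metric_title="validation",   # scalar the optimizer monitors
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=OptimizerOptuna,       # Optuna decides when to abort (prune) trials
    execution_queue="default",
    max_number_of_concurrent_tasks=1,
    max_iteration_per_job=100,
    total_max_jobs=50,
)

optimizer.start()
optimizer.wait()
optimizer.stop()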
@<1523701070390366208:profile|CostlyOstrich36> yeah, then we are definitely on self-hosted.
@<1523701205467926528:profile|AgitatedDove14> yeah, just checked my code and here's a snippet which uses EarlyStopping:
# Keras callbacks used during training (assuming tf.keras imports)
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler

history = regressor.fit(
    x=train_generator,
    validation_data=test_generator,
    epochs=epochs,
    callbacks=[
        board,
        LearningRateScheduler(scheduler),
        # stop training when the training loss plateaus for 20 epochs
        EarlyStopping(monitor="loss", patience=20, restore_best_weights=True),
    ],
)
But does it mean it will abort the whole experiment? It's a problem since we are fitting multiple models in a single experiment.
@<1541954614918647808:profile|EmaciatedCormorant16> FYI
@<1541592204353474560:profile|GhastlySeaurchin98> , I think this is more related to how Optuna works: it is Optuna that aborts the experiment. You would probably need to modify something for it to run the way you want.
@<1523701070390366208:profile|CostlyOstrich36> thanks for your insight. Then we will probably check Optuna's code and see if we can eliminate this behaviour; if not, we are gonna rethink our workflow.
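For reference, the abort comes from Optuna's trial pruning. If you end up driving Optuna directly rather than through the ClearML wrapper, that behaviour can be switched off by passing a no-op pruner when the study is created; a minimal sketch, where train_and_evaluate is a hypothetical stand-in for your training loop:

# Sketch: disabling Optuna's trial pruning with a no-op pruner.
# Only applies if you create the study yourself; train_and_evaluate is hypothetical.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    # ... train the model and return the validation loss ...
    return train_and_evaluate(lr)  # hypothetical helper

study = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.NopPruner(),  # never prunes, so every trial runs to completion
)
study.optimize(objective, n_trials=50)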