Answered
During our first large hyperparameter run, we have noticed that there are some tasks that get aborted with the following console log:

During our first large hyperparameter run, we have noticed that there are some tasks that get aborted with the following console log: User aborted: stopping task (3)

Does anyone know what this error code 3 is and what could cause some experiments to get aborted at random? When I stop the task from the Web UI, the code is 1. If we restart the same job from the Web UI, it completes without any issues, so it seems quite random. Machine resources are never too high during these random aborts, so we don't have any idea.

Posted one year ago

Answers 12


Did you mean this by "own server or community"?

Posted one year ago

Hi @<1541592204353474560:profile|GhastlySeaurchin98>

During our first large hyperparameter run, we have noticed that there are some tasks that get aborted with the following console log:

This looks like the HPO algorithm doing early stopping. Which algorithm are you using?

Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> yeah, just checked my code and here's a snippet which uses EarlyStopping:

        history = regressor.fit(
            x=train_generator,
            validation_data=test_generator,
            epochs=epochs,
            callbacks=[
                board,
                LearningRateScheduler(scheduler),
                EarlyStopping(monitor="loss", patience=20, restore_best_weights=True),
            ],
        )

But does it mean it will abort the whole experiment? It's a problem, since we are fitting multiple models in a single experiment.

Posted one year ago
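For context, the patience logic behind a Keras-style EarlyStopping callback can be sketched roughly like this (a simplified stdlib-only sketch, not Keras's actual implementation; the callback only interrupts the surrounding fit() call, not the whole process):

```python
# Simplified sketch of patience-based early stopping (assumed behaviour,
# not Keras's actual implementation): stop once the monitored loss has
# failed to improve for `patience` consecutive epochs.
def early_stopping_epochs(losses, patience):
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch + 1  # training stops after this epoch
    return len(losses)  # patience never exhausted; ran all epochs

# With patience=2, a loss that plateaus at epoch 3 stops training at epoch 5.
print(early_stopping_epochs([1.0, 0.9, 0.8, 0.8, 0.8, 0.8], patience=2))
```

Since this only ends the current fit() early, aborting the whole task would have to come from another layer, such as the HPO framework.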

Hi @<1541592204353474560:profile|GhastlySeaurchin98>, how are you running the experiments, and on which type of machines: local or cloud? Are you running your own server or using the community?

Posted one year ago

We are running it locally on a self-hosted server with a single 3080 Ti. The ClearML server and worker are on the same machine.

Posted one year ago

@<1523701070390366208:profile|CostlyOstrich36> yeah, then we are definitely self-hosted.

Posted one year ago

@<1541954614918647808:profile|EmaciatedCormorant16> FYI

Posted one year ago

@<1523701070390366208:profile|CostlyOstrich36> thanks for your insight. Then we will probably check Optuna's code and see if we can eliminate this behaviour; if not, we are going to rethink our workflow.

Posted one year ago

It looks like you're on a self-hosted server. The community server is app.clear.ml, where you can just sign up and don't have to maintain your own server 🙂

Posted one year ago

We are using the community version, I believe, since we are not paying.

Posted one year ago

I am using Optuna

Posted one year ago

@<1541592204353474560:profile|GhastlySeaurchin98>, I think this is more related to how Optuna works: it aborts the experiment. I think you would need to modify something in order for it to run the way you want.

Posted one year ago
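Optuna's pruners abort unpromising trials based on intermediate metric values, which a remote runner can surface as an aborted task. The core decision can be sketched like this (a simplification in the spirit of Optuna's MedianPruner, not its actual code):

```python
from statistics import median

# Median-pruner-style decision (simplified sketch, not Optuna's code):
# prune the current trial if its intermediate value is worse than the
# median of completed trials' values at the same step (lower is better).
def should_prune(current_value, completed_values_at_step):
    if not completed_values_at_step:
        return False  # nothing to compare against yet
    return current_value > median(completed_values_at_step)
```

In Optuna this surfaces as trial.should_prune() returning True and the trial being stopped; if that abort behaviour is unwanted, passing pruner=optuna.pruners.NopPruner() when creating the study should disable pruning entirely.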