I started two pipelines (using the AWS autoscaler in app.clear.ml). The pipelines ran concurrently, using the same pipeline code. Both failed in the same component halfway through the pipeline run with:
clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable
(all components were assigned to the "default" queue)
AWS shows 4 instances up to support the default queue.

  
  
Posted one year ago

Answers 22


Now trying with the added lines as Alon suggested:
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    # Workaround suggested for https://bugs.python.org/issue43743
    import shutil
    shutil._USE_CP_SENDFILE = False

    import sys

    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions

    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
  
  
Posted one year ago

Another issue, which may or may not be related:
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All Experiments" view show the component as running.
The component's console shows that the last command was a docker run command.

  
  
Posted one year ago

Switching back to version 1.6.2 cleared this issue (but re-introduced others, for which I have been using the release candidate).

  
  
Posted one year ago

After restarting the autoscaler, the instances, and a single running pipeline, I still get the same error:
clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable

  
  
Posted one year ago

I get the same error with those added lines.

  
  
Posted one year ago

TimelyPenguin76, could the problem be related to an error in the log of the previous step (which completed successfully)?
2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')))
2022-07-26 04:26:06,676 - clearml.Task - INFO - Finished uploading
Process completed successfully
The first task returns a clearml.Dataset that is passed to the task that fails to start with that lock error.
(In this case - the dataset already exists so the step just finds it and returns it)
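For reference, that first step is roughly shaped like the sketch below (the function and argument names are illustrative placeholders, not the actual pipeline code); it just looks up the existing dataset and returns it:
` from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["dataset"], cache=True)
def get_dataset_component(dataset_project, dataset_name):
    from clearml import Dataset

    # The dataset already exists, so the step simply finds it and returns it.
    return Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name) `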

  
  
Posted one year ago

I'm checking it

  
  
Posted one year ago

Hey PanickyMoth78 ,

Regarding
clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable
I read a bit of https://bugs.python.org/issue43743 ; can you try the suggested workaround (just as a check)?

Adding

import shutil
shutil._USE_CP_SENDFILE = False

on top?

  
  
Posted one year ago

PanickyMoth78 are you getting this from the app or one of the tasks?

  
  
Posted one year ago

Also, weirdly, the failed pipeline task is sometimes marked as failed and at other times it is marked as completed.

  
  
Posted one year ago

correct

  
  
Posted one year ago

It's trying to start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function:
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    import sys

    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions

    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
  
  
Posted one year ago

Hey Alon,
See
https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1 that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.
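A minimal sketch of that reproduction (the dataset id is a placeholder for any dataset that has parent datasets):
` from clearml import Dataset

# placeholder id - any dataset that was created with parent datasets
ds = Dataset.get(dataset_id="<dataset-with-parents-id>")

# with clearml 1.6.3rc1 this call hit the flock error for me
local_path = ds.get_local_copy()
print(local_path) `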

  
  
Posted one year ago

Unfortunately, waiting a while did not make this go away 🙂

  
  
Posted one year ago

The same occurs when I run a single training component instead of two.

  
  
Posted one year ago

also - some issue on the autoscaler side:

  
  
Posted one year ago

I have tried this several times now. Sometimes one runs and the other fails, and sometimes both fail with this same error.

  
  
Posted one year ago

(I'm going to stop the autoscaler, terminate all the instances and clone the autoscaler and retry it all from the beginning)

  
  
Posted one year ago

PanickyMoth78, can I verify the setup with you?
Python 3.8?
nvidia/cuda:11.2.2-runtime-ubuntu20.04 as the image?

  
  
Posted one year ago

The component is called twice in the pipeline, using a ThreadPoolExecutor to parallelize the training steps.
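Roughly like the sketch below (not the actual pipeline code; the dataset variable and argument values are illustrative placeholders):
` from concurrent.futures import ThreadPoolExecutor

# inside the @PipelineDecorator.pipeline function: submit the same component
# twice in parallel, once per backbone (placeholder values)
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [
        executor.submit(
            train_image_classifier_component,
            clearml_dataset=dataset,
            backbone_name=backbone,
            image_resize=224,
            batch_size=32,
            run_model_uri=None,
            run_tb_uri=None,
            local_data_path="/data",
            num_epochs=2,
        )
        for backbone in ["resnet18", "resnet34"]
    ]
    run_results = [f.result() for f in futures] `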

  
  
Posted one year ago

Here is the log from the failing component:
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/locks/portalocker.py", line 140, in lock
    fcntl.flock(file_.fileno(), flags)
BlockingIOError: [Errno 11] Resource temporarily unavailable
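For context, that error is what a non-blocking fcntl.flock call raises when another open handle already holds the lock. A standalone sketch of the same failure mode, independent of clearml (the lock file path is just a placeholder):
` import fcntl
import os

lock_path = "/tmp/flock_demo.lock"  # placeholder, not ClearML's actual lock file

# The first handle takes an exclusive, non-blocking lock.
f1 = open(lock_path, "w")
fcntl.flock(f1.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

# A second handle on the same file cannot get the lock and raises
# BlockingIOError: [Errno 11] Resource temporarily unavailable
f2 = open(lock_path, "w")
try:
    fcntl.flock(f2.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError as err:
    print(err)
finally:
    f1.close()
    f2.close()
    os.remove(lock_path) `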

  
  
Posted one year ago

Hi PanickyMoth78 ,

What is the step trying to do when you hit the exception?

  
  
Posted one year ago