Now trying with the added lines as Alon suggested:
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    # suggested workaround for the flock error (bpo-43743): disable sendfile-based copies
    import shutil
    shutil._USE_CP_SENDFILE = False

    # make the repo code importable inside the component and delegate to the
    # non-component training function
    import sys
    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions

    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
Also, weirdly, the failed pipeline task is sometimes marked as failed
and at other times it is marked as completed.
The same occurs when I run a single training component instead of two.
PanickyMoth78 are you getting this from the app or one of the tasks?
Start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function.
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    import sys
    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions

    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
(I'm going to stop the autoscaler, terminate all the instances, clone the autoscaler, and retry it all from the beginning.)
Switching back to version 1.6.2 cleared this issue (but re-introduced the other issues for which I have been using the release candidate).
Another issue, which may or may not be related:
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All Experiments" view show the component as running.
The component's console shows that the last command was a docker run command.
Hi PanickyMoth78,
What is the step trying to do when you hit the exception?
I get the same error with those added lines
here is the log from the failing component:
` File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/locks/portalocker.py", line 140, in lock
    fcntl.flock(file_.fileno(), flags)
BlockingIOError: [Errno 11] Resource temporarily unavailable `
Hey PanickyMoth78,
Regarding clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable
I read a bit of https://bugs.python.org/issue43743 ; can you try the suggested workaround (just to check)?
adding
` import shutil
shutil._USE_CP_SENDFILE = False `
on top?
I have tried this several times now. Sometimes one runs and the other fails, and sometimes both fail with this same error.
PanickyMoth78 can I verify the setup with you?
python 3.8?
nvidia/cuda:11.2.2-runtime-ubuntu20.04 as image?
The component is called twice in the pipeline, using a ThreadPoolExecutor to parallelize the training steps.
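(For context, a minimal sketch of what that parallel invocation could look like, assuming the component definition above is in scope. The pipeline name, project, and configuration values are placeholders, not the actual pipeline code.)
` # Sketch only: calls the training component twice in parallel via ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor

from clearml import PipelineDecorator

@PipelineDecorator.pipeline(
    name="image_classifier_pipeline",  # placeholder pipeline name
    project="clearml_evaluation",      # placeholder project name
    version="0.1",
)
def training_pipeline(clearml_dataset, local_data_path):
    # two hypothetical training configurations to run in parallel
    configs = [
        dict(backbone_name="resnet18", image_resize=224, batch_size=32, num_epochs=2),
        dict(backbone_name="resnet34", image_resize=224, batch_size=32, num_epochs=2),
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(
                train_image_classifier_component,  # the component defined above
                clearml_dataset,
                cfg["backbone_name"],
                cfg["image_resize"],
                cfg["batch_size"],
                None,  # run_model_uri (placeholder)
                None,  # run_tb_uri (placeholder)
                local_data_path,
                cfg["num_epochs"],
            )
            for cfg in configs
        ]
        return [f.result() for f in futures] `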
Hey Alon,
See https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1
that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.
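(For reference, a minimal reproduction along those lines might look like the sketch below; the dataset project and name are placeholders, and the only assumption is that the dataset was created with one or more parent datasets.)
` from clearml import Dataset

# fetch an existing dataset that has parent datasets (placeholder project / name)
ds = Dataset.get(
    dataset_project="my_project",
    dataset_name="my_dataset_with_parents",
)

# on clearml 1.6.3rc1 this call is where the
# "[Errno 11] Resource temporarily unavailable" lock error showed up
local_path = ds.get_local_copy()
print(local_path) `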
TimelyPenguin76, could the problem be related to an error in the log of the previous step (which completed successfully)?
` 2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')))
2022-07-26 04:26:06,676 - clearml.Task - INFO - Finished uploading
Process completed successfully `
The first task returns a clearml.Dataset that is passed to the task that fails to start with that lock error.
(In this case, the dataset already exists, so the step just finds it and returns it.)
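(For completeness, a rough sketch of what that preceding step could look like; the component name, arguments, and lookup logic are assumptions rather than the actual code.)
` from clearml import PipelineDecorator, TaskTypes

@PipelineDecorator.component(
    return_values=["clearml_dataset"],
    cache=True,
    task_type=TaskTypes.data_processing,
)
def get_dataset_component(dataset_project, dataset_name):
    from clearml import Dataset

    # the dataset already exists, so this step just finds it and returns the
    # Dataset object, which the pipeline then passes to train_image_classifier_component
    return Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name) `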
Unfortunately, waiting a while did not make this go away 🙂
After restarting the autoscaler, the instances, and a single running pipeline, I still get the same error:
` clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable `