Hey PanickyMoth78,
Regarding `clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable`:
I read through https://bugs.python.org/issue43743 a bit; can you try the suggested workaround (just as a check)?
adding
` import shutil
shutil._USE_CP_SENDFILE = False `
on top?
switching back to version 1.6.2 cleared this issue (but re-introduced others for which I have been using the release candidate)
After restarting the autoscaler, the instances, and a single running pipeline, I still get the same error: `clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable`
The failure happens when the pipeline tries to start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function:
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    import sys
    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions
    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
now trying with the added lines, as Alon suggested:
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    import shutil
    shutil._USE_CP_SENDFILE = False
    import sys
    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions
    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
I get the same error with those added lines
TimelyPenguin76, could the problem be related to an error in the log of the previous step (which completed successfully)?
` 2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')))
2022-07-26 04:26:06,676 - clearml.Task - INFO - Finished uploading
Process completed successfully `
The first task returns a clearml.Dataset that is passed to the task that fails to start with that lock error.
(In this case the dataset already exists, so the step just finds it and returns it.)
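Roughly, that dataset step looks something like the sketch below; the project and dataset names here are just placeholders, not the real ones from my pipeline:
` @PipelineDecorator.component(return_values=["dataset"], cache=True)
def fetch_dataset_component(dataset_project, dataset_name):
    from clearml import Dataset
    # Look up the existing dataset version and return it to the next step
    return Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name) `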
Unfortunately, waiting a while did not make this go away 🙂
Hey Alon,
See
https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1
that can be reproduced outside of a task / app simply by doing `get_local_copy()` on a dataset with parents.
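To be concrete, this is roughly the kind of snippet I mean; the project and dataset names are placeholders, the only assumption being that the dataset was created with `parent_datasets=[...]`:
` from clearml import Dataset

# Placeholder names -- any dataset created with one or more parent datasets should do
ds = Dataset.get(dataset_project="some_project", dataset_name="child_dataset")

# On clearml 1.6.3rc1 this call is where the LockException / [Errno 11] shows up for me
local_path = ds.get_local_copy()
print(local_path) `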
PanickyMoth78 are you getting this from the app or one of the tasks?
the component is called twice in the pipeline using a `ThreadPoolExecutor` to parallelize training steps
I have tried this several times now. Sometimes one runs and the other fails, and sometimes both fail with this same error.
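The parallel calls in the pipeline body look roughly like this sketch; the parameter values, URIs, and dataset names are placeholders, assuming the `train_image_classifier_component` defined above:
` from concurrent.futures import ThreadPoolExecutor
from clearml import Dataset

# Placeholder inputs -- in the real pipeline these come from earlier steps / config
dataset = Dataset.get(dataset_project="some_project", dataset_name="child_dataset")
backbones = ["resnet18", "resnet34"]

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [
        executor.submit(
            train_image_classifier_component,
            clearml_dataset=dataset,
            backbone_name=backbone,
            image_resize=224,
            batch_size=32,
            run_model_uri="gs://clearml-evaluation/models",
            run_tb_uri="gs://clearml-evaluation/tb",
            local_data_path="/data",
            num_epochs=5,
        )
        for backbone in backbones
    ]
    run_results = [f.result() for f in futures] `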
Hi PanickyMoth78,
What is the step trying to do when you hit the exception?
here is the log from the failing component:
` File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/locks/portalocker.py", line 140, in lock
    fcntl.flock(file_.fileno(), flags)
BlockingIOError: [Errno 11] Resource temporarily unavailable `
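For context, `[Errno 11]` is what a non-blocking `fcntl.flock()` raises when the lock is already held elsewhere. A standalone illustration (plain Python, not ClearML code):
` import fcntl

# Two independent open file descriptions on the same lock file
first = open("/tmp/demo.lock", "w")
second = open("/tmp/demo.lock", "w")

# The first non-blocking exclusive lock succeeds
fcntl.flock(first.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

try:
    # The second one fails immediately because the lock is already held
    fcntl.flock(second.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError as e:
    print(e)  # [Errno 11] Resource temporarily unavailable `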
the same occurs when I run a single training component instead of two
PanickyMoth78, can I verify the setup with you?
Python 3.8?
nvidia/cuda:11.2.2-runtime-ubuntu20.04 as the image?
Another issue, which may or may not be related.
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All Experiments" view show the component as running.
The component's console shows that the last command was a `docker run` command.
also weirdly, the failed pipeline task is sometimes marked as failed
and at other times it is marked as completed
(I'm going to stop the autoscaler, terminate all the instances, clone the autoscaler, and retry it all from the beginning)