Now trying with the added lines as Alon suggested:
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    # suggested workaround for the flock error (bpo-43743): disable sendfile-based copies
    import shutil
    shutil._USE_CP_SENDFILE = False

    # make the repo code importable inside the component and delegate to the
    # non-component training function
    import sys
    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions

    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
Also, weirdly, the failed pipeline task is sometimes marked as failed
and at other times it is marked as completed.
The same occurs when I run a single training component instead of two.
PanickyMoth78 are you getting this from the app or one of the tasks?
Start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function.
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
):
    import sys
    sys.path.insert(0, "/src/clearml_evaluation/")
    from image_classifier_training import training_functions

    return training_functions.train_image_classifier(
        clearml_dataset,
        backbone_name,
        image_resize,
        batch_size,
        run_model_uri,
        run_tb_uri,
        local_data_path,
        num_epochs,
    ) `
(I'm going to stop the autoscaler, terminate all the instances, clone the autoscaler, and retry it all from the beginning.)
Switching back to version 1.6.2 cleared this issue (but re-introduced the other issues for which I have been using the release candidate).
Another issue, which may or may not be related:
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All Experiments" view show the component as running.
The component's console shows that the last command was a docker run command.
Hi PanickyMoth78,
What is the step trying to do when you hit the exception?
I get the same error with those added lines
here is the log from the failing component:
` File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/locks/portalocker.py", line 140, in lock
    fcntl.flock(file_.fileno(), flags)
BlockingIOError: [Errno 11] Resource temporarily unavailable `
Hey PanickyMoth78,
Regarding clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable
I read a bit of https://bugs.python.org/issue43743 ; can you try the suggested workaround (just to check)?
adding
` import shutil
shutil._USE_CP_SENDFILE = False `
on top?
I have tried this several times now. Sometimes one runs and the other fails, and sometimes both fail with this same error.
PanickyMoth78 can I verify the setup with you?
python 3.8?
nvidia/cuda:11.2.2-runtime-ubuntu20.04 as image?
The component is called twice in the pipeline, using a ThreadPoolExecutor to parallelize the training steps.
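(For context, a minimal sketch of what that parallel invocation could look like, assuming the component definition above is in scope. The pipeline name, project, and configuration values are placeholders, not the actual pipeline code.)
` # Sketch only: calls the training component twice in parallel via ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor

from clearml import PipelineDecorator

@PipelineDecorator.pipeline(
    name="image_classifier_pipeline",  # placeholder pipeline name
    project="clearml_evaluation",      # placeholder project name
    version="0.1",
)
def training_pipeline(clearml_dataset, local_data_path):
    # two hypothetical training configurations to run in parallel
    configs = [
        dict(backbone_name="resnet18", image_resize=224, batch_size=32, num_epochs=2),
        dict(backbone_name="resnet34", image_resize=224, batch_size=32, num_epochs=2),
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(
                train_image_classifier_component,  # the component defined above
                clearml_dataset,
                cfg["backbone_name"],
                cfg["image_resize"],
                cfg["batch_size"],
                None,  # run_model_uri (placeholder)
                None,  # run_tb_uri (placeholder)
                local_data_path,
                cfg["num_epochs"],
            )
            for cfg in configs
        ]
        return [f.result() for f in futures] `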
Hey Alon,
See https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1
that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.
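(For reference, a minimal reproduction along those lines might look like the sketch below; the dataset project and name are placeholders, and the only assumption is that the dataset was created with one or more parent datasets.)
` from clearml import Dataset

# fetch an existing dataset that has parent datasets (placeholder project / name)
ds = Dataset.get(
    dataset_project="my_project",
    dataset_name="my_dataset_with_parents",
)

# on clearml 1.6.3rc1 this call is where the
# "[Errno 11] Resource temporarily unavailable" lock error showed up
local_path = ds.get_local_copy()
print(local_path) `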
TimelyPenguin76, could the problem be related to an error in the log of the previous step (which completed successfully)?
` 2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')))
2022-07-26 04:26:06,676 - clearml.Task - INFO - Finished uploading
Process completed successfully `
The first task returns a clearml.Dataset that is passed to the task that fails to start with that lock error.
(In this case, the dataset already exists, so the step just finds it and returns it.)
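(For completeness, a rough sketch of what that preceding step could look like; the component name, arguments, and lookup logic are assumptions rather than the actual code.)
` from clearml import PipelineDecorator, TaskTypes

@PipelineDecorator.component(
    return_values=["clearml_dataset"],
    cache=True,
    task_type=TaskTypes.data_processing,
)
def get_dataset_component(dataset_project, dataset_name):
    from clearml import Dataset

    # the dataset already exists, so this step just finds it and returns the
    # Dataset object, which the pipeline then passes to train_image_classifier_component
    return Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name) `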
Unfortunately, waiting a while did not make this go away 🙂
After restarting the autoscaler, the instances, and a single running pipeline, I still get the same error:
` clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable `