now trying with added lines as Alon suggested:
```
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
)...
```
I get the same error with those added lines
Unfortunately, waiting a while did not make this go away 🙂
TimelyPenguin76 , Could the problem be related to an error in the log of the previous step (which completed successfully)?
```
2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_M...
```
Hey Alon,
See
https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1
that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.
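Roughly like this (a minimal reproduction sketch; the dataset project/name here are made up, not the real ones):
```python
# Sketch: fetching a local copy of a dataset that has parent datasets,
# outside of any Task, is enough to hit the error.
from clearml import Dataset

ds = Dataset.get(
    dataset_project="clearml_evaluation",       # hypothetical project
    dataset_name="some_dataset_with_parents",   # hypothetical dataset created with parent_datasets=[...]
)
local_path = ds.get_local_copy()                # this call triggers the bug
print(local_path)
```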
on the same topic: what if (I were able to iterate and) I wanted the pipeline calls to be blocking, so that the next pipeline executes only after the previous one completes?
here is the log from the failing component:
```
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/locks/portalocker.py", line 140, in lock
    fcntl.flock(file_.fileno(), flags)
BlockingIOError: [Errno 11] Resource temporarily unavailable
```
(I'm going to stop the autoscaler, terminate all the instances and clone the autoscaler and retry it all from the beginning)
also - some issue on the autoscaler side:
If Dataset.upload() does not crash or return a success value that I can check and
Are you saying that with this error showing, the data upload does not crash?
Unfortunately that is correct. It continues as if nothing happened!
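The upload call itself is along these lines (a sketch with made-up names and paths; the point is that upload() logs the SSL error but never raises, and there is no return value I can check):
```python
# Sketch (hypothetical names/paths): upload() logs the failure but does not raise.
from clearml import Dataset

ds = Dataset.create(
    dataset_project="clearml_evaluation",    # hypothetical project
    dataset_name="throttled_upload_test",    # hypothetical dataset name
)
ds.add_files("/data/some_large_folder")      # hypothetical local path
ds.upload(max_workers=1)                     # with the connection throttled, chunks fail silently
ds.finalize()
```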
To replicate this in Linux (even with max_workers=1):
Follow https://averagelinuxuser.com/limit-bandwidth-linux/ to throttle your connection: sudo apt-get install wondershaper
Throttle your connection to 1mb/s with somethin...
Start a training task that does the upload.
From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function
```
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    ...
```
I have tried this several times now. Sometimes one runs and the other fails, and sometimes both fail with this same error.
the component is called twice in the pipeline, using a ThreadPoolExecutor to parallelize the training steps
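The calling pattern in the pipeline function is roughly this (a sketch, not my actual code; parameter values are made up and train_image_classifier_component is the component shown above):
```python
# Sketch: two training steps submitted in parallel from the pipeline logic.
from concurrent.futures import ThreadPoolExecutor
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.pipeline(name="image-classifier-pipeline", project="examples", version="0.0.1")
def pipeline_logic(clearml_dataset):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(
                train_image_classifier_component,   # component defined above
                clearml_dataset=clearml_dataset,
                backbone_name=backbone,              # the swept parameter
                image_resize=224,
                batch_size=32,
                run_model_uri=None,
                run_tb_uri=None,
                local_data_path="/data",             # hypothetical path
                num_epochs=2,
            )
            for backbone in ("resnet18", "resnet34")
        ]
        results = [f.result() for f in futures]
    return results
```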
That would be a better message. However, I must have misunderstood the meaning of auto_create=True;
I thought that flag made the get function into a "get-or-create".
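i.e. I expected something like this to behave as a get-or-create (sketch; the dataset names are made up):
```python
# What I expected auto_create=True to do: return the existing dataset
# if found, otherwise create a new one.
from clearml import Dataset

ds = Dataset.get(
    dataset_project="clearml_evaluation",   # hypothetical project
    dataset_name="run_datasets",            # hypothetical dataset name
    auto_create=True,                       # expected "get-or-create" behaviour
)
```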
also, whereas the pipeline agent's log has:
```
Executing task id [7a0ad1fb243a4ff3b9e6c477442ded4a]: repository = git@github.com:shpigi/clearml_evaluation.git branch = main version_num = e045904094cf2f4fa61ce92f7b91682f5de64ab8
```
the component agent's log has:
```
Executing task id [90de043e354b4b28a84d5cc0788fe63c]: repository = branch = version_num =
```
in order for the autoscaler to access your git, in the wizard you have to provide the git user/token
git_pass has the token
Perhaps I should have mentioned that I start the AWS autoscaler with the app at https://app.clear.ml/applications/aws-autoscaler/ .
Hmm, what does the decorator of the component look like? Meaning, did you specify a repo/branch/commit there?
Neither my pipeline decorator nor my component specifies any repo:
```
# pipeline
@PipelineDecorator.pipeline(
    name=...
```
Yes. I thought this happened automagically with the current git repo when I send a pipeline for execution from my local python environment. Shouldn't it?
It seems to have happened with the agent running the pipeline task.
I'll try adding repo and repo_branch to the pipeline.component decorator
AgitatedDove14
Adding repo and repo_branch to the pipeline.component decorator worked (and I can move on to my next issue 🙂 ).
I'm still unclear on why cloning the repo in use happens automatically for the pipeline task and not for component tasks.
Sure. It is a minor change from the code in the clearml examples for pipelines.
I just repeat the last two pipeline steps from that code in a loop (x3)
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
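Schematically it looks like this (a sketch based on that example, not my exact code; the step bodies are stand-ins):
```python
# Sketch: the last two steps of the example are repeated in a loop (x3).
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["model"], cache=True)
def step_train(data, seed):
    # stand-in for the real training logic
    return {"seed": seed, "data": data}

@PipelineDecorator.component(return_values=["accuracy"], cache=True)
def step_evaluate(model):
    # stand-in for the real evaluation logic
    return 0.0

@PipelineDecorator.pipeline(name="looped-pipeline", project="examples", version="0.0.1")
def pipeline_logic(data):
    for seed in range(3):                 # repeat the last two steps x3
        model = step_train(data, seed)
        accuracy = step_evaluate(model)
        print(seed, accuracy)
```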
I think this should be a valid use of pipelines. For example, at some step I choose to sweep across several values of some parameter, and the rest of the steps are duplicated for each value of that parameter.
The additional edges in the graph suggest that these steps somehow contain dependencies that I do not wish them to have.
oops, I deleted two messages here because I had a bug in a test I've done.
I'm retesting now
I found that instead of returning some_returned_url (which triggers zipping and saving of the files under that url), I can wrap it in a dict: {"the url": some_returned_url}, which lets me pass the url back to the pipeline so that only the dict gets uploaded (e.g. {'run_datasets_path': Path('/data/my_datasets_path/run_id_1')}).
I can divert all files that I do want uploaded and tracked by clearml to gs:// by adding at the start of the task function:
` Logger.current_logger().se...
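The exact call is truncated above; my assumption is that it is set_default_upload_destination, i.e. something like this sketch:
```python
# Sketch - assuming the truncated call above is set_default_upload_destination.
from clearml import Logger

Logger.current_logger().set_default_upload_destination("gs://clearml-evaluation/")
```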
Is there a way to set the default upload destination for all tasks in my ~/clearml.conf ?
.. yes, by setting files_server: gs://clearml-evaluation/
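i.e. something along these lines in ~/clearml.conf (a sketch, only the relevant key shown):
```
api {
    # default file server used for uploads
    files_server: "gs://clearml-evaluation/"
}
```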
Sort of, though it seems like the rules for model.name can be a bit non-obvious.
I think that the first model saved gets the task name as its name and the following models take f"{task_name} - {file_name}"
I suppose one way to perform this is with a scheduler ( https://clear.ml/docs/latest/docs/references/sdk/scheduler ) that kicks off a health-check task (checking the exit state of executed tasks). It seems more efficient to support a triggered response to task failure.
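Something like this is what I had in mind (a rough sketch; the project name and interval are made up, and the exact scheduler semantics are my assumption):
```python
# Sketch: periodically run a function that looks for failed tasks and reacts.
from clearml import Task
from clearml.automation import TaskScheduler

def health_check():
    failed = Task.get_tasks(
        project_name="lavi-testing",            # hypothetical project
        task_filter={"status": ["failed"]},
    )
    for t in failed:
        print(f"task {t.id} ({t.name}) failed")  # follow-up handling would go here

scheduler = TaskScheduler()
scheduler.add_task(schedule_function=health_check, minute=30)  # run periodically
scheduler.start()
```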
There may be cases where failure occurs before my code starts to run (and, perhaps, after it completes)
Yes.
Some mechanism that would allow for followup code execution. Ideally in a way that would not be susceptible to the same things that may cause a task to fail.
that's strange because, opening the currently running autoscaler config, I see this:
did you mean that I was running in CPU mode? I tried both, but I'll try CPU mode with that base docker image