Yes.
Some mechanism that would allow for follow-up code execution, ideally in a way that would not be susceptible to the same things that may cause a task to fail.
The component is called twice in the pipeline, using a ThreadPoolExecutor to parallelize the training steps.
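Roughly like this, inside the pipeline body (a rough sketch only; the argument values are placeholders and train_image_classifier_component is the component shown further down):
```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder inputs, just to illustrate the call pattern
dataset = "my-dataset-id"                         # placeholder
run_model_uri = "gs://clearml-evaluation/models"  # placeholder
run_tb_uri = "gs://clearml-evaluation/tb"         # placeholder

# Submit the training component twice, in parallel threads
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [
        executor.submit(
            train_image_classifier_component,
            clearml_dataset=dataset,
            backbone_name=backbone,
            image_resize=224,
            batch_size=32,
            run_model_uri=run_model_uri,
            run_tb_uri=run_tb_uri,
            local_data_path="/data",
            num_epochs=3,
        )
        for backbone in ["resnet18", "resnet34"]
    ]
    results = [f.result() for f in futures]
```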
That would be a better message. However, I must have misunderstood the meaning of auto_create=True
I thought that flag made the get function into a "get-or-create"
now trying with added lines as Alon suggested:
` @PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
)...
The same occurs when I run a single training component instead of two.
I get the same error with those added lines
I've also not figured out how to modify the examples above to wait for one pipeline to end before the next begins.
TimelyPenguin76, could the problem be related to an error in the log of the previous step (which completed successfully)?
` 2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_M...
Another issue, which may or may not be related.
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All Experiments" view show the component as running.
The component's console shows that the last command was a docker run command.
Unfortunately, waiting a while did not make this go away 🙂
yes
here is the true "my_pipeline" declaration:
` @PipelineDecorator.pipeline(
    name="fastai_image_classification_pipeline",
    project="lavi-testing",
    target_project="lavi-testing",
    version="0.2",
    multi_instance_support="",
    add_pipeline_tags=True,
    abort_on_failure=True,
)
def fastai_image_classification_pipeline(
    run_tags: List[str],
    i_dataset: int,
    backbone_names: List[str],
    image_resizes: List[int],
    batch_sizes: List[int],
    num_train_epochs: i...
On the same topic: what if (I were able to iterate and) I wanted the pipeline calls to be blocking, so that the next pipeline executes only after the previous one completes?
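Something like this is what I have in mind (a rough sketch, assuming a call to the decorated pipeline blocks until the pipeline completes when multi_instance_support is off; the argument values are placeholders):
```python
# Run the pipelines one after another: each call only returns
# once its pipeline has finished, so the next one starts afterwards
for i_dataset in range(3):
    fastai_image_classification_pipeline(
        run_tags=["sequential-run"],
        i_dataset=i_dataset,
        backbone_names=["resnet18", "resnet34"],
        image_resizes=[224],
        batch_sizes=[32],
        num_train_epochs=3,
    )
```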
Oops, should it have been multi_instance_support=True?
(fixed small typo in code above just now)
My local environment has clearml version 1.6.3rc0
and agents in AWS were started with the AWS Autoscaler, which has no explicit place for Google credentials.
I see a place for "Additional ClearML Configuration" in the AWS autoscaler UI, which I suspect may help, but I don't see how I can pass a secrets file along with my agent.
For anyone following: you can "inject" a credentials JSON file for a Google Cloud service account, so as to get access to your Google Cloud Storage from agents on AWS EC2 instances managed by the AWS autoscaler, by providing the following in the ADDITIONAL CLEARML CONFIGURATION when starting the autoscaler:
` sdk.google.storage.credentials_json: "/root/gs.cred"
sdk.google.storage.project: "<my-gcp-project-id>"
files {
    gsc {
        contents: """<copy-paste the contents of yo...
In order for the autoscaler to access your git, in the wizard you have to provide the git user/token.
git_pass has the token.
Perhaps I should have mentioned that I start the AWS autoscaler with the https://app.clear.ml/applications/aws-autoscaler/ app.
Hmm, what does the decorator of the component look like? Meaning, did you specify a repo/branch/commit there?
Neither my pipeline decorator nor my component specifies any repos:
` # pipeline
@PipelineDecorator.pipeline(
    name=...
Would you expect this fastai callback to work?
(Uses SummaryWriter):
https://github.com/fastai/fastai/blob/d7f4863f1ee3c0fa9f2d9feeb6a05f0625a53696/fastai/callback/tensorboard.py
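For reference, I attach it roughly like this (a minimal sketch on a toy dataset, not my actual training code; the log_dir value is just an example):
```python
from fastai.vision.all import *
from fastai.callback.tensorboard import TensorBoardCallback

# Tiny public dataset, just to exercise the callback
path = untar_data(URLs.MNIST_TINY)
dls = ImageDataLoaders.from_folder(path)
learn = vision_learner(dls, resnet18, metrics=accuracy)

# The callback writes TensorBoard events via a SummaryWriter
learn.fit_one_cycle(1, cbs=[TensorBoardCallback(log_dir="runs/exp1", trace_model=False)])
```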
It seems to have failed as well (but I'd need to check more carefully)
I believe n1-standard-8 would work for that. I initially just tried going with the autoscaler defaults, which have GPU on but with n1-standard-1 specified as the machine.
I'll give it a try.
And if I wanted to support GPU in the default queue, are you saying that I'd need a different machine from the n1-standard-1?
Switching the base image seems to have failed with the following error: 2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locally
attached is a pipeline task log file
Is there any chance the experiment itself has a docker image specified?
It does not, as far as I know. The decorators do not have docker fields specified.
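For completeness, this is how I understand an image would be specified explicitly on a component if I wanted to (a sketch; my real decorators don't pass any docker arguments, and the image name is just an example):
```python
from clearml import TaskTypes
from clearml.automation.controller import PipelineDecorator

# Sketch: explicitly pinning a docker image on the component decorator
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    docker="nvidia/cuda:11.3.1-runtime-ubuntu20.04",  # example image only
)
def train_image_classifier_component(clearml_dataset, backbone_name):
    # body omitted; same as my real component
    ...
```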
Trying to switch to a resource using GPU-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without GPU, even though I checked that my settings (n1-standard-1 and nvidia-tesla-t4 and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make VM instances, and my GCP autoscaler...
I'm on clearml 1.6.2
The Jupyter notebook service and two clearml agents (version 1.3.0; one in queue "default", and one in queue "services" with the --cpu-only flag) are all running inside a docker container.
If I run from the terminal, I see: ValueError: Task object can only be updated if created or in_progress [status=stopped fields=['configuration']]