Hi John. Sort of. It seems that archiving a pipeline does not also archive the tasks it contains, so /projects/lavi-testing/.pipelines/fastai_image_classification_pipeline
is a very long list.
Did you mean that I was running in CPU mode? I tried both, but I'll try CPU mode with that base docker image.
Thanks AgitatedDove14
Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
My main concern now is that this may happen within a pipeline, leading to unreliable data handling.
If Dataset.upload() does not crash or return a success value that I can check, and if Dataset.get_local_copy() also does not complain as it retrieves partial data - how will I ever know that I lost part of my dataset?
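For now I'm guarding against this with something like the following check (just a sketch of what I mean; the verification logic is my own, not something ClearML provides):
```python
from pathlib import Path
from clearml import Dataset

# Sketch: after uploading/finalizing, pull a fresh local copy and compare it
# against the dataset's own file listing so a silent partial upload shows up.
ds = Dataset.get(dataset_project="lavi-testing", dataset_name="my_dataset")
local_root = Path(ds.get_local_copy())
expected = set(ds.list_files())
actual = {str(p.relative_to(local_root)) for p in local_root.rglob("*") if p.is_file()}
missing = expected - actual
if missing:
    raise RuntimeError(f"local copy is missing {len(missing)} of {len(expected)} files")
```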
Trying to switch to a resource using GPU-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without a GPU, even though I checked that my settings (n1-standard-1, nvidia-tesla-t4, and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to create VM instances, and my gcp autoscaler...
Hi TimelyPenguin76
Thanks for working on this. The clearml GCP autoscaler is a major feature for us to have. I can't really evaluate clearml without some means of instantiating multiple agents on GCP machines, and I'd really prefer not to have to set up a k8s cluster with agents and manage scaling it myself.
I tried the settings above with two resources, one for default queue and one for the services queue (making sure I use that image you suggested above for both).
The autoscaler started up...
I can't find version 1.8.1rc1, but I believe I see a relevant change in the code of Dataset.upload in 1.8.1rc0.
I have a task where I create a dataset, but I also create a set of matplotlib figures, some numeric statistics, and a pandas table that describe the data, which I wish to have associated with the dataset and viewable from the clearml web page for the dataset.
For a component, task = Task.current_task() will get me the task object, right?
This does not work for a pipeline. Is a pipeline a task?
Edit: the same works for a pipeline too.
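Roughly what I'm doing inside that task (a sketch; the titles and the placeholder figure/table are mine):
```python
import pandas as pd
from matplotlib import pyplot as plt
from clearml import Task

# Sketch: report the dataset-describing figures and stats on the current task
# so they show up in the web UI next to the dataset-creating task.
task = Task.current_task()
logger = task.get_logger()

fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3])  # placeholder figure
logger.report_matplotlib_figure(
    title="class distribution", series="train", figure=fig, iteration=0
)

stats = pd.DataFrame({"mean": [0.5], "std": [0.1]})  # placeholder stats
logger.report_table(
    title="dataset stats", series="summary", iteration=0, table_plot=stats
)
```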
Here are screenshots of a VM I started with a GPU and one started by the autoscaler with the settings above but whose GPU is missing (both in the same GCP zone, us-central1-f). I may have misconfigured something, or perhaps the autoscaler is failing to specify the GPU requirement correctly. :shrug:
Oops, should it have been multi_instance_support=True?
I'll try a more carefully checked run a bit later but I know it's getting a bit late in your time zone
Actually, re-running pipeline_from_decorator.py a second time (and a third time) from the command line seems to have executed without that ValueError, so maybe that issue was some fluke.
Nevertheless, those runs exit prior to the line print('process completed')
and I would definitely prefer the command executing_pipeline
to not kill the process that called it.
For example, maybe, having started the pipeline, I'd like my code to also report having started the pipeline to som...
You can have parents as one of the @PipelineDecorator.component args. The step will be executed only after all the parents are executed and completed.
Is there an example of using parents some place? I'm not sure what to pass, and also how to pass a component from one pipeline that was just kicked off to execute remotely (which I'd like to block on) to a component of the next pipeline's run.
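For reference, this is what I'm guessing it looks like (a sketch; I'm assuming the names in parents are just the component function names):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["dataset_id"])
def prepare_data():
    return "some-dataset-id"

# Should run only after prepare_data completes, even though it takes no
# argument from it (with an argument the dependency would be inferred anyway).
@PipelineDecorator.component(parents=["prepare_data"])
def train():
    pass

@PipelineDecorator.pipeline(name="parents_example", project="lavi-testing", version="0.1")
def my_pipeline():
    prepare_data()
    train()
```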
Thanks,
Just to be clear, you are saying the "random" results are consistent over runs ?
yes !
By re-runs I mean re-running this script (not cloning the pipeline)
I'm on clearml 1.6.2
The jupyter notebook service and two clearml agents (version 1.3.0; one in queue "default" and one in queue "services", with the --cpu-only flag) are all running inside a docker container
I was doing it with the task that I had been using. Mostly for logging arguments that control what the dataset will contain.
On the same topic: what if (I were able to iterate and) I wanted the pipeline calls to be blocking, so that the next pipeline executes only after the previous one completes?
Yeah. I was only using the task for the process of creating the dataset.
My code does start out with a step that checks for the existence of the dataset, returning it if it exists (search by project name/dataset name/version) rather than recreating it.
I noticed the name mismatch when that check kept failing me...
I think that init-ing the encompassing task with the relevant dataset name still allows me to search for the dataset by dataset_name=task_name / project_name (shared by both datas...
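i.e. something along these lines (a sketch; the names are placeholders):
```python
from clearml import Dataset

# Sketch of the "return it if it exists, otherwise create it" step.
def get_or_create_dataset(project: str, name: str) -> Dataset:
    try:
        # Dataset.get raises if no matching dataset is found.
        return Dataset.get(dataset_project=project, dataset_name=name)
    except ValueError:
        return Dataset.create(dataset_project=project, dataset_name=name)
```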
What I think would be preferable is that the pipeline be deployed and that the python process that deployed it be allowed to continue on to whatever I had planned for it to do next (i.e. not exit).
I believe n1-standard-8 would work for that. I initially just tried going with the autoscaler defaults, which have the GPU on, but with n1-standard-1 specified as the machine.
I have google-cloud-storage==2.6.0
installed
yes
Here is the true "my_pipeline" declaration:
```python
@PipelineDecorator.pipeline(
    name="fastai_image_classification_pipeline",
    project="lavi-testing",
    target_project="lavi-testing",
    version="0.2",
    multi_instance_support="",
    add_pipeline_tags=True,
    abort_on_failure=True,
)
def fastai_image_classification_pipeline(
    run_tags: List[str],
    i_dataset: int,
    backbone_names: List[str],
    image_resizes: List[int],
    batch_sizes: List[int],
    num_train_epochs: i...
```
I don't mind assigning to the task the same name that I'd assign to the dataset. I just think that the create function should expect dataset_name to be None in the case of use_current_task=True (or allow the dataset name to differ from the task name).
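i.e. what I'd like to be able to write is roughly this (a sketch of the call I have in mind, not of current behavior):
```python
from clearml import Task, Dataset

# The encompassing task carries the name I also want for the dataset.
task = Task.init(project_name="lavi-testing", task_name="my_dataset")

# What I'd like: with use_current_task=True, dataset_name could stay None and
# the dataset would simply reuse the current task (and its name).
ds = Dataset.create(
    dataset_project="lavi-testing",
    use_current_task=True,
)
```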
I now get this error:
2022-07-18 21:51:29,168 - clearml.storage - ERROR - Failed creating storage object
Reason: [Errno 2] No such file or directory: '~/gs.cred'
To be clear, I replaced <this is your GCP storage credentials file> with the contents of that file, escaping every " with a \" and removing newlines.
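For reference, this is roughly how I produced the escaped string (my own one-off helper, nothing ClearML-specific):
```python
# Sketch: flatten the GCP service-account JSON into a single line with escaped
# quotes, which is what I pasted into the autoscaler configuration field.
with open("gs.cred") as f:  # example path
    raw = f.read()
escaped = raw.replace("\n", "").replace('"', '\\"')
print(escaped)
```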
In fact, all my projects seem empty of tasks.
multi_instance_support=True lets me run the pipeline again 👍
The second run prints out the same (non) "random" numbers as the first run
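e.g. a loop like this now goes through (a sketch; the argument values are made up, and multi_instance_support=True is set in the decorator as above):
```python
# Sketch: with multi_instance_support=True the decorated pipeline function can
# be called more than once from the same process.
for i_dataset in range(2):
    fastai_image_classification_pipeline(
        run_tags=["test"],
        i_dataset=i_dataset,
        backbone_names=["resnet34"],
        image_resizes=[224],
        batch_sizes=[16],
        num_train_epochs=1,
    )
```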
Switching the base image seems to have failed with the following error:
2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locally
attached is a pipeline task log file