Reputation
Badges 1
166 × Eureka!I had several pipeline components getting it and uploading files to is concurrently.
Can Datsets handle that?
I can find the tasks in the "all experiments" project but there are over 500 tasks there (I guess in includes the archived tasks as well) so that's not much help.
I'm looking for a minimal set of permissions because we have other sensitive ec2 instances running in the same account and our IT people are rightfully concerned about providing access to that account externally.
Thanks TimelyPenguin76 .
From your reply I understand that I have control over what the destination is but that all files generated in a task get transferred regardless of the return_values
decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like to have better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in clearml and for matplotlib figures to be uploaded (perhaps to gcs) but I'd like to avoid ...
I'm connecting to the hosted clear.ml
packages in use are:# Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] clearml == 1.6.2 fastai == 2.7.5
in case it matters, I'm running this code in a jupyter notebook within a docker container (to keep things vell isolated). The /data
path is volume mapped to my local filesystem (and, in fact, already contains the dataset files, so the fastai call to untar_data should see the data there and return immediately)
That same make_data fu...
Note that if I change the component to return a regular meaningless string - "mock_path"
, the pipeline completes rather quickly and the dataset is not uploaded.
On the bright side, we started off with agents failing to run on VMs so this is progress 🙂
I'll try a more carefully checked run a bit later but I know it's getting a bit late in your time zone
I can try switching to gpu-enabled machines just to see if that path can be made to work but the services queue shouldn't need gpu so I hope we figure out running the pipeline task on cpu nodes
switching the base image seems to have failed with the following error :2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locally
attached is a pipeline task log file
did you mean that I was running in CPU mode? I'll tried both but I'll try cpu mode with that base docker image
yes
here is the true "my_pipeline" declaration:
` @PipelineDecorator.pipeline(
name="fastai_image_classification_pipeline",
project="lavi-testing",
target_project="lavi-testing",
version="0.2",
multi_instance_support="",
add_pipeline_tags=True,
abort_on_failure=True,
)
def fastai_image_classification_pipeline(
run_tags: List[str],
i_dataset: int,
backbone_names: List[str],
image_resizes: List[int],
batch_sizes: List[int],
num_train_epochs: i...
nice, so a pipeline of pipelines is sort of possible. I guess that whole script can be run as a (remote) task?
erm,
this parallelization has led to the pipeline task issuing a bunch of:model_path/run_2022_07_20T22_11_15.209_0.zip , err: [Errno 28] No space left on device
and quitting on me.
my train_image_classifier_component
is programmed to save model files to a local path which is returned (and, thanks to clearml, the path's contents are zipped uploded to the files service).
I take it that these files are also brought into pipeline tasks's local disk?
Why is that? If that is indeed what...
I believe n1-standard-8
would work for that. I initially just tried going with the autoscaler defaults which has gpu on but that n1-standard-1
specified as the machine
The pipeline eventually completed after ~20 minutes and the log shows it has downloaded a 755mb file.
I can also download the zip file from the artifacts tab for the component now.
Why is the data being up/down loaded? Can I prevent that?
I get that clearml likes to take good care of my data but I must be doing something wrong here as it doesn't make sense for a dataset to be uploaded to files.clear.ml
.
Where was it running?
this message appears in the pipeline task's log. It is preceded by lines that reflect the storage manager downloading a corresponding zip file
I take it that these files are also brought into pipeline tasks's local disk?
Unless you changed the object, then no, they should not be downloaded (the "link" is passed)
The object is run_model_path
I don't seem to be changing it. I just pass it along from the training component to the evaluation compo...
Note that the same models files were previously also generated by a non-paralelized version of the same pipeline without the out-of-space error but a storage manager was downloading zip files in that version as well (maybe these files were downloaded and removed as the object reference counts went to 0?)
That is a good point, I'll make sure we mention it somewhere in the docs. Any thoughts on where?
maybe in (all of) these places:
https://clear.ml/docs/latest/docs/faq
https://clear.ml/docs/latest/docs/fundamentals/task
https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk
also - are there plans for the pipeline view to show artefacts (as in - links to things returned from components)
I should also mention I use clearml==1.6.3rc0
I'll do a clean relaunch of everything (scaler and pipeline)
I found that instead of returning some_returned_url
(which triggers zipping and saving of the filed under that url), I can wrap it in a dict: {"the url": some_returned_url}
which then lets me pass back the url to the pipeline and only that dict gets uploaded (e.g. {'run_datasets_path': Path('/data/my_datasets_path/run_id_1')}
) I can divert all files that I do want uploaded and tracked by clearml to gs://
by adding at start of task-fuction: ` Logger.current_logger().se...
on the same topic. What if (I were able to iterate and) I wanted the pipelines calls to be blocking so that the next pipeline executes only after the previous one completes?
I tried the first option and it worked 🙂 🙏
Hi Martin. See that ValueError
https://clearml.slack.com/archives/CTK20V944/p1657583310364049?thread_ts=1657582739.354619&cid=CTK20V944 Perhaps something else is going on?
What I think would be preferable is that the pipeline be deployed and that the python process that deployed it were allowed to continue on to whatever I had planned for it to do next (i.e. not exit)
This example seems to suffice
Perhaps I should mention that I use gs as my files service ( files_server:
gs://clearml-evaluation/ )
` from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes
@PipelineDecorator.component(
return_values=["large_file_path"], cache=False, task_type=TaskTypes.data_processing
)
def step_write(i: int):
import os
large_file_path = f"/tmp/out_path_{i}"
os.makedirs(large_file_path)
with open(f"{large_file_pa...