I now get this error:
2022-07-18 21:51:29,168 - clearml.storage - ERROR - Failed creating storage object
Reason: [Errno 2] No such file or directory: '~/gs.cred'
To be clear, I replaced <this is your GCP storage credentials file> with the contents of that file, escaping every " with a \" and removing newlines.
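For reference, this is how I understand the relevant section of clearml.conf is meant to look (a sketch based on the default template; the project, bucket and key-file path below are placeholders, and my reading is that credentials_json should be a path to an existing key file rather than the escaped file contents):
```
sdk {
    google.storage {
        # default credentials, used when no bucket-specific entry matches
        project: "my-gcp-project"                           # placeholder
        credentials_json: "/absolute/path/to/gcp-key.json"  # must exist on disk

        # optional per-bucket credentials
        credentials: [
            {
                bucket: "clearml-evaluation"
                project: "my-gcp-project"
                credentials_json: "/absolute/path/to/gcp-key.json"
            },
        ]
    }
}
```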
Ooh nice.
I wasn't aware task.models["output"] also acts like a dict.
I can get the one I care about in my code with something like task.models["output"]["best_model"].
However, can you see the inconsistency between the key and the name there:
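i.e. something like this minimal sketch (the task ID is a placeholder):
```
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder ID

# task.models maps "input"/"output" to the task's models, and the
# output collection can also be indexed by name, like a dict
best_model = task.models["output"]["best_model"]

# this is where the key vs. name mismatch shows up
print("key used  :", "best_model")
print("model name:", best_model.name)
```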
That job was using clearml 1.8.3 so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers defaults to the number of cores, but looking at the log it does seem like it's doing one chunk every 5 minutes (a long time for a 500 MB upload from a node running in GCP...)
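For reference, this is roughly the call I'm making (a sketch; names, paths and the bucket are placeholders, and I'm assuming Dataset.upload in the installed version accepts max_workers):
```
from clearml import Dataset

dataset = Dataset.create(
    dataset_name="random_dataset",       # placeholder
    dataset_project="lavi-testing",      # placeholder
)
dataset.add_files("./random_dataset")    # placeholder local folder

# try forcing a single upload worker to see whether the
# one-chunk-every-5-minutes behavior is tied to parallel workers
dataset.upload(
    output_url="gs://clearml-evaluation/",
    max_workers=1,
)
dataset.finalize()
```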
Cool. How can I get started with hyper datasets? Is it part of the clearml package?
Is it limited to https://clear.ml/pricing/ accounts?
I ran another version of the above code where output_uri="./random_dataset_local_target"
(i.e. dataset target on local disk instead of GCP).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folde...
I imagine that one workaround is to:
1. Disable automatic model uploads
2. Perform manual model upload (with the correct name).
Can you point me to how to do these?
Thanks,
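For my own notes, this is the kind of thing I imagine (a rough sketch, assuming Task.init's auto_connect_frameworks flag and a manual OutputModel upload; please correct me if there's a better way):
```
from clearml import Task, OutputModel

task = Task.init(
    project_name="lavi-testing",                  # placeholder names
    task_name="train_image_classifier",
    auto_connect_frameworks={"pytorch": False},   # 1. disable automatic model uploads
)

# ... training happens here and writes a local weights file ...

# 2. manually upload the one model I care about, with the name I want
output_model = OutputModel(task=task, name="best_model")
output_model.update_weights(weights_filename="model_path/best_model.pth")  # placeholder path
```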
Just to be clear, you are saying the "random" results are consistent over runs ?
yes !
By re-runs I mean re-running this script (not cloning the pipeline)
multi_instance_support=True lets me run the pipeline again 👍
The second run prints out the same (non) "random" numbers as the first run
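i.e. roughly this shape (a sketch; the component and pipeline here are stand-ins for the ones in my script, and multi_instance_support=True is the relevant bit):
```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["numbers"], cache=False)
def make_random_numbers(n: int):
    import random

    numbers = [random.random() for _ in range(n)]
    print(numbers)
    return numbers

@PipelineDecorator.pipeline(
    name="random_pipeline",          # stand-in name
    project="lavi-testing",
    version="0.1",
    multi_instance_support=True,     # this is what lets me call the pipeline more than once
)
def random_pipeline(n: int = 5):
    make_random_numbers(n)

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    random_pipeline()
    random_pipeline()  # in my runs, this second call printed the same (non) "random" numbers
```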
TimelyPenguin76 , this turned out to be the reason I was having locking issues https://clearml.slack.com/archives/CTK20V944/p1658761943458649 :
SweetBadger76 , CostlyOstrich36 : I've attempted essentially the same thing before https://clearml.slack.com/archives/CTK20V944/p1657124102133519 and I thought it had worked in the past so I'm not sure why it is failing me now.
In fact, all my projects seem empty of tasks.
I suppose one way to perform this is with a scheduler ( https://clear.ml/docs/latest/docs/references/sdk/scheduler ) that kicks off a health-check task (checking the exit state of executed tasks). It seems more efficient to support a triggered response to task failure.
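By the scheduler route I mean something like this sketch (the health-check task ID, queue and schedule are placeholders, and the hour/minute fields reflect my reading of the cron-like API, which I'd double-check against the docs):
```
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()

# re-launch a pre-existing "health check" task on a schedule;
# that task would query recently executed tasks and inspect their exit state
scheduler.add_task(
    schedule_task_id="<health-check-task-id>",  # placeholder
    queue="default",                            # placeholder queue
    minute=30,
    hour=9,                                     # intended: daily at 09:30
    recurring=True,
)

scheduler.start_remotely(queue="services")
```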
I'll try a more carefully checked run a bit later but I know it's getting a bit late in your time zone
switching the base image seems to have failed with the following error:
2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locally
attached is a pipeline task log file
yes
here is the true "my_pipeline" declaration:
```
@PipelineDecorator.pipeline(
    name="fastai_image_classification_pipeline",
    project="lavi-testing",
    target_project="lavi-testing",
    version="0.2",
    multi_instance_support="",
    add_pipeline_tags=True,
    abort_on_failure=True,
)
def fastai_image_classification_pipeline(
    run_tags: List[str],
    i_dataset: int,
    backbone_names: List[str],
    image_resizes: List[int],
    batch_sizes: List[int],
    num_train_epochs: i...
```
nice, so a pipeline of pipelines is sort of possible. I guess that whole script can be run as a (remote) task?
erm,
this parallelization has led to the pipeline task issuing a bunch of: model_path/run_2022_07_20T22_11_15.209_0.zip, err: [Errno 28] No space left on device
and quitting on me.
my train_image_classifier_component is programmed to save model files to a local path which is returned (and, thanks to clearml, the path's contents are zipped and uploaded to the files service).
I take it that these files are also brought onto the pipeline task's local disk?
Why is that? If that is indeed what...
Where was it running?
this message appears in the pipeline task's log. It is preceded by lines that reflect the storage manager downloading a corresponding zip file
I take it that these files are also brought onto the pipeline task's local disk?
Unless you changed the object, then no, they should not be downloaded (the "link" is passed)
The object is run_model_path
I don't seem to be changing it. I just pass it along from the training component to the evaluation compo...
Note that the same model files were previously also generated by a non-parallelized version of the same pipeline without the out-of-space error, but a storage manager was downloading zip files in that version as well (maybe these files were downloaded and removed as the object reference counts went to 0?)
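To illustrate how the path gets passed around (a schematic sketch; eval_model_component and the local paths are stand-ins, the real components are quoted further down):
```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["run_model_path", "run_tb_path"], cache=False)
def train_image_classifier_component(backbone_name: str):
    # (training elided) models are saved under a local path which is returned;
    # clearml zips and uploads the folder contents for me
    run_model_path = f"models/{backbone_name}"
    run_tb_path = f"tb/{backbone_name}"
    return run_model_path, run_tb_path

@PipelineDecorator.component(cache=False)
def eval_model_component(run_model_path: str):
    # run_model_path is only read here; the object itself is not modified
    print(f"evaluating models under {run_model_path}")

@PipelineDecorator.pipeline(
    name="fastai_image_classification_pipeline", project="lavi-testing", version="0.2"
)
def fastai_image_classification_pipeline(backbone_name: str):
    run_model_path, _run_tb_path = train_image_classifier_component(backbone_name)
    eval_model_component(run_model_path)  # passed along unchanged
```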
That is a good point, I'll make sure we mention it somewhere in the docs. Any thoughts on where?
maybe in (all of) these places:
https://clear.ml/docs/latest/docs/faq
https://clear.ml/docs/latest/docs/fundamentals/task
https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk
also - are there plans for the pipeline view to show artefacts (as in, links to things returned from components)?
I should also mention I use clearml==1.6.3rc0
on the same topic. What if (I were able to iterate and) I wanted the pipeline calls to be blocking so that the next pipeline executes only after the previous one completes?
I tried the first option and it worked 🙂 🙏
What I think would be preferable is that the pipeline be deployed and that the python process that deployed it be allowed to continue on to whatever I had planned for it to do next (i.e. not exit)
This example seems to suffice
Perhaps I should mention that I use gs as my files service ( files_server: gs://clearml-evaluation/ )
```
from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes

@PipelineDecorator.component(
    return_values=["large_file_path"], cache=False, task_type=TaskTypes.data_processing
)
def step_write(i: int):
    import os

    large_file_path = f"/tmp/out_path_{i}"
    os.makedirs(large_file_path)
    with open(f"{large_file_pa...
```
Two values:
```
@PipelineDecorator.component(
    return_values=["run_model_path", "run_tb_path"],
    cache=False,
    task_type=TaskTypes.training,
    packages=[
        "clearml",
        "tensorboard_logger",
        "timm",
        "fastai",
        "torch==1.11.0",
        "torchvision==0.12.0",
        "protobuf==3.19.*",
        "tensorboard",
        "google-cloud-storage>=1.13.2",
    ],
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
)
def train_ima...
```
thanks for explaining it. Makes sense 👍 I'll give it a try