PanickyMoth78

34 Questions, 167 Answers

Active since 10 January 2023

Last activity 5 months ago

Reputation

Badges 1

166 × Eureka!

Answers 167

0 Hi (Again... Sorry For Asking So Many Questions) Question About Using Google Cloud Storage In A Clearml Agent Running In Aws Ec2 Instance. My

I now get this error:
2022-07-18 21:51:29,168 - clearml.storage - ERROR - Failed creating storage object Reason: [Errno 2] No such file or directory: '~/gs.cred'
to be clear, I replaced <this is your GCP storage credentials file> with the contents of that file, escaping every " with a \" and removing newlines.

2 years ago

0 Hi. I'M Encountering A Problem With

Ooh nice.
I wasn't aware task.models["output"] also acts like a dict.
I can get the one I care about in my code with something like task.models["output"]["best_model"]
however can you see the inconsistency between the key and the name there:

2 years ago

0 Hi. I Have A Job That Processes Images And Creates ~5 Gb Of Processed Image Files (Lots Of Small Ones). At The End - It Creates A

That job was using clearml 1.8.3 so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers = number of cores but looking at the log it does seem like it's doing one chunk every 5 minutes (long time for 500mb upload for a node running in gcp...)

2 years ago

0 Hi. Looking Into Clearml Support For Datasets, I'D Like To Understand How To Work With Large Datasets And Cases Where Not All The Data Is Downloaded At Once. (E.G. 1. Each Training Epoch Is Performed On A (Preferably Random) Sample Of The Data That Is Dow

cool. How can I get started with hyper datasets? is it part of the clearml package?
Is it limited to https://clear.ml/pricing/?gclid=Cj0KCQjw5ZSWBhCVARIsALERCvzehkqVOiqJPaum5fsVyyTNMKce91PBHZd1IhQpEFaKvV7toze2A_0aAgXXEALw_wcB accounts?

2 years ago

0 Hi. I Have A Job That Processes Images And Creates ~5 Gb Of Processed Image Files (Lots Of Small Ones). At The End - It Creates A

I ran another version of the above code where
output_uri="./random_dataset_local_target"
(i.e. db target on local disk instead of gcp).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folde...

2 years ago

0 Hi. I'M Encountering A Problem With

I imagine that one workaround is to
Disable automatic model uploads Perform manual model upload (with the correct name).Can you point me to how to do these?

2 years ago

0 Hi. I Have A Few Questions About The Snippet Attached

Thanks,

Just to be clear, you are saying the "random" results are consistent over runs ?

yes !
By re-runs I mean re-running this script (not cloning the pipeline)

2 years ago

0 Hi. I Have A Few Questions About The Snippet Attached

multi_instance_support=True lets me run the pipeline again 👍
The second run prints out the same (non) "random" numbers as the first run

2 years ago

0 Hi. I Am Experimenting With

TimelyPenguin76 , this turned out to be the reason I was having locking issues https://clearml.slack.com/archives/CTK20V944/p1658761943458649 :
SweetBadger76 , CostlyOstrich36 : I've attempted essentially the same thing before https://clearml.slack.com/archives/CTK20V944/p1657124102133519 and I thought it had worked in the past so I'm not sure why it is failing me now.

2 years ago

0 Hi. I Am Experimenting With

I'm on clearml==1.6.3rc1

2 years ago

0 I Have 5 Unarchived Pipeline Runs That Were Defined With This Decorator:

In fact, all my projects seems empty of tasks.

2 years ago

0 Is There Some Built-In Way In Clearml To Trigger Further Action On Task Fail (Or Pipeline Fail)?

I suppose one way to perform this is with a https://clear.ml/docs/latest/docs/references/sdk/scheduler that kicks off a health check task (check exit state of executed tasks). It seems more efficient to support a triggered response to task fail.

2 years ago

0 Hi. I'D Like To Try The Gcp Autoscaler.

I'll try a more carefully checked run a bit later but I know it's getting a bit late in your time zone

2 years ago

0 Hi. I'D Like To Try The Gcp Autoscaler.

switching the base image seems to have failed with the following error :
2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locallyattached is a pipeline task log file

2 years ago

0 Another Question On The Topic Of How A Remote Execution Of A Pipeline Kills The Calling Process (Previously Discussed

yes
here is the true "my_pipeline" declaration:
` @PipelineDecorator.pipeline(
name="fastai_image_classification_pipeline",
project="lavi-testing",
target_project="lavi-testing",
version="0.2",
multi_instance_support="",
add_pipeline_tags=True,
abort_on_failure=True,
)
def fastai_image_classification_pipeline(
run_tags: List[str],
i_dataset: int,
backbone_names: List[str],
image_resizes: List[int],
batch_sizes: List[int],
num_train_epochs: i...

2 years ago

0 Another Question On The Topic Of How A Remote Execution Of A Pipeline Kills The Calling Process (Previously Discussed

nice, so a pipeline of pipelines is sort of possible. I guess that whole script can be run as a (remote) task?

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

erm,
this parallelization has led to the pipeline task issuing a bunch of:
model_path/run_2022_07_20T22_11_15.209_0.zip , err: [Errno 28] No space left on deviceand quitting on me.
my train_image_classifier_component is programmed to save model files to a local path which is returned (and, thanks to clearml, the path's contents are zipped uploded to the files service).

I take it that these files are also brought into pipeline tasks's local disk?
Why is that? If that is indeed what...

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

👍

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

Where was it running?

this message appears in the pipeline task's log. It is preceded by lines that reflect the storage manager downloading a corresponding zip file

I take it that these files are also brought into pipeline tasks's local disk?

Unless you changed the object, then no, they should not be downloaded (the "link" is passed)

The object is run_model_path
I don't seem to be changing it. I just pass it along from the training component to the evaluation compo...

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

Note that the same models files were previously also generated by a non-paralelized version of the same pipeline without the out-of-space error but a storage manager was downloading zip files in that version as well (maybe these files were downloaded and removed as the object reference counts went to 0?)

2 years ago

0 Hi. I Have A Few Questions About The Snippet Attached

That is a good point, I'll make sure we mention it somewhere in the docs. Any thoughts on where?

maybe in (all of) these places:
https://clear.ml/docs/latest/docs/faq
https://clear.ml/docs/latest/docs/fundamentals/task
https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk

2 years ago

0 Hi. Suppose I Want To Report On What My Task Has Done By Having It Generate A Markdown (.Md) File With Links To Some "Local" Figure Files. Looking At The Reporting Documentation, The Closest Thing I Found Is The

also - are there plans for the pipeline view to show artefacts (as in - links to things returned from components)

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

I should also mention I use clearml==1.6.3rc0

2 years ago

0 Another Question On The Topic Of How A Remote Execution Of A Pipeline Kills The Calling Process (Previously Discussed

on the same topic. What if (I were able to iterate and) I wanted the pipelines calls to be blocking so that the next pipeline executes only after the previous one completes?

2 years ago

0 I Am Using The Aws Autoscaler And I Wish To Set My Files Server To Be Gs. I Tried To Do So By Having This In The Additional Clearml Configuration Window:

I tried the first option and it worked 🙂 🙏

2 years ago

0 Hi There. I'M Trying To Switch Pipeline Code From A Local Run Using

What I think would be preferable is that the pipeline be deployed and that the python process that deployed it were allowed to continue on to whatever I had planned for it to do next (i.e. not exit)

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

This example seems to suffice
Perhaps I should mention that I use gs as my files service ( files_server: gs://clearml-evaluation/ )
` from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes

@PipelineDecorator.component(
return_values=["large_file_path"], cache=False, task_type=TaskTypes.data_processing
)
def step_write(i: int):
import os

large_file_path = f"/tmp/out_path_{i}"
os.makedirs(large_file_path)
with open(f"{large_file_pa...

2 years ago

0 Hi. Question About Dataset Upload Errors: When Uploading A

🙏

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

Two values:
`
@PipelineDecorator.component(
return_values=["run_model_path", "run_tb_path"],
cache=False,
task_type=TaskTypes.training,
packages=[
"clearml",
"tensorboard_logger",
"timm",
"fastai",
"torch==1.11.0",
"torchvision==0.12.0",
"protobuf==3.19.*",
"tensorboard",
"google-cloud-storage>=1.13.2",
],
repo="git@github.com:shpigi/clearml_evaluation.git",
repo_branch="main",
)
def train_ima...

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

thanks for explaining it. Makes sense 👍 I'll give it a try

2 years ago

Show more results