PanickyMoth78

34 Questions, 167 Answers

Active since 10 January 2023

Last activity 5 months ago

Reputation

Badges 1

166 × Eureka!

Answers 167

0 Another Question On The Topic Of How A Remote Execution Of A Pipeline Kills The Calling Process (Previously Discussed

nice, so a pipeline of pipelines is sort of possible. I guess that whole script can be run as a (remote) task?

2 years ago

0 Hi. I'M Encountering A Problem With

BTW:

If I try to find the right model in the

task.models["output"]

(this time there is just one but in my code there may be several) it appears with the

(see other attached screenshot).

What would make sense here ? (I have to be honest I'm not sure).

If the model was saved with a file name (is that the trigger for auto-upload?), I think it makes sense for the model name to match the file name (not the task name), especially when there may be ...

2 years ago

0 Hi. I'M Encountering A Problem With

sort of. Though it seems like the rules for model.name can be a bit non-obvious.
I think that the first model saved gets the task name as its name and the following models take f"{task_name} - {file_name}"
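
To make that guess concrete, here is a small sketch of the naming rule as I understand it. It is plain Python and does not call ClearML at all; `guess_model_names` is a hypothetical helper, and the rule it encodes is only what I observed, not documented behaviour.

```python
def guess_model_names(task_name, file_names):
    """Return the model names I would expect, one per saved file, in save order.

    Assumption (observed, not documented): the first saved model is named
    after the task, later ones are named "<task_name> - <file_name>".
    """
    names = []
    for index, file_name in enumerate(file_names):
        if index == 0:
            names.append(task_name)
        else:
            names.append(f"{task_name} - {file_name}")
    return names


print(guess_model_names("train_task", ["model_a.pt", "model_b.pt"]))
```

If the rule really does depend on save order, the same file could end up with a different name from run to run, which is another reason to look models up by key rather than by name.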

2 years ago

0 Hi. I'M Encountering A Problem With

To be specific, there is the "model name", which is not unique, and there is the model key, which is unique to the Task

not sure why the two fields don't simply match. I guess that there may be situations where file name (without the full path) may be used several times.

2 years ago

0 Hi. I'M Encountering A Problem With

anyhow - looks like the keys are simple enough to use (so I can just ignore the model names)

2 years ago

0 Hi. I Have A Job That Processes Images And Creates ~5 Gb Of Processed Image Files (Lots Of Small Ones). At The End - It Creates A

Q: is there an equivalent env var for sdk.google.storage.pool_connections/pool_maxsize ? My jobs are running remotely and not within a clearml agent at the moment so they get clearml config through env vars.

2 years ago

0 Autoscaler Parallelization Issue: I Have An Aws Autoscaler Set Up With A Resource That Has A Max Of 3 Instances Assigned To The

sys.path.insert(0, "/src/clearml_evaluation/") is actually left-over code from when I was making things run locally (perhaps prior to connecting to github repo) but I think that adding a non-existent path to the system path would be benign

2 years ago

0 Hi I'M Looking Into How Clearml Supports Datasets And Dataset Versioning And I'M A Bit Confused. Is Dataset Versioning Not Supported At All In The Non-Enterprise Or Is Versioning Available By A Different Mechanism? I See That

thanks. Seems like I was on the right path. Do datasets specified as parents need to be finalized ( https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset )?

2 years ago

uploads are a bit slow though (~4 minutes for 50mb)

2 years ago

This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old as parent:
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each with 500 image files
I then run the code here to make the datasets.
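
For anyone reading without the original attachment, a minimal sketch of the idea (not the exact code I ran) could look like this. The project and dataset names are made up, `chunk_folders` and `build_versions` are hypothetical helpers, and I finalize each dataset before using it as a parent on the assumption that parents need to be finalized. The ClearML import sits inside the function so the folder helper can be tried without a server.

```python
from pathlib import Path


def chunk_folders(root, count):
    """Folders pets_000 .. pets_<count-1> under root, one per data increment."""
    return [Path(root) / f"pets_{i:03d}" for i in range(count)]


def build_versions(root, count, project="pets_demo", name="pets_images"):
    """Create one dataset per folder, each using the previous one as parent.

    Needs a configured ClearML server; project and name are placeholders.
    """
    from clearml import Dataset  # imported lazily, only needed here

    parent_id = None
    dataset_ids = []
    for folder in chunk_folders(root, count):
        dataset = Dataset.create(
            dataset_name=name,
            dataset_project=project,
            parent_datasets=[parent_id] if parent_id else None,
        )
        dataset.add_files(path=str(folder))
        dataset.upload()
        dataset.finalize()  # assumption: a parent must be finalized first
        parent_id = dataset.id
        dataset_ids.append(dataset.id)
    return dataset_ids
```

Calling `build_versions("data", 16)` would then produce a chain of 16 datasets, each adding 500 files on top of its parent.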

2 years ago

oops, I deleted two messages here because I had a bug in a test I've done.
I'm retesting now

2 years ago

0 I Started Two Pipelines (Using Aws Autoscaler In App.Clear.Ml ). The Pipelines Ran Concurrently, Using The Same Pipeline Code. Both Failed In The Same Component Half-Way Though The Pipeline Run With:

I have tried this several times now. Sometimes one runs and the other fails, and sometimes both fail with this same error

2 years ago

the component is called twice in the pipeline using a ThreadPoolExecutor to parallelize training steps
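
Roughly, the calling pattern is the one below. This is a sketch only: `train_step` is a hypothetical stand-in for the real pipeline component, and the backbone names are just examples.

```python
from concurrent.futures import ThreadPoolExecutor


def train_step(backbone_name):
    """Hypothetical stand-in for the training component call."""
    return f"trained:{backbone_name}"


def run_parallel(backbones):
    """Submit one training call per backbone and collect results in order."""
    workers = max(1, len(backbones))  # max_workers must be at least 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_step, backbones))


results = run_parallel(["resnet18", "resnet34"])
```

`pool.map` returns results in the order the inputs were given, so the two runs can be matched back to their backbones even if they finish out of order.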

2 years ago

Hey Alon,
See
https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1 that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.

2 years ago

TimelyPenguin76 , Could the problem be related to an error in the log of the previous step (which completed successfully)?
2022-07-26 04:25:56,923 - clearml.Task - INFO - Waiting to finish uploads
2022-07-26 04:26:01,447 - clearml.storage - ERROR - Failed uploading: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/clearml-evaluation/o?uploadType=multipart (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_M...

2 years ago

Unfortunately, waiting a while did not make this go away 🙂

2 years ago

Restarting the autoscaler, instances and a running single pipeline - I still get the same error.
clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable

2 years ago

the same occurs when I run a single training component instead of two

2 years ago

switching back to version 1.6.2 cleared this issue (but reintroduced others, for which I had been using the release candidate)

2 years ago

I get the same error with those added lines

2 years ago

now trying with added lines as Alon suggested:
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    clearml_dataset,
    backbone_name,
    image_resize: int,
    batch_size: int,
    run_model_uri,
    run_tb_uri,
    local_data_path,
    num_epochs: int,
)...

2 years ago

Another issue, which may or may not be related.
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All experiment" view show the component as running.
The component's console shows that the last command was a docker run command

2 years ago

(I'm going to stop the autoscaler, terminate all the instances and clone the autoscaler and retry it all from the beginning)

2 years ago

here is the log from the failing component:
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/locks/portalocker.py", line 140, in lock
    fcntl.flock(file_.fileno(), flags)
BlockingIOError: [Errno 11] Resource temporarily unavailable
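
For context, this is the error `fcntl.flock` raises when a non-blocking lock is requested on a file that is already locked through another open handle. The sketch below reproduces it with the standard library only. I am assuming the flags in the traceback include a non-blocking exclusive lock; the temporary file is just a stand-in for whatever lock file clearml uses, and `try_lock_twice` is a hypothetical helper. It needs a Unix-like system with a local temp directory.

```python
import errno
import fcntl
import tempfile


def try_lock_twice():
    """Lock a file, then try to lock it again through a second handle.

    Returns the errno of the second attempt, or None if it succeeded.
    """
    with tempfile.NamedTemporaryFile() as tmp:
        first = open(tmp.name, "w")
        second = open(tmp.name, "w")
        try:
            # The first non-blocking exclusive lock succeeds.
            fcntl.flock(first.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            try:
                # The second handle cannot get the lock and does not wait.
                fcntl.flock(second.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError as exc:
                return exc.errno
            return None
        finally:
            first.close()
            second.close()


code = try_lock_twice()
print(code, errno.errorcode.get(code))
```

So the message itself only says that something else was holding the lock at that moment; it does not say which process or why.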

2 years ago

also - some issue on the autoscaler side:

2 years ago

correct

2 years ago

start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
...

2 years ago

also weirdly, the failed pipeline task is sometimes marked as failed and at other times it is marked as completed

2 years ago

0 Hi. Help

It seems to be doing ok on the app side:
I didn't realise Datasets had tasks associated with them but there is one and it seems to be doing ok.
I've attached its log file, which only mentions skipping one file (a warning)

2 years ago

0 Hi. I'M Running This Little Pipeline:

The pipeline eventually completed after ~20 minutes and the log shows it has downloaded a 755mb file.
I can also download the zip file from the artifacts tab for the component now.
Why is the data being up/down loaded? Can I prevent that?
I get that clearml likes to take good care of my data but I must be doing something wrong here as it doesn't make sense for a dataset to be uploaded to files.clear.ml .

2 years ago
