Also - some issue on the autoscaler side:
(I'm going to stop the autoscaler, terminate all the instances, clone the autoscaler, and retry everything from the beginning)
Hey Alon,
See
https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1
that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.
Another issue, which may or may not be related:
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All Experiments" view show the component as running.
The component's console shows that the last command was a docker run command.
I tried playing with those parameters on my laptop to no great effect.
Here is code you can use to reproduce the issue:
```python
import os
from pathlib import Path

from tqdm import tqdm

from clearml import Dataset, Task


def dataset_upload_test(project_id: str, bucket_name: str):
    def _random_file(fpath, sizekb):
        file_size_in_bytes = 1024 * sizekb
        with open(fpath, "wb") as fout:
            fout.write(os.urandom(file_size_in_bytes))

    def random_dataset(dataset_path, num_files, file...
```
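For reference, the truncated helpers can be fleshed out into a runnable, stdlib-only sketch. The `random_dataset` signature is cut off above, so its parameters beyond the first two are assumptions here:

```python
import os
import tempfile
from pathlib import Path


def _random_file(fpath, sizekb):
    # Write `sizekb` kilobytes of random bytes to `fpath`
    file_size_in_bytes = 1024 * sizekb
    with open(fpath, "wb") as fout:
        fout.write(os.urandom(file_size_in_bytes))


def random_dataset(dataset_path, num_files, file_size_kb=1):
    # Fill a directory with `num_files` random files (file_size_kb is assumed;
    # the original signature is truncated)
    dataset_path = Path(dataset_path)
    dataset_path.mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        _random_file(dataset_path / f"file_{i:04d}.bin", file_size_kb)


if __name__ == "__main__":
    tmp = Path(tempfile.mkdtemp())
    random_dataset(tmp, num_files=5, file_size_kb=2)
    files = sorted(tmp.iterdir())
    print(len(files), files[0].stat().st_size)  # 5 files of 2048 bytes each
```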
Thanks,
Just to be clear, you are saying the "random" results are consistent over runs ?
yes !
By re-runs I mean re-running this script (not cloning the pipeline)
I think this should be a valid use of pipelines. For example, at some step I choose to sweep across several values of some parameter, and the rest of the steps are duplicated for each value of that parameter.
The additional edges in the graph suggest that these steps somehow contain dependencies that I do not wish them to have.
These paths are pathlib.Path objects. Would that be a problem?
It seems to be doing ok on the app side:
I didn't realise Datasets had tasks associated with them but there is one and it seems to be doing ok.
I've attached its log file, which only mentions skipping one file (a warning).
start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function
```python
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    ...
```
AgitatedDove14
Adding repo and repo_branch to the pipeline.component decorator worked (and I can move on to my next issue 🙂 ).
I'm still unclear on why cloning the repo in use happens automatically for the pipeline task and not for component tasks.
or, barring that, something similar on AWS?
In case anyone else is interested. We found two alternative solutions:
Repeating the first steps, but from within a Docker container ( docker run -it --rm python:3.9 bash ), worked for me. Alternatively...
The example tasks (or at least those I've tried) that appear in the ClearML examples within a new workspace have clearml==0.17.5
(an old clearml version) listed in "INSTALLED PACKAGES". Updating the clearml package within the task to 1.5.0
let me run the clearml-agent daemon lo...
Here is the code in text if you feel like giving it a try:

```python
import tensorboard_logger as tb_logger
from clearml import Task

task = Task.init(project_name="great project", task_name="test_tb_logging")
task_tb_logger = tb_logger.Logger(logdir='./tb/run1', flush_secs=2)
for i in range(10):
    task_tb_logger.log_value("some_metric", 42, i)
task.close()
```
Would you expect this fastai callback to work?
(Uses SummaryWriter):
https://github.com/fastai/fastai/blob/d7f4863f1ee3c0fa9f2d9feeb6a05f0625a53696/fastai/callback/tensorboard.py
It seems to have failed as well (but I'd need to check more carefully)
thanks. Switching to SummaryWriter shouldn't be hard for us.
Thanks AgitatedDove14
Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
My main concern now is that this may happen within a pipeline, leading to unreliable data handling.
If Dataset.upload()
does not crash or return a success value that I can check, and if Dataset.get_local_copy()
also does not complain as it retrieves partial data - how will I ever know that I lost part of my dataset?
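One way to notice a silently-partial download, independent of clearml entirely, is to compare per-file checksums between the source directory and the retrieved copy. A minimal stdlib-only sketch (the function names are mine, not part of any API):

```python
import hashlib
from pathlib import Path


def dir_digests(root):
    # Map each file's path (relative to root) to its SHA-256 digest
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(src_dir, copy_dir):
    # Return (missing, corrupted): files absent from the copy, and files
    # whose contents differ from the source
    src, copy = dir_digests(src_dir), dir_digests(copy_dir)
    missing = sorted(set(src) - set(copy))
    corrupted = sorted(k for k in src.keys() & copy.keys() if src[k] != copy[k])
    return missing, corrupted
```

Running this after get_local_copy() would at least turn a silent data loss into a detectable one.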
maybe this line should take a timeout argument?
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/clearml/storage/helper.py#L1834
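The general pattern the suggestion points at can be sketched with stdlib concurrent.futures: wait on worker threads with a deadline and surface anything that didn't finish, instead of blocking indefinitely. This is not clearml's actual helper code, just an illustration:

```python
import concurrent.futures as cf


def run_with_timeout(tasks, timeout, max_workers=4):
    # tasks: list of (name, zero-arg callable) pairs.
    # Returns (done_names, unfinished_names) after waiting at most `timeout`
    # seconds, rather than waiting forever on stuck workers.
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in tasks}
        done, not_done = cf.wait(futures, timeout=timeout)
        for f in not_done:
            f.cancel()  # best effort; already-running tasks can't be interrupted
        return [futures[f] for f in done], [futures[f] for f in not_done]
```

A caller could then treat a non-empty unfinished list as an upload failure and retry or raise, rather than proceeding with partial data.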
> If Dataset.upload() does not crash or return a success value that I can check and

Are you saying that with this error showing, the upload does not crash?
Unfortunately that is correct. It continues as if nothing happened!
To replicate this on Linux (even with max_workers=1
):
Follow https://averagelinuxuser.com/limit-bandwidth-linux/ to throttle your connection: sudo apt-get install wondershaper
Throttle your connection to 1mb/s with somethin...
I can't find version 1.8.1rc1, but I believe I see a relevant change in the code of Dataset.upload in 1.8.1rc0.
There may be cases where failure occurs before my code starts to run (and, perhaps, after it completes)
I suppose one way to perform this is with a scheduler ( https://clear.ml/docs/latest/docs/references/sdk/scheduler ) that kicks off a health-check task (checking the exit state of executed tasks). It seems more efficient to support a triggered response to task failure.
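The health-check task itself could be as simple as scanning recent tasks' exit states and flagging failures. A stdlib-only sketch, where `fetch_tasks` stands in for whatever actually queries the server (the status names here are hypothetical placeholders, not clearml's exact values):

```python
def check_task_health(fetch_tasks, failed_states=("failed", "aborted")):
    # fetch_tasks: zero-arg callable returning a list of task dicts,
    # e.g. [{"id": "...", "status": "..."}]. Stands in for a real server query.
    return [t for t in fetch_tasks() if t.get("status") in failed_states]


def report_failures(fetch_tasks, on_failure):
    # Invoke `on_failure` (alerting, re-queueing, etc.) once per failed task
    failed = check_task_health(fetch_tasks)
    for task in failed:
        on_failure(task)
    return len(failed)
```

A scheduler would then just run report_failures periodically; a triggered response, as suggested, would avoid the polling interval entirely.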