On the bright side, we started off with agents failing to run on VMs, so this is progress 🙂
also, whereas the pipeline agent's log has:
`Executing task id [7a0ad1fb243a4ff3b9e6c477442ded4a]: repository = git@github.com:shpigi/clearml_evaluation.git branch = main version_num = e045904094cf2f4fa61ce92f7b91682f5de64ab8`
The component agent's log has:
`Executing task id [90de043e354b4b28a84d5cc0788fe63c]: repository = branch = version_num =`
any news on this? I also got a similar issue
For me the problem sort of went away. My code evolved a bit after posting this so that dataset creation and training tasks run in separate python sessions. I did not investigate further.
I mean that it was uploading console logs, scalar plots, and images fine just a while ago, and then it seems to have stopped uploading all scalar plot metrics and figures, but log upload was still fine.
Anyway, it is back to working properly now without any code change (as far as I can tell; I tried commenting out a line or two and then brought them all back)
If I end up with something reproducible I'll post here.
Switching back to version 1.6.2 cleared this issue (but re-introduced others, for which I have been using the release candidate)
did you mean that I was running in CPU mode? I tried both, but I'll try CPU mode with that base docker image.
Thanks,
Just to be clear, you are saying the "random" results are consistent over runs?
yes!
By re-runs I mean re-running this script (not cloning the pipeline)
...start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function:
```
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    ...
```
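For context, the wrapping pattern is roughly this (a minimal sketch; `train_image_classifier` and its arguments are placeholders, not the actual code from the repo):

```python
from clearml import TaskTypes
from clearml.automation.controller import PipelineDecorator


def train_image_classifier(dataset_path, backbone_name, batch_size):
    # Placeholder for the plain (non-component) training function;
    # it knows nothing about ClearML.
    run_model_path = f"models/{backbone_name}"
    run_info = {"batch_size": batch_size}
    return run_model_path, run_info


@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(dataset_path, backbone_name, batch_size):
    # The component only forwards its arguments to the plain training
    # function, so the training code itself stays ClearML-agnostic.
    return train_image_classifier(dataset_path, backbone_name, batch_size)
```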
This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old as parent:
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each with 500 image files
I then run the code here to make the datasets.
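The gist of the dataset-creation code (a sketch assuming the standard clearml Dataset API; the project/dataset names and folder path are illustrative, not the exact ones from my script):

```python
from clearml import Dataset

# Fetch the current latest version of the dataset.
parent = Dataset.get(dataset_project="lavi-testing", dataset_name="pets")

# New "version": a child dataset with the previous one as parent,
# containing only the newly added folder of files.
child = Dataset.create(
    dataset_name="pets",
    dataset_project="lavi-testing",
    parent_datasets=[parent],
)
child.add_files("pets_001")
child.upload()
child.finalize()
```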
Another issue, which may or may not be related.
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All experiments" view show the component as running.
The component's console shows that the last command was a docker run command.
feature request: tell me what gets passed along each edge of the pipeline graph
thanks. Switching to SummaryWriter shouldn't be hard for us.
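The kind of usage I mean (a minimal sketch; assumes ClearML's automatic TensorBoard binding picks up scalars logged through SummaryWriter once Task.init has run):

```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name="lavi-testing", task_name="summarywriter_smoke_test")

writer = SummaryWriter(log_dir="runs/smoke_test")
for step in range(100):
    # Scalars written here should be auto-captured by ClearML
    # and appear under the task's Scalars tab.
    writer.add_scalar("loss/train", 1.0 / (step + 1), step)
writer.close()
```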
AgitatedDove14
Adding repo and repo_branch to the pipeline.component decorator worked (and I can move on to my next issue 🙂).
I'm still unclear on why cloning the repo in use happens automatically for the pipeline task and not for component tasks.
I've also not figured out how to modify the examples above to wait for one pipeline to end before the next begins
I'll try and reproduce this in simpler code
Thanks TimelyPenguin76 .
From your reply I understand that I have control over what the destination is, but that all files generated in a task get transferred regardless of the return_values decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like to have better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in clearml and for matplotlib figures to be uploaded (perhaps to gcs) but I'd like to avoid ...
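Something like this is what I'm after (a sketch assuming Task.init's auto_connect_frameworks and output_uri arguments; the bucket and the exact framework keys are illustrative):

```python
from clearml import Task

task = Task.init(
    project_name="lavi-testing",
    task_name="training",
    # Hypothetical GCS destination for whatever does get uploaded.
    output_uri="gs://my-bucket/clearml",
    auto_connect_frameworks={
        "tensorboard": True,   # keep capturing TB events
        "matplotlib": True,    # keep uploading figures
        "pytorch": False,      # but don't auto-upload every saved model file
    },
)
```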
I've already tried restarting my laptop (and the docker container where my code is running)
nice, so a pipeline of pipelines is sort of possible. I guess that whole script can be run as a (remote) task?
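Something like this orchestration script is what I have in mind (a sketch; my_pipelines, first_pipeline, and second_pipeline are placeholders, and I'm assuming each decorated pipeline call only returns once that pipeline has finished):

```python
from clearml import Task

# Placeholders for two @PipelineDecorator.pipeline-decorated
# functions defined in another module.
from my_pipelines import first_pipeline, second_pipeline

# The orchestration script is itself a task, so it could be run remotely.
task = Task.init(project_name="lavi-testing", task_name="pipeline_of_pipelines")

first_pipeline(i_dataset=1)   # assumed to block until the pipeline ends
second_pipeline(i_dataset=1)  # so this starts strictly after the first
print("both pipelines completed")
```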
thanks. Seems like I was on the right path. Do datasets specified as parents need to be finalized (https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset)?
multi_instance_support=True lets me run the pipeline again 👍
The second run prints out the same (non) "random" numbers as the first run
Yeah. I was only using the task for the process of creating the dataset.
My code does start out with a step that checks for the existence of the dataset, returning it if it exists (search by project name/dataset name/version) rather than recreating it.
I noticed the name mismatch when that check kept failing me...
I think that init-ing the encompassing task with the relevant dataset name still allows me to search for the dataset by dataset_name=task_name / project_name (shared by both datas...
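That check is roughly this pattern (a sketch; it assumes Dataset.get raises ValueError when nothing matches, and the names are illustrative):

```python
from clearml import Dataset

def get_or_create_dataset(project: str, name: str) -> Dataset:
    """Return the existing dataset if found, otherwise create a new one."""
    try:
        return Dataset.get(dataset_project=project, dataset_name=name)
    except ValueError:
        # No matching dataset: create a fresh (empty) one instead.
        return Dataset.create(dataset_name=name, dataset_project=project)
```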
Hi Martin. See that ValueError: https://clearml.slack.com/archives/CTK20V944/p1657583310364049?thread_ts=1657582739.354619&cid=CTK20V944 Perhaps something else is going on?
Actually, re-running pipeline_from_decorator.py a second time (and a third time) from the command line seems to have executed without that ValueError, so maybe that issue was a fluke.
Nevertheless, those runs exit prior to the line `print('process completed')`.
and I would definitely prefer the executing_pipeline call to not kill the process that called it.
For example, maybe, having started the pipeline I'd like my code to also report having started the pipeline to som...
Trying the AWS Autoscaler for the first time, I get this error on instance spin-up:
`An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-04c0416d6bd8e4b1f]' does not exist`
I tried both us-west-2 and us-east-1b (thinking it might be zone specific).
I'm not sure if this is a permissions issue or a config issue.
The same occurs when I try a different image: ami-06bafe528da33cdb8
(an aws public image)
These paths are pathlib.Path. Would that be a problem?
cool. How can I get started with hyper datasets? is it part of the clearml package?
Is it limited to https://clear.ml/pricing/ accounts?
Hi John. Sort of. It seems that archiving pipelines does not also archive the tasks they contain, so /projects/lavi-testing/.pipelines/fastai_image_classification_pipeline is a very long list...
yes
here is the true "my_pipeline" declaration:
```
@PipelineDecorator.pipeline(
    name="fastai_image_classification_pipeline",
    project="lavi-testing",
    target_project="lavi-testing",
    version="0.2",
    multi_instance_support="",
    add_pipeline_tags=True,
    abort_on_failure=True,
)
def fastai_image_classification_pipeline(
    run_tags: List[str],
    i_dataset: int,
    backbone_names: List[str],
    image_resizes: List[int],
    batch_sizes: List[int],
    num_train_epochs: i...
```