Sure. It is a minor change from the code in the clearml examples for pipelines.
I just repeat the last two pipeline steps from that code in a loop (x3)
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
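Roughly, the change looks like the sketch below (a minimal sketch only; step_two / step_three are stand-ins for the example's last two steps, not the actual function names from that file):
```python
from clearml.automation.controller import PipelineDecorator

# stand-ins for the last two steps of the decorator example (names are mine)
@PipelineDecorator.component(return_values=["data"])
def step_two(data):
    return data + 1

@PipelineDecorator.component(return_values=["result"])
def step_three(data):
    return data * 2

@PipelineDecorator.pipeline(name="looped_pipeline", project="examples", version="0.0.1")
def run_pipeline():
    data = 1
    result = None
    for _ in range(3):              # repeat the last two steps in a loop (x3)
        data = step_two(data)
        result = step_three(data)
    print(result)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # or point it at a queue to run remotely
    run_pipeline()
```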
I now get this error:
2022-07-18 21:51:29,168 - clearml.storage - ERROR - Failed creating storage object
Reason: [Errno 2] No such file or directory: '~/gs.cred'
To be clear, I replaced <this is your GCP storage credentials file> with the contents of that file, escaping every " with a \" and removing newlines.
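For reference, this is roughly how I produced that string (just my own quick approach; the path is my local credentials file):
```python
# turn the GCP credentials JSON into the single-line, quote-escaped string
# that I pasted in place of <this is your GCP storage credentials file>
with open("gs.cred") as f:          # my local credentials file
    cred = f.read()

escaped = cred.replace("\n", "").replace('"', '\\"')
print(escaped)
```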
feature request: tell me what gets passed along each edge of the pipeline graph
It seems to be doing ok on the app side:
I didn't realise Datasets had tasks associated with them, but there is one and it seems to be doing ok.
I've attached its log file, which only mentions skipping one file (a warning).
Ooh nice.
I wasn't aware that task.models["output"] also acts like a dict.
I can get the one I care about in my code with something like task.models["output"]["best_model"]
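In code that looks something like this (a minimal sketch; the task id and model name are placeholders from my setup, and I'm relying on the dict-style lookup by name working as described above):
```python
from clearml import Task

task = Task.get_task(task_id="<my-task-id>")   # placeholder task id
output_models = task.models["output"]          # the output models of the task
best_model = output_models["best_model"]       # look the model up by its name
print(best_model.id, best_model.url)
```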
However, can you see the inconsistency between the key and the name there:
Thanks AgitatedDove14 for all the guidance.
That job was using clearml 1.8.3 so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers = number of cores, but looking at the log it does seem like it's doing one chunk every 5 minutes (a long time for a 500 MB upload for a node running in GCP...)
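For what it's worth, this is what I'd try to check whether it is already effectively serial (project, name, and path are placeholders for my dataset):
```python
from clearml import Dataset

ds = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")
ds.add_files("/path/to/data")     # ~500 MB of files in my case
ds.upload(max_workers=1)          # explicitly cap parallel uploads to a single worker
ds.finalize()
```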
Cool. How can I get started with hyper datasets? Is it part of the clearml package?
Is it limited to https://clear.ml/pricing/?gclid=Cj0KCQjw5ZSWBhCVARIsALERCvzehkqVOiqJPaum5fsVyyTNMKce91PBHZd1IhQpEFaKvV7toze2A_0aAgXXEALw_wcB accounts?
That would be a better message. However, I must have misunderstood the meaning of auto_create=True: I thought that flag made the get function into a "get-or-create".
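i.e. I expected something like this to work (names are placeholders from my setup):
```python
from clearml import Dataset

# my (apparently wrong) reading of the flag: create the dataset if it doesn't exist yet
ds = Dataset.get(
    dataset_project="my_project",
    dataset_name="model_evals",
    auto_create=True,
)
```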
I ran another version of the above code where output_uri="./random_dataset_local_target"
(i.e. db target on local disk instead of GCP).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folde...
I imagine that one workaround is to:
- Disable automatic model uploads
- Perform manual model upload (with the correct name)
Can you point me to how to do these?
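My guess at what that would look like (a sketch only; the framework key and file name are assumptions about my training code, not something I've confirmed):
```python
from clearml import Task, OutputModel

task = Task.init(
    project_name="my_project",
    task_name="train",
    auto_connect_frameworks={"pytorch": False},   # 1. disable automatic model uploads
)

# ... training code writes best_model.pt ...

# 2. manual model upload, with the name I actually want
output_model = OutputModel(task=task, name="best_model")
output_model.update_weights(weights_filename="best_model.pt")
```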
Thanks,
Just to be clear, you are saying the "random" results are consistent over runs?
yes!
By re-runs I mean re-running this script (not cloning the pipeline)
multi_instance_support=True lets me run the pipeline again 👍
The second run prints out the same (non) "random" numbers as the first run
TimelyPenguin76, this turned out to be the reason I was having locking issues https://clearml.slack.com/archives/CTK20V944/p1658761943458649 :
SweetBadger76, CostlyOstrich36: I've attempted essentially the same thing before https://clearml.slack.com/archives/CTK20V944/p1657124102133519 and I thought it had worked in the past, so I'm not sure why it is failing me now.
Yes. I thought this happened automagically with the current git repo when I send a pipeline for execution from my local python environment. Shouldn't it?
It seems to have happened with the agent running the pipeline task.
I'll try adding repo and repo_branch to the pipeline.component decorator
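Something along these lines (a sketch; the step, repo URL, and branch are placeholders for mine):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    return_values=["result"],
    repo="https://github.com/my-org/my-repo.git",   # placeholder repo URL
    repo_branch="main",                             # placeholder branch name
)
def eval_step(value):
    return value * 2
```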
essentially, several running processes were performing:
model_evals_dataset = Dataset.get(
    dataset_project=dataset_project,
    dataset_name="model_evals",
)
model_evals_dataset.add_files(run_eval_path)
model_evals_dataset.upload()
Trying the AWS Autoscaler for the first time, I get this error on instance spin-up:
An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-04c0416d6bd8e4b1f]' does not exist
I tried both us-west-2 and us-east-1b (thinking it might be zone specific).
I'm not sure if this is a permissions issue or a config issue.
The same occurs when I try a different image: ami-06bafe528da33cdb8 (an AWS public image).
I think this should be a valid use of pipelines. For example, at some step I choose to sweep across several values of some parameter, and the rest of the steps are duplicated for each value of that parameter.
The additional edges in the graph suggest that these steps somehow contain dependencies that I do not wish them to have.
In fact, all my projects seem empty of tasks.
Perhaps anecdotal, but just calling random.seed() will set the seed using the system time for you:
https://docs.python.org/3/library/random.html#random.seed
Trying to switch to a resource using GPU-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without a GPU, even though I checked that my settings (n1-standard-1 and nvidia-tesla-t4 and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make VM instances and my GCP autoscaler...
so..
I restarted the autoscaler with this configuration object:
` [{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": fa...
Here are screenshots of a VM I started with a GPU and one started by the autoscaler with the settings above but whose GPU is missing (both in the same GCP zone, us-central1-f). I may have misconfigured something, or perhaps the autoscaler is failing to specify the GPU requirement correctly. :shrug:
I suppose one way to perform this is with a https://clear.ml/docs/latest/docs/references/sdk/scheduler that kicks off a health-check task (checking the exit state of executed tasks). It seems more efficient to support a triggered response to task failure.
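e.g. something like this with the scheduler (a sketch only, not tested; the health-check task id, queue, and interval are all placeholders/assumptions of mine):
```python
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="<health-check-task-id>",  # an existing task that inspects exit states
    queue="services",                           # queue the health check runs on
    minute=30,                                  # schedule granularity in minutes
)
scheduler.start_remotely(queue="services")
```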
Re: "re-running this code produces the same printouts"
I guess repeatable behaviour is a great default to have for, well, repeatability 🙂
I'm able to "randomize" my results by adding a seed
pipeline argument and calling random.seed(seed)
within the pipeline and component. Results then change with change of seed.
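In sketch form (component and pipeline names are placeholders; the structure mirrors what I did):
```python
import random
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["sample"])
def draw_sample(seed):
    random.seed(seed)      # re-seed inside the component's process
    return random.random()

@PipelineDecorator.pipeline(name="seeded_pipeline", project="examples", version="0.0.1")
def run_pipeline(seed=42):
    random.seed(seed)      # re-seed the pipeline logic itself
    print(draw_sample(seed))

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    run_pipeline(seed=123)   # changing the seed changes the "random" results
```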
I think most veteran ML practitioners are bitten at some point by randomising when they shouldn't and not randomising when they should. It would be nice to have some docu...
Just updating here that I got the AWS autoscaler working with CostlyOstrich36 ’s generous help 🎉
I thought I'd share here some details in case others experience similar difficulties
With regards to permissions, this is the list of actions that the autoscaler uses, which your AWS account would need to permit:
GetConsoleOutput
RequestSpotInstances
DescribeSpotInstanceRequests
RunInstances
DescribeInstances
TerminateInstances
the instance image ` ami-04c0416d6bd8e...