thanks. Seems like I was on the right path. Do datasets specified as parents need to be finalized ( https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset )?
console output shows uploads of 500 files on every new dataset. The lineage is as expected, each additional upload is the same size as the previous ones (~50mb), and Dataset.get on the last dataset's ID retrieves all the files from the separate parts to one local folder.
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files
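For context, a minimal sketch of the flow I'm describing (project/dataset names and paths are made up), finalizing each parent before using it:
```python
from clearml import Dataset

# first part, finalized before being used as a parent
parent = Dataset.create(dataset_project="my_project", dataset_name="part_1")
parent.add_files("data/part_1")
parent.upload()
parent.finalize()

# second part, built on top of the first
child = Dataset.create(
    dataset_project="my_project",
    dataset_name="part_2",
    parent_datasets=[parent.id],
)
child.add_files("data/part_2")
child.upload()
child.finalize()

# getting the last dataset retrieves the files from all parts
local_path = Dataset.get(dataset_id=child.id).get_local_copy()
```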
Sure. It is a minor change from the code in the clearml examples for pipelines.
I just repeat the last two pipeline steps from that code in a loop (x3)
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
I now get this error:
2022-07-18 21:51:29,168 - clearml.storage - ERROR - Failed creating storage object
Reason: [Errno 2] No such file or directory: '~/gs.cred'
To be clear, I replaced <this is your GCP storage credentials file> with the contents of that file, escaping every " with a \" and removing newlines.
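i.e. roughly this, to produce the value I pasted in (the filename is just an example):
```python
# read the credentials file, drop newlines, escape the double quotes
with open("gs.cred") as f:
    contents = f.read().replace("\n", "")
print(contents.replace('"', '\\"'))
```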
feature request: tell me what gets passed along each edge of the pipeline graph
It seems to be doing ok on the app side:
I didn't realise Datasets had tasks associated with them, but there is one and it seems to be doing ok.
I've attached its log file, which only mentions skipping one file (a warning)
Ooh nice.
I wasn't aware task.models["output"] also acts like a dict.
I can get the one I care about in my code with something like task.models["output"]["best_model"], however can you see the inconsistency between the key and the name there:
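For reference, the access pattern I mean (the task ID and model name are placeholders; the name lookup is the dict-like behaviour mentioned above):
```python
from clearml import Task

task = Task.get_task(task_id="<task_id>")  # placeholder

# task.models acts like a dict with "input"/"output" keys, and the
# "output" entry also supports lookup by model name
best_model = task.models["output"]["best_model"]
print(best_model.name)
```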
Thanks AgitatedDove14 for all the guidance.
That job was using clearml 1.8.3 so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers = number of cores, but looking at the log it does seem like it's doing one chunk every 5 minutes (a long time for a 500MB upload from a node running in GCP...)
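i.e. the experiment I have in mind (project/name/path are placeholders; assumes a clearml version where upload() accepts max_workers):
```python
from clearml import Dataset

ds = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")
ds.add_files("data/")
ds.upload(max_workers=1)  # single worker, instead of the default (number of cores)
```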
cool. How can I get started with hyper datasets? Is it part of the clearml package?
Is it limited to https://clear.ml/pricing/ accounts?
That would be a better message. However, I must have misunderstood the meaning of auto_create=True; I thought that flag made the get function into a "get-or-create"
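i.e. I expected this to create the dataset if it doesn't exist (project/name are placeholders):
```python
from clearml import Dataset

ds = Dataset.get(
    dataset_project="my_project",
    dataset_name="my_dataset",
    auto_create=True,  # what I read as "get-or-create"
)
```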
I ran another version of the above code where output_uri="./random_dataset_local_target"
(i.e. db target on local disk instead of gcp).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folde...
I imagine that one workaround is to:
1. Disable automatic model uploads
2. Perform manual model upload (with the correct name)
Can you point me to how to do these?
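My guess at what this might look like (assuming a PyTorch job; project/task/model names and the weights path are placeholders):
```python
from clearml import Task, OutputModel

# 1. disable automatic framework model logging
task = Task.init(
    project_name="my_project",   # placeholders
    task_name="my_task",
    auto_connect_frameworks={"pytorch": False},
)

# ... training, weights saved to best_model.pt ...

# 2. manual upload under the name I actually want
model = OutputModel(task=task, name="best_model")
model.update_weights(weights_filename="best_model.pt")
```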
Thanks,
Just to be clear, you are saying the "random" results are consistent over runs ?
yes!
By re-runs I mean re-running this script (not cloning the pipeline)
multi_instance_support=True
lets me run the pipeline again 👍
The second run prints out the same (non) "random" numbers as the first run
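For reference, where I set multi_instance_support (name/project are placeholders; the decorator is as in the clearml pipeline examples):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.pipeline(
    name="my_pipeline",           # placeholders
    project="my_project",
    version="0.1",
    multi_instance_support=True,  # allows launching the pipeline again
)
def pipeline_logic():
    ...
```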
TimelyPenguin76 , this turned out to be the reason I was having locking issues https://clearml.slack.com/archives/CTK20V944/p1658761943458649 :
SweetBadger76 , CostlyOstrich36 : I've attempted essentially the same thing before https://clearml.slack.com/archives/CTK20V944/p1657124102133519 and I thought it had worked in the past so I'm not sure why it is failing me now.
Yes. I thought this happened automagically with the current git repo when I send a pipeline for execution from my local python environment. Shouldn't it?
It seems to have happened with the agent running the pipeline task.
I'll try adding repo and repo_branch to the pipeline.component decorator
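i.e. something like this (the repo URL and branch are placeholders):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    repo="https://github.com/my-org/my-repo.git",  # placeholder repo
    repo_branch="main",
)
def my_step():
    ...
```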
essentially, several running processes were performing:
```
model_evals_dataset = Dataset.get(
    dataset_project=dataset_project,
    dataset_name="model_evals",
)
model_evals_dataset.add_files(run_eval_path)
model_evals_dataset.upload()
```
trying the AWS Autoscaler for the first time I get this error on instance spin up:
An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-04c0416d6bd8e4b1f]' does not exist
I tried both us-west-2 and us-east-1b (thinking it might be zone specific).
I'm not sure if this is a permissions issue or a config issue.
The same occurs when I try a different image: ami-06bafe528da33cdb8 (an aws public image)
My local environment has clearml version 1.6.3rc0, and agents in aws were started with the AWS Autoscaler, which has no explicit place for google credentials.
I see a place for Additional ClearML Configuration in the AWS autoscaler UI which I suspect may help, but I don't see how I can pass a secrets file along with my agent.
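My best guess is something like this in that box (the credentials_json key is from the clearml.conf reference; the path would have to exist on the agent machine, which is exactly what I don't have):
```
sdk {
    google.storage {
        credentials_json: "/path/to/gs.cred"
    }
}
```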
I think this should be a valid use of pipelines. For example: at some step I choose to sweep across several values of some parameter, and the rest of the steps are duplicated for each value of that parameter.
The additional edges in the graph suggest that these steps somehow contain dependencies that I do not wish them to have.
In fact, all my projects seem empty of tasks.
perhaps anecdotal, but just calling random.seed() will set the seed using the system time for you:
https://docs.python.org/3/library/random.html#random.seed
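e.g.:
```python
import random

random.seed(0)          # fixed seed -> deterministic sequence
print(random.random())

random.seed()           # no argument -> re-seed from OS entropy / system time
print(random.random())  # unpredictable again
```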
Trying to switch to a resource that uses gpu-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without a gpu, even though I checked that my settings (n1-standard-1 and nvidia-tesla-t4 and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make vm instances and my gcp autoscaler...
so..
I restarted the autoscaler with this configuration object:
```
[
  {
    "resource_name": "cpu_default",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": null,
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 5,
    "queue_name": "default",
    "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131",
    "disk_size_gb": 100
  },
  {
    "resource_name": "cpu_services",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": null,
    "gpu_count": 1,
    "preemptible": fa...
```