Oh, cool. So would this then report the activities of the spawned processes to the same task as that of the spawning process?
Yes.
Some mechanism that would allow for follow-up code execution, ideally in a way that would not be susceptible to the same things that may cause a task to fail.
It seems to be doing ok on the app side:
I didn't realise Datasets had tasks associated with them, but there is one and it seems to be doing ok.
I've attached its log file, which only mentions skipping one file (a warning).
Would setting max_workers to 1 be a (slower) workaround?
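(i.e. something like this, as a minimal sketch, assuming get_local_copy() accepts a max_workers argument in the installed clearml version:)
```
from clearml import Dataset

# hypothetical project/dataset names; the point is forcing a single download worker
ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_path = ds.get_local_copy(max_workers=1)  # assumes this kwarg exists in this clearml version
```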
Sure. It is a minor change from the code in the clearml examples for pipelines.
I just repeat the last two pipeline steps from that code in a loop (x3)
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
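Roughly this shape (a minimal sketch; the step names and bodies are placeholders, the looping pattern is the only point):
```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["data"])
def step_train(data):
    # placeholder "train" step
    return data

@PipelineDecorator.component(return_values=["score"])
def step_eval(data):
    # placeholder "eval" step
    return 0.0

@PipelineDecorator.pipeline(name="looped pipeline", project="examples", version="0.0.1")
def executing_pipeline():
    data = {"x": 1}
    # repeat the last two pipeline steps in a loop (x3)
    for _ in range(3):
        data = step_train(data)
        score = step_eval(data)

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    executing_pipeline()
```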
or, barring that, something similar on AWS?
In case anyone else is interested. We found two alternative solutions:
1. Repeating the first steps, but from within a Docker container ( docker run -it --rm python:3.9 bash ), worked for me.
2. Alternatively: the example tasks (or at least those I've tried) that appear in the ClearML examples within a new workspace have clearml==0.17.5 (an old clearml version) listed in "INSTALLED PACKAGES". Updating the clearml package within the task to 1.5.0 let me run the clearml-agent daemon lo...
I now get this error:
2022-07-18 21:51:29,168 - clearml.storage - ERROR - Failed creating storage object Reason: [Errno 2] No such file or directory: '~/gs.cred'
To be clear, I replaced <this is your GCP storage credentials file> with the contents of that file, escaping every " with a \" and removing newlines.
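(For context, the section I mean has this shape; bucket and project are placeholders, and whether credentials_json should be a path or the escaped file contents is exactly what I'm unsure about:)
```
sdk {
    google.storage {
        credentials = [
            {
                bucket: "my-bucket"      # placeholder
                project: "my-project"    # placeholder
                credentials_json: "<this is your GCP storage credentials file>"
            }
        ]
    }
}
```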
To be specific, there is "model name", which is not unique, and there is the model-key, which is unique to the Task.
I'm not sure why the two fields don't simply match. I guess there may be situations where the file name (without the full path) is used several times.
Ooh nice.
I wasn't aware task.models["output"] also acts like a dict.
I can get the one I care about in my code with something like task.models["output"]["best_model"]
however can you see the inconsistency between the key and the name there:
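(For reference, the access pattern I mean; the task ID is hypothetical and "best_model" is just the name I save it under:)
```
from clearml import Task

task = Task.get_task(task_id="<task_id>")   # hypothetical ID
output_models = task.models["output"]
print(list(output_models.keys()))           # model names, dict-style
best_model = output_models["best_model"]    # the one I care about
print(best_model.name, best_model.url)
```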
I tried the first option and it worked 🙂 🙏
I had several pipeline components getting it and uploading files to it concurrently.
Can Datasets handle that?
Hey Alon,
See
https://clearml.slack.com/archives/CTK20V944/p1658892624753219
I was able to isolate this as a bug in clearml 1.6.3rc1 that can be reproduced outside of a task / app simply by doing get_local_copy() on a dataset with parents.
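A minimal repro sketch (project/dataset names are placeholders; the dataset just needs to have parent datasets):
```
from clearml import Dataset

ds = Dataset.get(dataset_project="my_project", dataset_name="child_dataset")
local_path = ds.get_local_copy()  # fails here under clearml 1.6.3rc1
print(local_path)
```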
another weird thing:
Before my training task is done:
`print(task.models['output'].keys())`
outputs
`odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])`
after task.close()
I can do:
```
task = Task.get_task(task_id)
for i in range(100):
    print(task.models["output"].keys())
```
which prints
`odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])`
in the first iteration,
and prints the file names in the latter iterations:
` od...
I'll do a clean relaunch of everything (scaler and pipeline)
essentially, several running processes were performing:
```
model_evals_dataset = Dataset.get(
    dataset_project=dataset_project,
    dataset_name=f"model_evals",
)
model_evals_dataset.add_files(run_eval_path)
model_evals_dataset.upload()
```
I imagine that one workaround is to:
1. Disable automatic model uploads
2. Perform manual model upload (with the correct name).
Can you point me to how to do these?
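My current guess at what that would look like, assuming auto_connect_frameworks and OutputModel are the right knobs (a sketch, not verified):
```
from clearml import Task, OutputModel

# 1. Disable automatic model uploads for the framework in use (pytorch here)
task = Task.init(
    project_name="my_project",  # placeholder
    task_name="my_task",        # placeholder
    auto_connect_frameworks={"pytorch": False},
)

# ... train and save the model to disk ...

# 2. Manually upload the model, with the name I actually want
output_model = OutputModel(task=task, name="best_model")
output_model.update_weights(weights_filename="best_model.pt")
```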
In fact, all my projects seem empty of tasks.
Trying to switch to a resource using gpu-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without a gpu, even though I checked that my settings ( n1-standard-1 and nvidia-tesla-t4 and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make VM instances, and my gcp autoscaler...
BTW:
If I try to find the right model in the task.models["output"] (this time there is just one, but in my code there may be several), it appears with the (see other attached screenshot).
What would make sense here? (I have to be honest, I'm not sure.)
If the model was saved with a file name (is that the trigger for auto-upload?), I think it makes sense for the model name to match the file name (not the task name), especially when there may be ...
Unfortunately, waiting a while did not make this go away 🙂
AgitatedDove14
Adding repo and repo_branch to the pipeline.component decorator worked (and I can move on to my next issue 🙂).
I'm still unclear on why cloning the repo in use happens automatically for the pipeline task and not for component tasks.
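For anyone searching later, the change that worked was along these lines (repo URL and branch are placeholders):
```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    return_values=["result"],
    repo="https://github.com/my-org/my-repo.git",  # placeholder
    repo_branch="main",                            # placeholder
)
def my_component(x):
    return x + 1
```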
I have google-cloud-storage==2.6.0 installed
(I'm going to stop the autoscaler, terminate all the instances and clone the autoscaler and retry it all from the beginning)
I suppose one way to perform this is with a TaskScheduler ( https://clear.ml/docs/latest/docs/references/sdk/scheduler ) that kicks off a health-check task (checking the exit state of executed tasks). It seems more efficient to support a triggered response to task failure.
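(Something like this sketch, assuming TaskScheduler is what that page documents; the task ID and queue are placeholders, and I'm reading minute=30 as "every 30 minutes":)
```
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="<health_check_task_id>",  # placeholder: a pre-made health-check task
    queue="default",
    minute=30,
)
scheduler.start()
```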
console output shows uploads of 500 files on every new dataset. The lineage is as expected, each additional upload is the same size as the previous ones (~50mb), and Dataset.get on the last dataset's ID retrieves all the files from the separate parts into one local folder.
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files
Another issue, which may or may not be related.
Running another pipeline (to see if I can reproduce the issue with simple code), it looks like the autoscaler has spun down all the instances for the default queue while a component was still running.
Both the pipeline view and the "All experiments" view show the component as running.
The component's console shows that the last command was a docker run command.