Yes.
Some mechanism that would allow for follow-up code execution, ideally in a way that is not susceptible to the same things that may cause a task to fail.
I'm looking for a minimal set of permissions because we have other sensitive EC2 instances running in the same account, and our IT people are rightfully concerned about providing external access to that account.
My local environment has clearml version 1.6.3rc0
and the agents in AWS were started with the AWS Autoscaler, which has no explicit place for Google credentials.
I see a place for Additional ClearML Configuration in the AWS Autoscaler UI, which I suspect may help, but I don't see how I can pass a secrets file along with my agent.
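For what it's worth, this is a sketch of the kind of snippet I imagine going into that Additional ClearML Configuration field. It assumes the clearml.conf `files` section (which writes files on the agent machine at startup) combined with `sdk.google.storage`; the bucket name and the ~/gs.cred path are the ones that come up later in this thread, the `gsc` key is an arbitrary label, and the placeholder is meant to be replaced with the (escaped) contents of the credentials JSON. An absolute path may be safer than ~ here.

files {
    gsc {
        contents: "<this is your GCP storage credentials file>"
        path: "~/gs.cred"
    }
}
sdk {
    google.storage {
        credentials = [
            {
                bucket: "clearml-evaluation"
                credentials_json: "~/gs.cred"
            },
        ]
    }
}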
We're using a self-hosted clearml server version 1.14.0
It seems to be doing ok on the app side:
I didn't realise Datasets had tasks associated with them but there is one and it seems to be doing ok.
I've attached its log file, which only mentions skipping one file (a warning).
Thanks AgitatedDove14 for all the guidance.
Thanks ! 🎉
I'll give it a try.
I think that clearml should be able to do parameter sweeps using pipelines in a manner that makes use of parallelisation.
If that's not happening with the new RC, I wonder how I would do a parameter sweep within the pipelines framework.
For example - how would this task-based example be done with pipelines?
https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py
I'm thinking of a case where you want t...
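To make the question concrete, here is roughly what I have in mind (a minimal sketch, not code I've run; the component/pipeline names and the toy "training" are made up): call a decorated component once per parameter value inside the pipeline function, and rely on the steps being independent so they can run in parallel on the agents.

from clearml import TaskTypes
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(return_values=["accuracy"], task_type=TaskTypes.training)
def train_one(learning_rate: float):
    # stand-in for real training; each call becomes its own Task/step
    accuracy = 1.0 - abs(learning_rate - 0.01)
    return accuracy


@PipelineDecorator.pipeline(name="lr_sweep", project="sweep_example", version="0.0.1")
def sweep(learning_rates=(0.001, 0.01, 0.1)):
    # one step per value; no data dependencies between them,
    # so they should be able to run concurrently
    results = {lr: train_one(learning_rate=lr) for lr in learning_rates}
    # accessing a returned value waits for that step to complete
    for lr, acc in results.items():
        print(f"lr={lr}: accuracy={acc}")


if __name__ == "__main__":
    # PipelineDecorator.run_locally()  # uncomment to debug the whole thing locally
    sweep()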
there may have been some interaction between the training task and a preceding dataset creation task :shrug:
no retry messages
CLEARML_FILES_HOST is set to a gs:// bucket
CLEARML_API_HOST is a self hosted clearml server (in google compute engine).
Note that earlier in the process the code uploads a dataset just fine
That's amazing speed 🚀
That would be a better message; however, I must have misunderstood the meaning of auto_create=True.
I thought that flag made the get function into a "get-or-create"
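i.e. I expected something along these lines to either fetch the existing dataset or create an empty one (the project/name here are placeholders):

from clearml import Dataset

ds = Dataset.get(
    dataset_project="some_project",  # placeholder
    dataset_name="some_dataset",     # placeholder
    auto_create=True,                # my (apparently mistaken) reading: "get or create"
)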
To be specific, there is "model name", which is not unique, and there is model-key, which is unique to the Task.
Not sure why the two fields don't simply match. I guess there may be situations where the file name (without the full path) is used several times.
I found that instead of returning some_returned_url (which triggers zipping and saving of the files under that url), I can wrap it in a dict: {"the url": some_returned_url}, which then lets me pass the url back to the pipeline, and only that dict gets uploaded (e.g. {'run_datasets_path': Path('/data/my_datasets_path/run_id_1')} ). I can divert all files that I do want uploaded and tracked by clearml to gs:// by adding, at the start of the task function: ` Logger.current_logger().se...
Is there a way to set the default upload destination for all tasks in my ~/clearml.conf
.. yes by setting files_server: gs://clearml-evaluation/
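For reference, the relevant bits of my ~/clearml.conf would look roughly like this (the default_output_uri line is an assumption on my part about also setting a default upload destination for task models/artifacts):

api {
    # api_server / web_server / credentials unchanged
    files_server: gs://clearml-evaluation/
}
sdk {
    development {
        # assumed alternative/complement: default destination for task models and artifacts
        default_output_uri: gs://clearml-evaluation/
    }
}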
Restarting the autoscaler, the instances, and a single running pipeline - I still get the same error:
clearml.utilities.locks.exceptions.LockException: [Errno 11] Resource temporarily unavailable
now trying with added lines as Alon suggested:
` @PipelineDecorator.component(
return_values=["run_model_path", "run_info"],
cache=True,
task_type=TaskTypes.training,
repo="git@github.com:shpigi/clearml_evaluation.git",
repo_branch="main",
packages="./requirements.txt",
)
def train_image_classifier_component(
clearml_dataset,
backbone_name,
image_resize: int,
batch_size: int,
run_model_uri,
run_tb_uri,
local_data_path,
num_epochs: int,
)...
so..
I restarted the autoscaler with this configuration object:
` [{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": fa...
Thanks for the fix and the mock HPO example code !
Pipeline behaviour with the fix is looking good.
I see the point about changes to data inside the controller possibly causing dependencies for step 3 (or, at least, making it harder for the interpreter to know).
Note that the same model files were previously also generated by a non-parallelized version of the same pipeline without the out-of-space error, but the storage manager was downloading zip files in that version as well (maybe those files were downloaded and then removed as the object reference counts went to 0?).
Sort of, though it seems like the rules for model.name can be a bit non-obvious.
I think that the first model saved gets the task name as its name and the following models take f"{task_name} - {file_name}"
here is the code in text if you feel like giving it a try:
import tensorboard_logger as tb_logger
from clearml import Task

task = Task.init(project_name="great project", task_name="test_tb_logging")
task_tb_logger = tb_logger.Logger(logdir='./tb/run1', flush_secs=2)
for i in range(10):
    task_tb_logger.log_value("some_metric", 42, i)
task.close()
That's strange because, opening the currently running autoscaler config, I see this:
I now get this error:
2022-07-18 21:51:29,168 - clearml.storage - ERROR - Failed creating storage object Reason: [Errno 2] No such file or directory: '~/gs.cred'
to be clear, I replaced <this is your GCP storage credentials file> with the contents of that file, escaping every " with a \" and removing newlines.
I can't find version 1.8.1rc1 but I believe I see a relevant change in code of Dataset.upload in 1.8.1rc0
I suppose one way to perform this is with a https://clear.ml/docs/latest/docs/references/sdk/scheduler that kicks off a health-check task (checking the exit state of executed tasks). It seems more efficient to support a triggered response to task failure.
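Something along these lines is what I'm imagining (a rough sketch based on my reading of clearml.automation.TriggerScheduler; argument names may be off, and the project/queue names are placeholders):

from clearml.automation import TriggerScheduler


def on_task_failed(task_id: str):
    # hypothetical follow-up code: send an alert, enqueue a cleanup/retry task, etc.
    print(f"Task {task_id} failed")


trigger = TriggerScheduler(pooling_frequency_minutes=3)
trigger.add_task_trigger(
    name="failure-watch",
    trigger_project="my_project",     # placeholder
    trigger_on_status=["failed"],
    schedule_function=on_task_failed,
)
# run the trigger scheduler itself as a task on the services queue so it outlives my local machine
trigger.start_remotely(queue="services")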
I think this should be a valid use of pipelines. For example, at some step I choose to sweep across several values of some parameter, and the rest of the steps are duplicated for each value of that parameter.
The additional edges in the graph suggest that these steps somehow contain dependencies that I do not wish them to have.
There may be cases where failure occurs before my code starts to run (and, perhaps, after it completes)

