maybe this line should take a timeout argument?
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/clearml/storage/helper.py#L1834
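To illustrate, something like this generic wrapper is what I have in mind; `do_upload` is a hypothetical stand-in for the blocking helper call linked above (a sketch, not clearml API):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def upload_with_timeout(do_upload, timeout_sec=300):
    """Run a zero-argument upload callable, raising if it exceeds timeout_sec.

    Note: the worker thread cannot be killed, so on timeout it is left to
    finish in the background; this only unblocks the caller.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(do_upload)
    try:
        return future.result(timeout=timeout_sec)
    except TimeoutError:
        raise RuntimeError(f"upload did not finish within {timeout_sec}s")
    finally:
        pool.shutdown(wait=False)
```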
I ran another version of the above code where `output_uri="./random_dataset_local_target"`
(i.e. the dataset target is on local disk instead of GCP).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folde...
Thanks TimelyPenguin76.
From your reply I understand that I have control over what the destination is, but that all files generated in a task get transferred regardless of the `return_values` decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in ClearML, and for matplotlib figures to be uploaded (perhaps to GCS), but I'd like to avoid ...
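For instance, something along these lines is what I have in mind (a sketch assuming `auto_connect_frameworks` accepts a per-framework dict; the exact key names may vary across clearml versions):

```python
from clearml import Task

# keep tensorboard/matplotlib capture but skip automatic model upload;
# the framework key names below are my assumption
task = Task.init(
    project_name="great project",
    task_name="selective_capture",
    auto_connect_frameworks={
        "tensorboard": True,   # keep tensorboard events
        "matplotlib": True,    # keep figure capture
        "pytorch": False,      # don't auto-save model checkpoints as artifacts
    },
)
```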
first, thanks for having these discussions. I appreciate that this kind of support takes effort 🙏
Yes. I perfectly understand that once a pipeline job (or a task) is sent off in this manner, it executes separately (and most likely on a different machine) from the process that instantiated it.
I still feel strongly that such a command should not be treated as a fire-and-exit operation. I can think of several scenarios where continued execution of the instantiating process is desired:
I ...
Hi TimelyPenguin76
Thanks for working on this. The ClearML GCP autoscaler is a major feature for us. I can't really evaluate ClearML without some means of instantiating multiple agents on GCP machines, and I'd really prefer not to have to set up a k8s cluster with agents and manage its scaling myself.
I tried the settings above with two resources, one for the default queue and one for the services queue (making sure I used the image you suggested above for both).
The autoscaler started up...
uploads are a bit slow though (~4 minutes for 50 MB)
If `Dataset.upload()` does not crash or return a success value that I can check and ...
Are you saying that with this error showing, the upload does not crash? (edited)
Unfortunately that is correct. It continues as if nothing happened!
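As a workaround I'm considering verifying the upload myself after the fact, along these lines (a sketch; the dataset id, the local folder, and the path-comparison logic are placeholders/assumptions):

```python
import os
from clearml import Dataset

# fetch the just-"uploaded" dataset and compare its file list with the source folder
ds = Dataset.get(dataset_id="<dataset_id>")  # placeholder id
remote = set(ds.list_files())
local = {
    os.path.relpath(os.path.join(root, f), "./random_dataset")
    for root, _, files in os.walk("./random_dataset")
    for f in files
}
missing = local - remote
if missing:
    raise RuntimeError(f"dataset upload silently dropped {len(missing)} file(s)")
```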
To replicate this on Linux (even with `max_workers=1`):
Follow https://averagelinuxuser.com/limit-bandwidth-linux/ to throttle your connection: `sudo apt-get install wondershaper`
Throttle your connection to 1mb/s with somethin...
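Then, with the connection throttled, run something like this sketch (the bucket and folder names are placeholders, and I'm assuming `Dataset.upload` takes an `output_url` argument):

```python
from clearml import Dataset

# upload a local folder over the throttled connection and watch the console
# for timeouts that never surface as exceptions
ds = Dataset.create(dataset_project="tests", dataset_name="throttle_repro")
ds.add_files(path="./random_dataset")                      # placeholder folder
ds.upload(output_url="gs://some-bucket/datasets", max_workers=1)  # placeholder bucket
ds.finalize()
```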
also weirdly, the failed pipeline task is sometimes marked as failed and at other times it is marked as completed
start a training task. From what I can tell from the console log, the agent hasn't actually started running the component.
This is the component code. It is a wrapper around a non-component training function:
```python
@PipelineDecorator.component(
    return_values=["run_model_path", "run_info"],
    cache=True,
    task_type=TaskTypes.training,
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
    packages="./requirements.txt",
)
def train_image_classifier_component(
    ...
```
I noticed that the base docker image does not appear in the autoscaler task's `configuration_object`, which is:
```json
[{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gp...
```
thanks KindChimpanzee37. Where is that minimal example to be found?
I've also not figured out how to modify the examples above to wait for one pipeline to end before the next begins.
I'll try a more carefully checked run a bit later, but I know it's getting a bit late in your time zone.
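On the waiting question, the closest I've come up with is polling the pipeline's controller task, something like this sketch (the project and task names are placeholders, and I'm assuming the pipeline runs as its own Task):

```python
from clearml import Task

# look up the running pipeline's controller task and block until it finishes
pipeline_task = Task.get_task(
    project_name="pipelines_project",  # placeholder
    task_name="my_pipeline",           # placeholder
)
pipeline_task.wait_for_status(
    status=(Task.TaskStatusEnum.completed,),
    raise_on_status=(Task.TaskStatusEnum.failed,),
    check_interval_sec=30,
)
# ...now it should be safe to start the next pipeline
```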
BTW: if I try to find the right model in `task.models["output"]` (this time there is just one, but in my code there may be several), it appears with the task name (see the other attached screenshot).
What would make sense here? (I have to be honest, I'm not sure.)
If the model was saved with a file name (is that the trigger for auto-upload?), I think it makes sense for the model name to match the file name (not the task name), especially when there may be ...
To be specific, there is the "model name", which is not unique, and there is the model key, which is unique to the Task.
not sure why the two fields don't simply match. I guess there may be situations where the file name (without the full path) is used several times.
anyhow - looks like the keys are simple enough to use (so I can just ignore the model names)
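For reference, this is roughly how I'm inspecting the key/name mismatch (the task id is a placeholder):

```python
from clearml import Task

task = Task.get_task(task_id="<task_id>")  # placeholder id
output_models = task.models["output"]
# dict-style access, as discussed above: unique keys mapping to Model objects
for key in output_models.keys():
    model = output_models[key]
    print(f"key={key!r}  name={model.name!r}")
```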
(I'm going to stop the autoscaler, terminate all the instances and clone the autoscaler and retry it all from the beginning)
also - some issue on the autoscaler side:
Ooh nice.
I wasn't aware `task.models["output"]` also acts like a dict. I can get the one I care about in my code with something like `task.models["output"]["best_model"]`; however, can you see the inconsistency between the key and the name there:
yes. several checkpoints + the one that did best on validation data.
actually, re-running pipeline_from_decorator.py a second time (and a third time) from the command line seems to have executed without that ValueError, so maybe that issue was some fluke.
Nevertheless, those runs exit prior to the line `print('process completed')`, and I would definitely prefer the command `executing_pipeline` not to kill the process that called it.
For example, maybe, having started the pipeline, I'd like my code to also report having started the pipeline to som...
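In the meantime I may try isolating the call in a child process so the caller survives, something like this sketch (pipeline arguments elided):

```python
import multiprocessing

def _launch():
    # executing_pipeline is the decorated pipeline function from the example;
    # if it calls exit(), only this child process dies
    executing_pipeline()  # pipeline arguments elided

if __name__ == "__main__":
    proc = multiprocessing.Process(target=_launch)
    proc.start()
    proc.join()
    print("process completed")  # reached even if the child exited early
```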
I'll give it a try.
And if I wanted to support GPU in the default queue, are you saying that I'd need a different machine from the `n1-standard-1`?
here is the code in text if you feel like giving it a try:
```python
import tensorboard_logger as tb_logger
from clearml import Task

task = Task.init(project_name="great project", task_name="test_tb_logging")
task_tb_logger = tb_logger.Logger(logdir='./tb/run1', flush_secs=2)
for i in range(10):
    task_tb_logger.log_value("some_metric", 42, i)
task.close()
```
I'm on clearml 1.6.2
The Jupyter notebook service and two clearml-agents (version 1.3.0; one in queue "default" and one in queue "services", with the --cpu-only flag) are all running inside a docker container
Hi Martin. See that ValueError here:
https://clearml.slack.com/archives/CTK20V944/p1657583310364049?thread_ts=1657582739.354619&cid=CTK20V944
Perhaps something else is going on?
or, barring that, something similar on AWS?
another weird thing:
Before my training task is done, `print(task.models['output'].keys())` outputs:
```
odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])
```
after `task.close()` I can do:
```python
task = Task.get_task(task_id)
for i in range(100):
    print(task.models["output"].keys())
```
which prints
```
odict_keys(['Output Model #0', 'Output Model #1', 'Output Model #2'])
```
in the first iteration and prints the file names in the later iterations:
```
od...
```
Trying to switch to a resource using GPU-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without GPU, even though I checked that my settings (`n1-standard-1`, `nvidia-tesla-t4`, and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make VM instances and my gcp autoscaler...
feature request: tell me what gets passed along each edge of the pipeline graph