
I'm on clearml 1.6.2
The Jupyter notebook service and two clearml agents (version 1.3.0, one in the "default" queue and one in the "services" queue with the --cpu-only flag) are all running inside a docker container
Hmm interesting, so like a callback?!
like https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/examples/pipeline/pipeline_from_tasks.py#L54-L55 pipe-step level callbacks? I guess that mechanism could serve. Where do these callbacks run? In the instantiating process? If so, that would work (since the callback function can be any code I wish, right?)
I might want to dispatch other jobs from within the same process.
This is actually something t...
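If it's those, I imagine using them something like this (just a sketch pieced together from the linked example; the project/task/step names are placeholders):

```python
from clearml import PipelineController

# sketch: step-level callbacks that run in the controller (instantiating) process
def my_pre_callback(pipeline, node, param_override):
    # free to run any code here, e.g. dispatch other jobs from this process
    print(f"about to launch step {node.name} with overrides {param_override}")
    return True  # returning False would skip this step


def my_post_callback(pipeline, node):
    print(f"step {node.name} completed, executed task id: {node.executed}")


pipe = PipelineController(name="callback sketch", project="examples", version="0.0.1")
pipe.add_step(
    name="stage_train",
    base_task_project="examples",          # placeholder
    base_task_name="some training task",   # placeholder
    pre_execute_callback=my_pre_callback,
    post_execute_callback=my_post_callback,
)
pipe.start()
```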
Would you expect this fastai callback to work?
(Uses SummaryWriter):
https://github.com/fastai/fastai/blob/d7f4863f1ee3c0fa9f2d9feeb6a05f0625a53696/fastai/callback/tensorboard.py
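For context, this is roughly how I'm attaching it (a minimal sketch; the dataset, architecture, and log_dir are placeholders for illustration only):

```python
from fastai.vision.all import (
    untar_data, URLs, ImageDataLoaders, vision_learner, resnet18, error_rate,
)
from fastai.callback.tensorboard import TensorBoardCallback

# tiny stand-in training run with the tensorboard callback attached
path = untar_data(URLs.MNIST_TINY)
dls = ImageDataLoaders.from_folder(path)
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fit_one_cycle(
    1,
    cbs=[TensorBoardCallback(log_dir="/data/tb_logs", trace_model=False)],
)
```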
It seems to have failed as well (but I'd need to check more carefully)
Right. Thanks.
With several models saved by the training process (whose code is not task-aware) I suspect that doing the update call after training completed will only update the last of the uploaded models.
I'm currently looking at a workaround where:
1. Disable auto saving via https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk/#automatic-logging
2. Manually upload the models
3. Manually register the models with https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345b...
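Roughly what I have in mind (an untested sketch; file names and project/task names are placeholders, and the framework keys in auto_connect_frameworks are my guess at what needs disabling):

```python
from clearml import Task, OutputModel

# 1. disable automatic model capture for the frameworks doing the saving
task = Task.init(
    project_name="examples",                 # placeholder
    task_name="manual model registration",   # placeholder
    auto_connect_frameworks={"pytorch": False, "fastai": False},
)

# ... training code (not task-aware) saves several weight files locally ...

# 2 + 3. upload and register each saved file explicitly, one OutputModel per
#        file, so no upload gets overwritten by the next one
for weights_file in ["model_0.pth", "model_1.pth"]:   # placeholder file names
    OutputModel(task=task, name=weights_file).update_weights(
        weights_filename=weights_file,
        auto_delete_file=False,
    )
```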
Where was it running?
this message appears in the pipeline task's log. It is preceded by lines that reflect the storage manager downloading a corresponding zip file
I take it that these files are also brought onto the pipeline task's local disk?
Unless you changed the object, then no, they should not be downloaded (the "link" is passed)
The object is run_model_path
I don't seem to be changing it. I just pass it along from the training component to the evaluation compo...
Two values:
```python
@PipelineDecorator.component(
    return_values=["run_model_path", "run_tb_path"],
    cache=False,
    task_type=TaskTypes.training,
    packages=[
        "clearml",
        "tensorboard_logger",
        "timm",
        "fastai",
        "torch==1.11.0",
        "torchvision==0.12.0",
        "protobuf==3.19.*",
        "tensorboard",
        "google-cloud-storage>=1.13.2",
    ],
    repo="git@github.com:shpigi/clearml_evaluation.git",
    repo_branch="main",
)
def train_ima...
```
erm,
this parallelization has led to the pipeline task issuing a bunch of: model_path/run_2022_07_20T22_11_15.209_0.zip , err: [Errno 28] No space left on device
and quitting on me.
my train_image_classifier_component
is programmed to save model files to a local path which is returned (and, thanks to clearml, the path's contents are zipped and uploaded to the files service).
I take it that these files are also brought onto the pipeline task's local disk?
Why is that? If that is indeed what...
Note that the same model files were previously also generated by a non-parallelized version of the same pipeline without the out-of-space error, but a storage manager was downloading zip files in that version as well (maybe those files were downloaded and removed as the object reference counts went to 0?)
Thanks ! 🎉
I'll give it a try.
I think that clearml should be able to do parameter sweeps using pipelines in a manner that makes use of parallelisation.
If that's not happening with the new RC, I wonder how I would do a parameter sweep within the pipelines framework.
For example - how would this task-based example be done with pipelines?
https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py
I'm thinking of a case where you want t...
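Something like this decorator-based sketch is what I'm picturing (untested; the component body and metric are placeholders):

```python
from clearml import PipelineDecorator

# a sketch of a parallel parameter sweep inside a decorator-based pipeline
@PipelineDecorator.component(cache=False, packages=["clearml"])
def train_one(lr: float, batch_size: int) -> float:
    # stand-in for a real training run; returns the metric being swept over
    return 1.0 / (lr * batch_size)


@PipelineDecorator.pipeline(name="param_sweep", project="examples", version="0.0.1")
def sweep_pipeline():
    # launch all configurations first; each call queues a step and returns a
    # lazy result, so the steps can (hopefully) run in parallel
    runs = [
        (lr, bs, train_one(lr=lr, batch_size=bs))
        for lr in (0.1, 0.01)
        for bs in (16, 32)
    ]
    # only now read the results, which waits for the corresponding steps
    for lr, bs, metric in runs:
        print(f"lr={lr} bs={bs} -> metric={metric}")


if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run everything locally for the sketch
    sweep_pipeline()
```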
I imagine that one workaround is to
1. Disable automatic model uploads
2. Perform manual model upload (with the correct name)
Can you point me to how to do these?
BTW:
If I try to find the right model in the task.models["output"] (this time there is just one, but in my code there may be several), it appears with the task's name (see other attached screenshot).
What would make sense here ? (I have to be honest I'm not sure).
If the model was saved with a file name (is that the trigger for auto-upload?), I think it makes sense for the model name to match the file name (not the task name), especially when there may be ...
I imagine that these phantom dependencies will prevent parallelization. Is there a workaround?
Hi Martin. See that ValueError
https://clearml.slack.com/archives/CTK20V944/p1657583310364049?thread_ts=1657582739.354619&cid=CTK20V944 Perhaps something else is going on?
What I think would be preferable is that the pipeline be deployed and that the python process that deployed it be allowed to continue on to whatever I had planned for it to do next (i.e. not exit)
first, thanks for having these discussions. I appreciate this kind of support is an effort 🙏
Yes. I perfectly understand that once a pipeline job (or a task) is sent off in this manner, it executes separately (and, most likely, on a different machine) from the process that instantiated it.
I still feel strongly that such a command should not be thought of as a fire and exit operation. I can think of several scenarios where continued execution of the instantiating process is desired:
I ...
actually, re-running pipeline_from_decorator.py
a second time (and a third time) from the command line seems to have executed without that ValueError, so maybe that issue was a fluke.
Nevertheless, those runs exit prior to the line print('process completed')
and I would definitely prefer the command executing_pipeline
to not kill the process that called it.
For example, maybe, having started the pipeline I'd like my code to also report having started the pipeline to som...
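One workaround I'm considering (just a sketch, not verified against how clearml tears the process down) is to call the pipeline entry point from a child process so the parent survives:

```python
import multiprocessing as mp

def launch_pipeline():
    # call the @PipelineDecorator.pipeline entry point here, e.g.
    # executing_pipeline(...)  # with whatever arguments the pipeline takes
    pass

if __name__ == "__main__":
    proc = mp.Process(target=launch_pipeline)
    proc.start()
    # the calling process keeps running regardless of what the pipeline call
    # does to its own process, e.g. report that the pipeline was started
    print("pipeline launch dispatched, continuing with other work")
    proc.join()
```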
uploads are a bit slow though (~4 minutes for 50 MB)
did you mean that I was running in CPU mode? I tried both, but I'll try CPU mode with that base docker image
Is there any chance the experiment itself has a docker image specified?
It does not as far as I know. The decorators do not have docker fields specified
I'll give it a try.
And if I wanted to support GPU in the default queue, are you saying that I'd need a different machine from the n1-standard-1?
I'm connecting to the hosted clear.ml
packages in use are:
# Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
clearml == 1.6.2
fastai == 2.7.5
in case it matters, I'm running this code in a jupyter notebook within a docker container (to keep things well isolated). The /data
path is volume mapped to my local filesystem (and, in fact, already contains the dataset files, so the fastai call to untar_data should see the data there and return immediately)
That same make_data fu...
The pipeline eventually completed after ~20 minutes and the log shows it has downloaded a 755 MB file.
I can also download the zip file from the artifacts tab for the component now.
Why is the data being uploaded/downloaded? Can I prevent that?
I get that clearml likes to take good care of my data, but I must be doing something wrong here as it doesn't make sense for a dataset to be uploaded to files.clear.ml.
Note that if I change the component to return a regular meaningless string - "mock_path" - the pipeline completes rather quickly and the dataset is not uploaded.
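So one thing I may try (a rough sketch, untested; the bucket and file names are placeholders) is to have the component upload its outputs itself and return only the remote URI, so that only a lightweight string is passed between components:

```python
from clearml import StorageManager

def upload_run_outputs(run_model_path: str) -> str:
    # hypothetical helper used inside the training component: push the weights
    # to GCS ourselves and return just the remote URI instead of the local path
    remote_uri = StorageManager.upload_file(
        local_file=f"{run_model_path}/model.pth",       # placeholder file name
        remote_url="gs://my-bucket/models/model.pth",   # placeholder bucket/path
    )
    return remote_uri
```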
Thanks TimelyPenguin76 .
From your reply I understand that I have control over what the destination is but that all files generated in a task get transferred regardless of the return_values
decorator argument. Is that correct? Can I disable auto-save of artifacts?
Ideally, I'd like to have better control over what gets auto-saved. E.g. I'm happy for tensorboard events to be captured and shown in clearml and for matplotlib figures to be uploaded (perhaps to gcs) but I'd like to avoid ...
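Something granular like this is what I'm hoping exists (my guess at the dict form of auto_connect_frameworks, not verified):

```python
from clearml import Task

# keep tensorboard and matplotlib reporting, skip automatic model snapshots
task = Task.init(
    project_name="examples",               # placeholder
    task_name="granular auto-logging",     # placeholder
    auto_connect_frameworks={
        "tensorboard": True,
        "matplotlib": True,
        "pytorch": False,
        "fastai": False,
    },
)
```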
My local environment has clearml version 1.6.3rc0
and the agents in AWS were started with the AWS Autoscaler, which has no explicit place for Google credentials.
I see a place for Additional ClearML Configuration
in the AWS autoscaler UI which I suspect may help but I don't see how I can pass a secrets file along with my agent.
For anyone following, you can "inject" a credentials json file for a Google Cloud service account so as to get access to your Google Cloud Storage from agents on AWS EC2 instances that are managed by the AWS autoscaler by providing the following in the ADDITIONAL CLEARML CONFIGURATION
when starting the autoscaler:
```
sdk.google.storage.credentials_json: "/root/gs.cred"
sdk.google.storage.project: "<my-gcp-project-id>"
files {
    gsc {
        contents: """<copy-paste the contents of yo...
```
Trying to switch to a resource using GPU-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without a GPU, even though I checked that my settings (n1-standard-1 and nvidia-tesla-t4 and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make VM instances and my GCP autoscaler...
Thanks for the fix and the mock HPO example code !
Pipeline behaviour with the fix is looking good.
I see the point about changes to data inside the controller possibly causing dependencies for step 3 (or, at least, making it harder for the interpreter to know).
Hi TimelyPenguin76
Thanks for working on this. The clearml GCP autoscaler is a major feature for us to have. I can't really evaluate clearml without some means of instantiating multiple agents on GCP machines, and I'd really prefer not to have to set up a k8s cluster with agents and manage scaling it myself.
I tried the settings above with two resources, one for default queue and one for the services queue (making sure I use that image you suggested above for both).
The autoscaler started up...