And the remote workers are managed by the GCP auto-scaler app, so I presume those clearml-agents are up-to-date too
And running with a Python 3.10 interpreter
I'm considering doing a PR in a few days to add the param if it is not too complex
Well, we're having a network incident at HQ so this doesn't help... but I'll keep you updated with the tests I run tomorrow
Nice, thank you for the quick response ❤
Yup, I already set up my AWS configs for ClearML that way, but I needed generally accessible credentials too, so I used the init script option in this config menu ^^
Did you correctly assign a domain and certificate?
I checked the 'CPU-only' option in the auto-scaler config, but that seemed logical at the time
Talking about that decorator, which should also have a docker_args param since it is executed as an "orchestration component", but the param is missing: https://clear.ml/docs/latest/docs/references/sdk/automation_controller_pipelinecontroller/#pipelinedecoratorpipeline
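Something like this is what I mean; the component-level decorator does seem to expose docker / docker_args, so a possible workaround is to pass the arguments per component instead. Just a sketch, assuming a recent clearml release (project, image and argument values are placeholders):

```python
from clearml.automation.controller import PipelineDecorator

# Sketch: pass Docker arguments per component, since PipelineDecorator.component()
# exposes docker / docker_args even if the pipeline-level decorator does not.
@PipelineDecorator.component(
    return_values=["dataset_id"],
    docker="python:3.10",          # placeholder image
    docker_args="--shm-size=8g",   # placeholder docker argument
)
def preprocess(source_url: str):
    # ... fetch and prepare the data ...
    return "dummy-dataset-id"

@PipelineDecorator.pipeline(name="demo-pipeline", project="demo", version="0.0.1")
def pipeline_logic(source_url: str = "s3://bucket/raw"):
    return preprocess(source_url)

if __name__ == "__main__":
    # Run steps as local sub-processes for testing
    PipelineDecorator.run_locally()
    pipeline_logic()
```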
SuccessfulKoala55 I had already bumped boto3 to its latest version and all the files I added to the dataset were pickle binary files
Well, I think most of the time is taken by the setup of the venv, installing the packages defined in the imports of the pipeline component, which is normal. Some of those packages have a wheel that takes a long time to build, but most of them were already included in the Docker image I provided, and I get this message in my logs:
```
Python virtual environment cache is disabled. To accelerate spin-up time set: agent.venvs_cache.path=~/.clearml/venvs-cache
```
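For reference, a sketch of what enabling that cache might look like in the agent's clearml.conf, following the key from the log message (the extra knobs and their values are just illustrative, from what I recall of the stock config):

```
# clearml.conf on the agent machine (sketch, key taken from the log message above)
agent {
    venvs_cache: {
        # setting the path enables virtual environment caching
        path: ~/.clearml/venvs-cache
        # optional knobs, values here are illustrative only
        max_entries: 10
        free_space_threshold_gb: 2.0
    }
}
```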
Old tags are not deleted. When executing a Task (experiment) remotely, this method has no effect.
This description in the add_tags() doc intrigues me though; I would like to remove a tag from a dataset and add it to another version (e.g. a used_in_last_training tag), but this method seems to only add new tags.
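Something like this rough sketch is what I have in mind. It is not a verified recipe: names and versions are placeholders, and it assumes that Task.get_tags()/set_tags() are available and that a dataset's id matches its backing task's id in recent clearml versions:

```python
from clearml import Dataset, Task

TAG = "used_in_last_training"  # placeholder tag name

# Placeholders: project, name and versions are illustrative
old_ds = Dataset.get(dataset_project="my-project", dataset_name="my-dataset", dataset_version="1.0.0")
new_ds = Dataset.get(dataset_project="my-project", dataset_name="my-dataset", dataset_version="1.1.0")

# Remove the tag from the old version by rewriting the full tag list on the backing task
# (assumption: Dataset.id maps to the underlying Task id and set_tags() overwrites tags)
old_task = Task.get_task(task_id=old_ds.id)
old_task.set_tags([t for t in (old_task.get_tags() or []) if t != TAG])

# add_tags() is enough for the new version, since it only appends
new_ds.add_tags([TAG])
```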
AgitatedDove14 Here you go, I think it's inside the container since it's after the worker pulls the image
As specified in the initial message, the instance type used is e2-standard-4
This is funny cause the auto-scaler on GPU instances is working fine, but as the backtrace suggests it seems to be linked to this instance family
CostlyOstrich36 Should I start a new issue since I pinpointed the exact problem, given that the beginning of this one was clearly confusing for both of us?
Sure, but the same pattern can be achieved by explicitly using the PipelineController class and defining steps with .add_step() pointing to ClearML Task objects, right?
The decorators simply abstract away the controller, but both methods (decorators or controller/tasks) allow you to decouple your pipeline into steps, each having an independent compute target, right?
So basically choosing one method or the other is only a question of best practice or style?
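Something like this minimal sketch is what I mean by the explicit flavor, with each step cloning an existing Task and targeting its own queue (project, task and queue names are placeholders):

```python
from clearml import PipelineController

# Explicit controller flavor: steps reference existing Tasks and each one
# can be sent to its own execution queue.
pipe = PipelineController(name="demo-pipeline", project="demo", version="0.0.1")

pipe.add_step(
    name="preprocess",
    base_task_project="demo",
    base_task_name="preprocess task",
    execution_queue="cpu_queue",
)
pipe.add_step(
    name="train",
    parents=["preprocess"],
    base_task_project="demo",
    base_task_name="training task",
    execution_queue="gpu_queue",
)

# Run the controller logic locally; the steps are still enqueued to their own queues
pipe.start_locally()
```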
It's funny cause the line in the backtrace is the correct one, so I don't think it has anything to do with strange caching behavior
I suppose you cannot reproduce the issue on your side?
Maybe it has to do with the fact that the faulty code was initially defined as a cached component
Nope, same result after deleting .clearml
If you're using Helm it would be at the service level in your values.yml, not at the pod level
Looks like you need the https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving and https://clear.ml/docs/latest/docs/pipelines/pipelines features with a https://clear.ml/pricing/ plan on the SaaS deployment, so you can use the https://clear.ml/docs/latest/docs/webapp/applications/apps_gcp_autoscaler to manage the workers for you
You can set up a dummy step which is executed in parallel with your pre-processing step and which is set to run in your GPU queue; provided that your autoscaler doesn't scale down your compute before the pre-processing is complete, that should do the trick
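A rough sketch of that dummy warm-up idea, using add_function_step so the placeholder runs on the GPU queue in parallel with pre-processing (names, queues and the sleep duration are placeholders, not a definitive implementation):

```python
import time
from clearml import PipelineController

def gpu_warmup(minutes: int = 15):
    # Placeholder workload: keeps a GPU worker busy so the autoscaler
    # doesn't scale it down before the real training step is queued.
    time.sleep(minutes * 60)

def preprocess():
    # ... real CPU-bound pre-processing ...
    pass

def train():
    # ... real GPU training ...
    pass

pipe = PipelineController(name="warmup-demo", project="demo", version="0.0.1")
pipe.add_function_step(name="preprocess", function=preprocess, execution_queue="cpu_queue")
# No parent: runs in parallel with "preprocess", but on the GPU queue
pipe.add_function_step(name="gpu_warmup", function=gpu_warmup, execution_queue="gpu_queue")
pipe.add_function_step(name="train", function=train, parents=["preprocess"], execution_queue="gpu_queue")
pipe.start_locally()
```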
Hmm, it must be more arcane then; I guess official support would be able to provide an answer, they usually reply within 24 hours
The worker Docker image was running Python 3.8 and we are running on a PRO tier SaaS deployment; this failed run is from a few weeks ago and we have not run any pipeline since then
SmugDolphin23 But training.py already has a ClearML task created under the hood thanks to its ClearML integration; besides, isn't initing the task before executing the file, like in my snippet, sufficient?
```
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: (Reading database ... 5% ... 10% ... 15% ... 20% ... 25% ... 30% ... 35% ... 40% ... 45% ... 50% ... 55% ... 60% ...
```
Thus the main difference in behavior must be coming from the _debug_execute_step_function property in the Controller class; I'm currently skimming through it to try to identify a cause. Did I give you enough info btw, CostlyOstrich36?
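For context, a minimal sketch of how debug mode gets toggled with the decorator API; my understanding (not verified) is that PipelineDecorator.debug_pipeline() is what flips that internal flag, which is why behavior can differ from a real remote run (names are placeholders):

```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["x"])
def step():
    return 42

@PipelineDecorator.pipeline(name="debug-demo", project="demo", version="0.0.1")
def pipeline_logic():
    return step()

if __name__ == "__main__":
    # Runs every step as a plain function call in this process (debug mode),
    # instead of launching steps on remote workers.
    PipelineDecorator.debug_pipeline()
    pipeline_logic()
```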