Okay, the `force_store_standalone_script()` works
Well, aside from the obvious removal of the `PipelineDecorator.run_locally()` line on both our sides, the decorator's arguments seem to be the same:
```python
@PipelineDecorator.component(
    return_values=['dataset_id'],
    cache=True,
    task_type=TaskTypes.data_processing,
    execution_queue='Quad_VCPU_16GB',
    repo=False,
)
```
And my pipeline controller:
```python
@PipelineDecorator.pipeline(
    name="VINZ Auto-Retrain",
    project="VINZ",
    version="0.0.1",
    pipeline_execution_queue="Quad_V...
```
(if for instance I wanna pull a yolov5 repo in the retraining component)
And I can then override it by specifying a `repo` on one of the components?
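A minimal sketch of what that per-component override might look like (the yolov5 URL, branch, and function body here are assumptions for illustration, not tested config):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(
    return_values=['weights_path'],
    execution_queue='Quad_VCPU_16GB',
    # Clone this repo on the worker instead of using the standalone script
    repo='https://github.com/ultralytics/yolov5',
    repo_branch='master',
)
def retrain(dataset_id: str) -> str:
    # Placeholder body: call the repo's training entry point here
    ...
```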
Ah thank you, I'll try that ASAP
The expected behavior is that the task would capture the iteration scalars of the PyTorch Lightning trainer, but nothing is recorded
```python
from clearml import Task  # Task must be imported for Task.init below
from darts.models import TFTModel

model = TFTModel(
    input_chunk_length=28,
    output_chunk_length=14,
    n_epochs=300,
    batch_size=4096,
    add_relative_index=True,
    num_attention_heads=4,
    dropout=0.3,
    full_attention=True,
    save_checkpoints=True,
)
task = Task.init(
    project_name='sales-prediction',
    task_name='TFT Training 2'...
```
Sorry, I meant the scalar logging doesn't collect anything like it would during a vanilla PyTorch Lightning training. Here is the repo of the lib: https://github.com/unit8co/darts
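If the automatic capture keeps coming up empty, one possible stopgap (a sketch, assuming you can hook into the training loop; `losses` is a placeholder for whatever metric stream you have at hand) is to report the scalars manually:
```python
from clearml import Task

task = Task.init(project_name='sales-prediction', task_name='TFT Training 2')
logger = task.get_logger()

losses = [0.9, 0.7, 0.5]  # placeholder: per-iteration metric values you collect
for iteration, loss in enumerate(losses):
    # Explicitly report one scalar point per training iteration
    logger.report_scalar(title='train', series='loss', value=loss, iteration=iteration)
```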
Make sure you have entered the command `usermod -aG docker $USER`
on the VM you are running your agent on.
It looks like the user running your ClearML agent is not added to the `docker` group.
Why not define your pipeline using `PipelineDecorator`
instead? Then you'll be able to call each of your pipeline components in a very Pythonic way.
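Something like this minimal sketch (the project, queue, and function names are placeholders, not your actual pipeline):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=['data'], cache=True)
def prepare(n: int) -> list:
    return list(range(n))

@PipelineDecorator.component(return_values=['total'])
def aggregate(data: list) -> int:
    return sum(data)

@PipelineDecorator.pipeline(name='demo-pipeline', project='demo', version='0.0.1')
def run_pipeline(n: int = 10):
    # Components are called like plain Python functions;
    # the controller turns each call into a pipeline step
    data = prepare(n)
    print(aggregate(data))

if __name__ == '__main__':
    PipelineDecorator.run_locally()  # drop this to enqueue for remote execution
    run_pipeline(n=100)
```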
Nice, that's a great feature! I'm also trying to have a component executing Giskard QA test suites on models and data. Is there a planned feature where I can suspend execution of the pipeline, and display on the UI that this pipeline step requires a human confirmation to go on or stop, while displaying arbitrary text/plot information?
Sure, but the same pattern can be achieved by explicitly using the `PipelineController` class and defining steps with `.add_step()` pointing to ClearML `Task` objects, right? (see the sketch below)
The decorators simply abstract away the controller, but both methods (decorators or controller/tasks) allow you to decouple your pipelines into steps, each having an independent compute target, right?
So basically choosing one method or the other is only a question of best practice or style?
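(For reference, a minimal sketch of that explicit controller/steps pattern; the project, task, and queue names are placeholders:)
```python
from clearml import PipelineController

pipe = PipelineController(name='demo-pipeline', project='demo', version='0.0.1')
pipe.set_default_execution_queue('default')

# Each step points at an existing ClearML Task used as a template
pipe.add_step(name='prepare', base_task_project='demo', base_task_name='prepare-data')
pipe.add_step(
    name='train',
    parents=['prepare'],  # runs after 'prepare' completes
    base_task_project='demo',
    base_task_name='train-model',
)
pipe.start()
```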
Ooooo okay, I see: with the `@PipelineDecorator.pipeline` decorator you can have a function to orchestrate your components and manipulate their return data.
Btw AgitatedDove14, is there a way to define parallel tasks and use the pipeline as an acyclic compute graph instead of simply sequential tasks?
As opposed to the Controller/Task approach, where `add_step()` only allows executing them sequentially?
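(Worth noting: `add_step()` does accept a `parents` argument, so steps with no dependency between them can run in parallel. A sketch of a diamond-shaped graph, with placeholder task names:)
```python
from clearml import PipelineController

pipe = PipelineController(name='dag-demo', project='demo', version='0.0.1')
pipe.set_default_execution_queue('default')

pipe.add_step(name='ingest', base_task_project='demo', base_task_name='ingest')
# 'features' and 'labels' both depend only on 'ingest', so they may run in parallel
pipe.add_step(name='features', parents=['ingest'], base_task_project='demo', base_task_name='features')
pipe.add_step(name='labels', parents=['ingest'], base_task_project='demo', base_task_name='labels')
# 'train' waits for both branches to finish
pipe.add_step(name='train', parents=['features', 'labels'], base_task_project='demo', base_task_name='train')
pipe.start()
```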
I would try not to run it locally but in your execution queues on a remote worker; if that's not it, it is likely a bug.
That makes sense, since this function executes your components as classic Pythonic functions.
The default `compression` parameter value is `ZIP_MINIMAL_COMPRESSION`. I guess you could try to check if there is a tarball-only option, but anyway most of the CPU time taken by the upload process is the generation of the hashes of the file entries.
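(If you want to trade archive size for CPU, a sketch of passing a different compression to the upload; `zipfile.ZIP_STORED` is the standard-library constant for no compression, and the dataset names and path are placeholders:)
```python
import zipfile
from clearml import Dataset

dataset = Dataset.create(dataset_project='demo', dataset_name='my-dataset')
dataset.add_files(path='./data')
# Store entries uncompressed; hashing the file entries still dominates CPU time
dataset.upload(compression=zipfile.ZIP_STORED)
dataset.finalize()
```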
Looks like you need the ClearML Serving (https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving) and Pipelines (https://clear.ml/docs/latest/docs/pipelines/pipelines) features with a suitable plan (https://clear.ml/pricing/) in a SaaS deployment, so you can use the GCP Autoscaler application (https://clear.ml/docs/latest/docs/webapp/applications/apps_gcp_autoscaler) to manage the workers for you.
Hey Mathias,
The project SDK is pretty barebones, and according to the docs you should use the REST API for further actions. The simplest approach would be to simply use the project id with the `POST /projects.get_by_id` endpoint.
Best regards,
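(A sketch of hitting that endpoint through the Python `APIClient` wrapper; the project id is a placeholder:)
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
# '<project-id>' is a placeholder; pass the id you already have
project = client.projects.get_by_id(project='<project-id>')
print(project)  # inspect the returned project entity
```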
The pip package `clearml-agent` is version 1.3.0.
Okay, I confirm having default parameters fixes that issue, but it's kinda sad to have lost 3 days on that super weird behavior.
Most of the time is taken by building wheels for `numpy` and `pandas`, which are apparently deps of `clearml-agent`, if I read the log correctly.
Well, we're having a network incident at HQ, so this doesn't help... but I'll keep you updated with the tests I run tomorrow.
Another crash on the same autoscaler instance:
```
2022-11-04 15:53:54
2022-11-04 14:53:50,393 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2022-11-04 14:53:51,092 - clearml.Auto-Scaler - INFO - 2415066998557416558 console log:
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd[1]: var-lib-docker-overlay2-b04bca4c99cf94c31a3644236d70727aaa417fa4122e1b6c012e0ad908af24ef\x2dinit-merged.mount: Deactivated successfully.
Nov 4 14:53:29 clearml-w...
```
Hey CostlyOstrich36, I got another occurrence of an autoscaler crash with a similar backtrace, any updates on this issue?
```
2022-11-04 11:46:55
2022-11-04 10:46:51,644 - clearml.Auto-Scaler - INFO - 5839398111025911016 console log:
Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: systemd-tmpfiles...
```
This is an instance that I launched like last week and it was running fine until now; the version is v1.6.0-335.