
Are you executing your script using the right Python interpreter?
/venv/bin/python my_clearml_script.py
As opposed to the Controller/Task approach, where add_step() only allows executing the steps sequentially
Talking about that decorator, which should also have a docker_args param since it is executed as an "orchestration component", but the param is missing: https://clear.ml/docs/latest/docs/references/sdk/automation_controller_pipelinecontroller/#pipelinedecoratorpipeline
(if for instance I wanna pull a yolov5 repo in the retraining component)
In the meantime, is there some way to set a retention policy for the dataset versions?
Btw AgitatedDove14, is there a way to define parallel tasks and use the pipeline as an acyclic compute graph instead of simply sequential tasks?
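Something like this is the shape I have in mind, if add_step() accepts a parents argument to express the dependencies; a minimal sketch with made-up project/task/queue names:
```python
from clearml.automation import PipelineController

# made-up project/task names, just to illustrate the DAG shape
pipe = PipelineController(name="example-dag", project="examples", version="0.0.1")

# two steps with no parents, which I would expect to run in parallel
pipe.add_step(name="preprocess_a", base_task_project="examples", base_task_name="preprocess A")
pipe.add_step(name="preprocess_b", base_task_project="examples", base_task_name="preprocess B")

# this step declares both previous steps as parents,
# turning the pipeline into a small acyclic graph
pipe.add_step(
    name="train",
    parents=["preprocess_a", "preprocess_b"],
    base_task_project="examples",
    base_task_name="train model",
)

pipe.start(queue="default")
```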
I already deleted ~/.clearml/cache but I'll try deleting the entire folder
My bad, the specified file did not exist since I forgot to raise an exception if the export command failed >< Well I guess this is the reason, will test that on Monday
The default compression parameter value is ZIP_MINIMAL_COMPRESSION, I guess you could check if there is a tarball-only option, but anyway most of the CPU time taken by the upload process is the generation of the hashes of the file entries
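If compression itself ever becomes the bottleneck, I guess it can be traded off like this; a minimal sketch assuming upload() accepts a standard zipfile compression constant (project, dataset and path names are placeholders):
```python
import zipfile
from clearml import Dataset

ds = Dataset.create(dataset_project="examples", dataset_name="my_dataset")
ds.add_files("data/")
# ZIP_STORED archives files without compressing them, so (assuming the
# constant is accepted here) only the hashing and upload cost remains
ds.upload(compression=zipfile.ZIP_STORED)
ds.finalize()
```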
Sure, but the same pattern can be achieved by explicitly using the PipelineController class and defining steps with .add_step() pointing to ClearML Task objects, right?
The decorators simply abstract away the controller, but both methods (decorators or controller/tasks) allow you to decouple your pipeline into steps, each having an independent compute target, right?
So basically choosing one method or the other is only a question of best practice or style?
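e.g. I'd expect the decorator form below to be roughly equivalent to a controller with two add_step() calls; a minimal sketch with made-up names and queues:
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["dataset_id"], execution_queue="cpu_queue")
def make_dataset():
    # each component runs as its own Task, on its own compute target
    return "some-dataset-id"

@PipelineDecorator.component(return_values=["model_path"], execution_queue="gpu_queue")
def train(dataset_id):
    return f"models/{dataset_id}/best.pt"

@PipelineDecorator.pipeline(name="example-pipeline", project="examples", version="0.0.1")
def run_pipeline():
    dataset_id = make_dataset()
    train(dataset_id)

if __name__ == "__main__":
    run_pipeline()
```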
Hey SuccessfulKoala55, currently using the clearml package version 1.7.1 and my server is a PRO SaaS deployment
Okay I confirm having default parameters fixes that issue, but kinda sad to have lost 3 days to that super weird behavior
Okay, thanks for the pointer ❤
It doesn't seem so if you look at the REST API documentation, might be available as an Enterprise plan feature
Instead of having to:
- Fetch the latest dataset to get the current latest version
- Increment the version number
- Create and upload a new version of the dataset

I would like to be able to:
- Select a dataset project by name
- Create a new version of the dataset by choosing which SEMVER increment (major/minor/patch) to apply to the version number, and upload it

(the current manual workflow is sketched below)
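For context, the current workflow looks roughly like this; a minimal sketch with made-up project/dataset names, assuming the Dataset object exposes its SEMVER string (the attribute name here is a guess) and that Dataset.create accepts a dataset_version:
```python
from clearml import Dataset

# 1. fetch the latest dataset to learn the current version
latest = Dataset.get(dataset_project="VINZ", dataset_name="windows")

# 2. increment the version number by hand (patch bump here);
#    assumption: the version is exposed as a plain "X.Y.Z" string
major, minor, patch = (int(x) for x in latest.version.split("."))
new_version = f"{major}.{minor}.{patch + 1}"

# 3. create and upload the new version, parented to the previous one
new_ds = Dataset.create(
    dataset_project="VINZ",
    dataset_name="windows",
    dataset_version=new_version,
    parent_datasets=[latest.id],
)
new_ds.add_files("data/")
new_ds.upload()
new_ds.finalize()
```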
You correctly assigned a domain and certificate?
Okay thanks! Please keep me posted when the hotfix is out on the SaaS
Okay, looks like the call dependency resolver does not support cross-file calls and relies instead on the local repo cloning feature to handle multiple files, so Task.force_store_standalone_script() does not allow for a pipeline defined across multiple files (now that you think of it, it was kinda implied by the name), but what is interesting is that calling an auxiliary function in the SAME file from a component also raises a NameError: <function_name> is not defined, that's ki...
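For reference, the kind of workaround I have in mind for that NameError; a minimal sketch assuming PipelineDecorator.component accepts a helper_functions parameter that embeds the listed functions into the step's standalone script (all names made up):
```python
from clearml.automation.controller import PipelineDecorator

def normalize_dates(start_date, end_date):
    # plain module-level helper living in the SAME file as the component
    return str(start_date), str(end_date)

@PipelineDecorator.component(
    return_values=["window_dataset_id"],
    # assumption: helper_functions makes the helper available inside the
    # standalone script generated for this step, avoiding the NameError
    helper_functions=[normalize_dates],
)
def generate_dataset(start_date, end_date):
    start_date, end_date = normalize_dates(start_date, end_date)
    return f"dataset-{start_date}-{end_date}"
```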
Another crash on the same autoscaler instance:
`
2022-11-04 15:53:54
2022-11-04 14:53:50,393 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2022-11-04 14:53:51,092 - clearml.Auto-Scaler - INFO - 2415066998557416558 console log:
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd[1]: var-lib-docker-overlay2-b04bca4c99cf94c31a3644236d70727aaa417fa4122e1b6c012e0ad908af24ef\x2dinit-merged.mount: Deactivated successfully.
Nov 4 14:53:29 clearml-w...
Hey CostlyOstrich36 I got another occurrence of the autoscaler crash with a similar backtrace, any updates on this issue?
`
2022-11-04 11:46:55
2022-11-04 10:46:51,644 - clearml.Auto-Scaler - INFO - 5839398111025911016 console log:
Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: systemd-tmpfiles...
I can test it empirically, but I want to be sure what the expected behavior is so my pipeline doesn't get auto-magically broken after a patch
I had the same issue too on some of my components and I had to specify them explicitly in the packages=["package-1", "package-2", ...] parameter of my @PipelineDecorator.component() decorator, e.g. the sketch below
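A minimal sketch of what I mean, with placeholder package names (the packages argument is the only part that matters here):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    return_values=["dataset_id"],
    # explicitly list the packages the step needs, since automatic
    # requirements detection missed them in my case
    packages=["pandas>=1.5", "pyyaml"],
)
def build_dataset(source_path):
    import pandas as pd  # imported inside the component, since it runs as its own Task
    df = pd.read_csv(source_path)
    return "dataset-id"
```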
So it seems to be an issue with the component parameter called in:
`
@PipelineDecorator.pipeline(
    name="VINZ Auto-Retrain",
    project="VINZ",
    version="0.0.1",
    pipeline_execution_queue="Quad_VCPU_16GB"
)
def executing_pipeline(start_date, end_date):
    print("Starting VINZ Auto-Retrain pipeline...")
    print(f"Start date: {start_date}")
    print(f"End date: {end_date}")

    window_dataset_id = generate_dataset(start_date, end_date)


if __name__ == '__main__':
    PipelineDec...
Did you properly install Docker and the NVIDIA Docker toolkit? Here's the init script I'm using on my autoscaled workers:
#!/bin/sh
sudo apt-get update -y
sudo apt-get install -y \
    ca-certificates \
    curl \
    gnupg \
    lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
    "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) stable" | s...
Well I uploaded datasets in the previous steps with the same credentials
Thanks @<1523701435869433856:profile|SmugDolphin23>, tho are you sure I don't need to override the deserialization function even if I pass multiple distinct objects as a tuple?
Oh okay, my initial implementation was not far off:
`
task = Task.init(project_name='VINZ', task_name=f'VINZ Retraining {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')
task.set_progress(0)

print("Training model...")
os.system(train_cmd)
print("✔️ Model trained!")
task.set_progress(75)

print("Converting model to ONNX...")
os.system(f"python export.py --weights {os.path.join(training_data_path, 'runs', 'train', 'yolov5s6_results', 'weights', 'best.pt')} --img...
It seems like it, because it's impossible to access an IP directly over HTTPS without a domain name and certificate; it will solve this immediate problem at least