Fix confirmed on our side CostlyOstrich36 thanks for everything!
Old tags are not deleted. When executing a Task (experiment) remotely, this method has no effect.
This description in the add_tags() doc intrigues me though: I would like to remove a tag from a dataset and add it to another version (e.g. a used_in_last_training tag), but this method seems to only add new tags.
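The workaround I'm considering in the meantime (just a sketch, assuming the dataset id is also the id of its backing task and that Task.set_tags() overwrites the whole tag list rather than appending):
`
from clearml import Dataset, Task

# sketch only -- assumes the dataset id maps to its backing task id,
# and that Task.set_tags() replaces the tag list instead of appending
old_version = Dataset.get(dataset_id='<old_version_id>')   # hypothetical ids
new_version = Dataset.get(dataset_id='<new_version_id>')

task = Task.get_task(task_id=old_version.id)
task.set_tags([t for t in (task.get_tags() or []) if t != 'used_in_last_training'])

# then attach the tag to the newer version
new_version.add_tags(['used_in_last_training'])
`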
You need to set a specific port for your clearml server and then just set a rule in your reverse proxy (e.g. nginx) to point the specific route you want toward that port
I have a pipeline with a single component:
`
@PipelineDecorator.component(
    return_values=['dataset_id'],
    cache=True,
    task_type=TaskTypes.data_processing,
    execution_queue='Quad_VCPU_16GB'
)
def generate_dataset(start_date: str, end_date: str, input_aws_credentials_profile: str = 'default'):
    """
    Convert autocut logs from a specified time window into usable dataset in generic format.
    """
    print('[STEP 1/4] Generating dataset from autocut logs...')
    import os
    ...
Okay, thanks for the pointer ❤
That might be an issue with clearml itself failing to send proper resources if you change the path; that kind of path modification can be a hassle. If you have a domain name available I would suggest making a subdomain of it point to the IP of your clearml machine and adding a sites-enabled config on nginx pointing to it, rather than doing a proxy pass
Turns out the bucket param was expecting the bucket name without the s3:// protocol prefix, but now that this issue is fixed I still get the same "incorrect region specified" error:
`
task = Task.init(
    project_name='XXXX',
    task_name=f'Training-{training_uuid}',
    task_type=Task.TaskTypes.training,
    output_uri=f's3://{constants.CLEARML_BUCKET}'
)
task.setup_aws_upload(
    bucket=constants.CLEARML_BUCKET,
    regi...
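For completeness, the full call currently looks roughly like this on my side and still raises the region error (sketch; the key/secret constants are placeholders from my own config module, and the region value is just an example):
`
# sketch -- bucket is the bare bucket name (no 's3://'), region set explicitly;
# the key/secret constants are hypothetical placeholders
task.setup_aws_upload(
    bucket=constants.CLEARML_BUCKET,          # e.g. 'my-clearml-bucket', not 's3://my-clearml-bucket'
    region='eu-west-1',                       # placeholder -- must match the bucket's actual region
    key=constants.AWS_ACCESS_KEY_ID,          # hypothetical constant
    secret=constants.AWS_SECRET_ACCESS_KEY,   # hypothetical constant
)
`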
Hey, I'm a SaaS user on the PRO tier and I was wondering if it was a feature available on the auto-scaler apps, so I could improve the cost-efficiency of my provisioned GCP A100 instances
Oh wow, would definitely try it out if there were an Autoscaler App integrating it with ClearML
I mean, if deleting tags on finalized datasets is possible in the GUI, it should be possible in the SDK too, but I don't see the method
And the config is:
`
{
    "gcp_project_id": "XXXX",
    "gcp_zone": "europe-west1-b",
    "gcp_credentials": "XXXX",
    "git_user": "XXXX",
    "git_pass": "XXXXX",
    "default_docker_image": "XXXX",
    "instance_queue_list": [
        {
            "resource_name": "gcp4cpu",
            "machine_type": "c2-standard-4",
            "cpu_only": true,
            "gpu_type": "",
            "gpu_count": 1,
            "preemptible": false,
            "regular_instance_rollback": false,
            "regular_instance_rollback_timeout": 10,
            "spot_ins...
The simplest would be to have your reverse proxy (e.g. nginx) on your GCP VM directly and redirect requests for that domain toward the clearml-server container imho
Yup, if you want to access it through https you're required to have a domain pointing to that IP with a certificate in place (using letsencrypt for instance) or else you'll get some SSL error
Did you correctly assign a domain and certificate?
I had the same issues too on some of my components and I had to specify them in the packages=["package-1", "package-2", ...] parameter of my @PipelineDecorator.component() decorator
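In your case it would look something like this (sketch; the package names are placeholders for whatever the component actually imports):
`
from clearml import TaskTypes
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    return_values=['dataset_id'],
    cache=True,
    task_type=TaskTypes.data_processing,
    execution_queue='Quad_VCPU_16GB',
    packages=["pandas", "boto3"],  # placeholders -- list whatever the component really needs
)
def generate_dataset(start_date: str, end_date: str):
    # importing inside the function so the remote worker's venv resolves it too
    import pandas as pd
    ...
`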
`
Oct 24 12:12:51 clearml-worker-446f930fe7ce4aabb597c73b3d98c837 google_metadata_script_runner[1473]: startup-script: (Reading database ... #015(Reading database ... 5%#015(Reading database ... 10%#015(Reading database ... 15%#015(Reading database ... 20%#015(Reading database ... 25%#015(Reading database ... 30%#015(Reading database ... 35%#015(Reading database ... 40%#015(Reading database ... 45%#015(Reading database ... 50%#015(Reading database ... 55%#015(Reading database ... 60%#015(Rea...
Hey CostlyOstrich36 I got another occurrence of the autoscaler crash with a similar backtrace, any updates on this issue?
`
2022-11-04 11:46:55
2022-11-04 10:46:51,644 - clearml.Auto-Scaler - INFO - 5839398111025911016 console log:
Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: systemd-tmpfiles...
Thus the main difference in behavior must be coming from the _debug_execute_step_function property in the Controller class; currently skimming through it to try to identify a cause. Did I provide you enough info btw CostlyOstrich36?
So basically CostlyOstrich36 I feel like debug_pipeline() uses the latest version of my code as it is defined on my filesystem, but run_locally() used a previous version it cached somehow
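To be explicit, the bottom of my script toggles between the two modes like this (sketch; the dates are placeholders):
`
if __name__ == '__main__':
    # full debug mode: every component runs as a plain function in this process
    PipelineDecorator.debug_pipeline()
    # local mode (the one that seems to pick up a stale version of the code):
    # PipelineDecorator.run_locally()
    executing_pipeline(start_date='2022-01-01', end_date='2022-02-01')  # placeholder dates
`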
I already deleted ~/.clearml/cache but I'll try deleting the entire folder
I suppose you cannot reproduce the issue from your side?
Maybe it has to do with the fact that the faulty code was initially defined as a cached component
It's funny cause the line in the backtrace is the correct one, so I don't think it has anything to do with strange caching behavior
So it seems to be an issue with the component parameters passed in:
`
@PipelineDecorator.pipeline(
    name="VINZ Auto-Retrain",
    project="VINZ",
    version="0.0.1",
    pipeline_execution_queue="Quad_VCPU_16GB"
)
def executing_pipeline(start_date, end_date):
    print("Starting VINZ Auto-Retrain pipeline...")
    print(f"Start date: {start_date}")
    print(f"End date: {end_date}")
    window_dataset_id = generate_dataset(start_date, end_date)

if __name__ == '__main__':
    PipelineDec...
CostlyOstrich36 Having the same issue running on a remote worker, even though the line works correctly in the python interpreter and the component runs correctly in local debug mode (but not standard local mode):
` File "/root/.clearml/venvs-builds/3.10/code/generate_dataset.py", line 18, in generate_dataset
time_range = pd.date_range(start=start_date, end=end_date, freq='D').to_pydatetime().tolist()
File "/root/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/pandas/core/indexes/date...
CostlyOstrich36 Should I start a new issue, since I pinpointed the exact problem and the beginning of this one was clearly confusing for both of us?
Okay thanks! Please keep me posted when the hotfix is out on the SaaS
I'm referring to https://clearml.slack.com/archives/CTK20V944/p1668070109678489?thread_ts=1667555788.111289&cid=CTK20V944 mapping the project to a ClearML project, and to https://github.com/ultralytics/yolov5/tree/master/utils/loggers/clearml which, when calling train.py from my machine, successfully logged the training on ClearML and uploaded the artifact correctly
The train.py is the default YOLOv5 training file; I initiated the task outside the call. Should I go edit their training command-line file?
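For context, this is roughly what I'm doing (a sketch; assuming yolov5's train.run() entry point, and the arguments are placeholders):
`
from clearml import Task
import train  # yolov5's train.py, imported from the repo root

# the task is created here, outside yolov5, before its own ClearML logger kicks in
task = Task.init(project_name='XXXX', task_name='YOLOv5 retrain')

# placeholder arguments mirroring the usual train.py CLI flags
train.run(data='dataset.yaml', weights='yolov5s.pt', epochs=10, imgsz=640)
`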
The worker docker image was running on Python 3.8 and we are running on a PRO tier SaaS deployment; this failed run is from a few weeks ago and we have not run any pipeline since then