And the config is:
` {
  "gcp_project_id": "XXXX",
  "gcp_zone": "europe-west1-b",
  "gcp_credentials": "XXXX",
  "git_user": "XXXX",
  "git_pass": "XXXXX",
  "default_docker_image": "XXXX",
  "instance_queue_list": [
    {
      "resource_name": "gcp4cpu",
      "machine_type": "c2-standard-4",
      "cpu_only": true,
      "gpu_type": "",
      "gpu_count": 1,
      "preemptible": false,
      "regular_instance_rollback": false,
      "regular_instance_rollback_timeout": 10,
      "spot_ins...
I have a pipeline with a single component:
` from clearml import TaskTypes
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
return_values=['dataset_id'],
cache=True,
task_type=TaskTypes.data_processing,
execution_queue='Quad_VCPU_16GB'
)
def generate_dataset(start_date: str, end_date: str, input_aws_credentials_profile: str = 'default'):
"""
Convert autocut logs from a specified time window into a usable dataset in a generic format.
"""
print('[STEP 1/4] Generating dataset from autocut logs...')
import os
...
Okay, thanks for the pointer ❤
Thanks SmugDolphin23, though are you sure I don't need to override the deserialization function even if I pass multiple distinct objects as a tuple?
Ah, apparently the reason was that the `squash()` method defaults its output URL to the `file_server` instead of the project's default storage string; it might be nice to do the storage-validity checks before spawning sub-processes.
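For reference, a minimal sketch (the dataset IDs and the bucket are placeholders) of passing `output_url` explicitly to `Dataset.squash()` so the result lands on the project's storage instead of the file_server:
```python
from clearml import Dataset

# Minimal sketch -- the dataset IDs and the bucket below are placeholders.
# Passing output_url explicitly uploads the squashed dataset to the chosen
# storage instead of the default file_server.
squashed = Dataset.squash(
    dataset_name="my-squashed-dataset",
    dataset_ids=["<dataset_id_1>", "<dataset_id_2>"],
    output_url="gs://my-bucket/clearml-datasets",
)
print(squashed.id)
```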
If you're referring to https://www.nvidia.com/en-us/technologies/multi-instance-gpu/ , I heard it's only supported in the Enterprise edition; since this tech is only available on the A100 GPUs, they most likely assumed that if you were rich enough to have one you would not mind buying the Enterprise edition.
And there were still 2 instances running from the last pipeline run
Another crash on the same autoscaler instance:
`
2022-11-04 15:53:54
2022-11-04 14:53:50,393 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2022-11-04 14:53:51,092 - clearml.Auto-Scaler - INFO - 2415066998557416558 console log:
Nov 4 14:53:29 clearml-worker-9357f6985dcc4f3c9d44b32a9ac2e09b systemd[1]: var-lib-docker-overlay2-b04bca4c99cf94c31a3644236d70727aaa417fa4122e1b6c012e0ad908af24ef\x2dinit-merged.mount: Deactivated successfully.
Nov 4 14:53:29 clearml-w...
Thus the main difference in behavior must be coming from the `_debug_execute_step_function` property in the `Controller` class; I'm currently skimming through it to try to identify a cause. Did I provide you enough info, btw, CostlyOstrich36?
There is a gap in the GPU offering on GCP: there is no modern middle ground for a GPU with more than 16GB and less than 40GB of GRAM, so sometimes we need to provision an A100 to get the training speed we want even though we don't use all the GRAM. So I figured that if we could batch 2 training tasks on the same A100 instance we would still be on the winning side in terms of CUDA cores and get the most out of the GPU time we're paying for.
This is funny because the autoscaler on GPU instances is working fine, but as the backtrace suggests, it seems to be linked to this instance family
It's funny because the line in the backtrace is the correct one, so I don't think it has anything to do with strange caching behavior
Yup, so if I understand correctly this is strictly an Enterprise feature and is not planned to be available in the Pro version?
Hey CostlyOstrich36, I got another occurrence of an autoscaler crash with a similar backtrace, any updates on this issue?
`
2022-11-04 11:46:55
2022-11-04 10:46:51,644 - clearml.Auto-Scaler - INFO - 5839398111025911016 console log:
Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: Starting Cleanup of Temporary Directories...
Nov 4 10:46:46 clearml-worker-deb01e0837bb4b00865e4e72c90586c4 systemd[1]: systemd-tmpfiles...
And additionally, does the "When executing a Task (experiment) remotely, this method has no effect" part mean that if it is executed on a remote worker inside a pipeline, without the dataset downloaded, the method will have no effect?
So basically CostlyOstrich36, I feel like `debug_pipeline()` uses the latest version of my code as it is defined on my filesystem, but `run_locally()` used a previous version it cached somehow.
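For context, a minimal sketch (pipeline and step names are made up) of how the two local modes get selected; as far as I understand, `debug_pipeline()` runs every step as a plain function call in the current process, while `run_locally()` keeps the controller local but launches each step in its own subprocess:
```python
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(return_values=["value"], cache=True)
def step_one():
    return 42


@PipelineDecorator.pipeline(name="debug-demo", project="examples", version="0.0.1")
def my_pipeline():
    value = step_one()
    print("step returned:", value)


if __name__ == "__main__":
    # Everything runs as plain function calls inside this process.
    PipelineDecorator.debug_pipeline()
    # Alternative: keep the controller local but run each step in a subprocess.
    # PipelineDecorator.run_locally()
    my_pipeline()
```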
You need to set a specific port for your ClearML server and then just set a rule in your reverse proxy (e.g. nginx) to point the specific route you want toward that port
That might be an issue with clearml itself failing to serve proper resources if you change the path; that kind of path modification can be a hassle. If you have a domain name available, I would suggest pointing a subdomain of it at the IP of your ClearML machine and just adding a site-enabled config in nginx for it, rather than doing a proxy pass
Talking about that decorator, which should also have a docker_args param since it is executed as an "orchestration component", but the param is missing: https://clear.ml/docs/latest/docs/references/sdk/automation_controller_pipelinecontroller/#pipelinedecoratorpipeline
Yes but not in the controller itself, which is also remotely executed in a docker container
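For reference, a rough sketch of where the docker arguments can be passed, as far as I can tell (image and args below are placeholders): the component-level decorator accepts `docker` / `docker_args`, while the pipeline-level decorator, at least in the version discussed here, does not:
```python
from clearml.automation.controller import PipelineDecorator


# Component level: image and extra docker args can be set per step.
@PipelineDecorator.component(
    return_values=["result"],
    execution_queue="Quad_VCPU_16GB",
    docker="nvidia/cuda:11.7.1-runtime-ubuntu22.04",  # placeholder image
    docker_args="--env MY_VAR=value",                 # placeholder args
)
def heavy_step():
    return "done"


# Controller level: the @PipelineDecorator.pipeline decorator has no
# equivalent docker_args parameter here, even though the controller also
# runs remotely inside a docker container.
@PipelineDecorator.pipeline(name="docker-args-demo", project="examples", version="0.0.1")
def my_pipeline():
    print(heavy_step())
```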
Well, I simply duplicated code across my components instead of centralizing the operations that needed that env variable in the controller
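A minimal sketch of what that duplication looks like (the env variable name is hypothetical): each component body sets the variable itself, since that code runs on the remote worker and the controller can't inject it:
```python
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(return_values=["rows"])
def load_logs():
    # Duplicated in every component body, because this code runs on the
    # remote worker and the controller cannot set it for us.
    import os
    os.environ["AWS_PROFILE"] = "default"  # hypothetical variable
    return 123


@PipelineDecorator.component(return_values=["ok"])
def upload_results(rows):
    import os
    os.environ["AWS_PROFILE"] = "default"  # same snippet, duplicated again
    return rows > 0
```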
I'm considering doing a PR in a few days to add the param if it is not too complex
Looks like you need the clearml-serving ( https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving ) and pipelines ( https://clear.ml/docs/latest/docs/pipelines/pipelines ) features with a paid plan ( https://clear.ml/pricing/ ) in a SaaS deployment, so you can use the GCP Autoscaler app ( https://clear.ml/docs/latest/docs/webapp/applications/apps_gcp_autoscaler ) to manage the workers for you
Did you properly install Docker and the NVIDIA Docker toolkit? Here's the init script I'm using on my autoscaled workers:
#!/bin/sh
sudo apt-get update -y
sudo apt-get install -y \
ca-certificates \
curl \
gnupg \
lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg]
\
$(lsb_release -cs) stable" | s...
I would try not running it locally but in your execution queues on a remote worker; if that's not it, it is likely a bug
The new `1.7.2` is still in release candidates, so nothing new since 20 days ago
BoredHedgehog47 here, enlighten yourself: https://github.com/allegroai/clearml/releases