If you're referring to https://www.nvidia.com/en-us/technologies/multi-instance-gpu/ , I heard it was only supported by the Enterprise edition. Since this tech is only available for the A100 GPUs, they most likely assumed that if you were rich enough to have one you would not mind buying the enterprise edition
Hey, I'm a SaaS user on the PRO tier and I was wondering if it was a feature available on the auto-scaler apps, so I could improve the cost-efficiency of my provisioned GCP A100 instances
Well I simply duplicated code across my components instead of centralizing the operations that needed that env variable in the controller
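For context, a rough sketch of what centralizing it in the controller could have looked like instead (the env variable and function names here are made up):
```python
import os

def load_data(bucket: str):
    # components take the value as a plain argument ...
    return f"s3://{bucket}/raw"

def train(bucket: str):
    return f"s3://{bucket}/models"

def controller():
    # ... so the env variable is read in exactly one place, in the controller
    bucket = os.environ.get("DATA_BUCKET", "default-bucket")
    load_data(bucket)
    train(bucket)
```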
Okay! Though I only see a param to specify a weights URL, while I'm looking to upload local weights
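In the meantime I guess I could do it from the SDK side, something along these lines (project/task names and the file path are placeholders, and I'm not sure it maps to that app param):
```python
from clearml import Task, OutputModel

task = Task.init(project_name="VINZ", task_name="register-local-weights")
# register a local checkpoint as an output model of the task; ClearML uploads it
# to the configured files server / storage so it can be referenced later
model = OutputModel(task=task, framework="PyTorch")
model.update_weights(weights_filename="weights/best.pt")
```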
Ooooh okay, I see: with the @PipelineDecorator.pipeline decorator you can have a function that orchestrates your components and manipulates their return data, as opposed to the Controller/Task approach where add_step() only allows executing them sequentially
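So something along these lines should work (names and the toy logic are just placeholders):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["train_ids", "val_ids"], cache=True)
def split_dataset(dataset_id: str):
    # toy split standing in for real dataset handling
    ids = list(range(10))
    return ids[:8], ids[8:]

@PipelineDecorator.component(return_values=["metric"])
def train(train_ids):
    return len(train_ids)  # placeholder for an actual training step

@PipelineDecorator.pipeline(name="demo-pipeline", project="demo", version="0.0.1")
def orchestrate(dataset_id: str = "some_dataset_id"):
    # the pipeline function can unpack and manipulate component return values,
    # instead of only chaining tasks with add_step()
    train_ids, val_ids = split_dataset(dataset_id)
    metric = train(train_ids)
    print(metric, len(val_ids))

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    orchestrate()
```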
Thanks a lot @<1523701435869433856:profile|SmugDolphin23> ❤
Thanks @<1523701435869433856:profile|SmugDolphin23> , though are you sure I don't need to override the deserialization function even if I pass multiple distinct objects as a tuple?
looks like the user running your ClearML agent is not added to the docker group
The worker docker image was running on Python 3.8 and we are running on a PRO tier SaaS deployment; this failed run is from a few weeks ago and we have not run any pipeline since then
The train.py is the default YOLOv5 training file; I initiated the task outside the call. Should I go edit their training command-line file?
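Something along these lines is what I mean (project/task names and hyperparameters are placeholders, assuming YOLOv5's train.py is importable from the repo root):
```python
from clearml import Task
import train  # YOLOv5's stock train.py

# init the task before train.py's built-in ClearML integration runs,
# so the training run should attach to this task rather than creating a new one
task = Task.init(project_name="VINZ", task_name="yolov5-retrain")

# programmatic entry point exposed by YOLOv5's train.py
train.run(data="data.yaml", imgsz=640, epochs=50, weights="yolov5s.pt")
```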
Well aside from the obvious removal of the PipelineDecorator.run_locally() line on both our sides, the decorator's arguments seem to be the same:
@PipelineDecorator.component( return_values=['dataset_id'], cache=True, task_type=TaskTypes.data_processing, execution_queue='Quad_VCPU_16GB', repo=False )
And my pipeline controller:
@PipelineDecorator.pipeline(
    name="VINZ Auto-Retrain",
    project="VINZ",
    version="0.0.1",
    pipeline_execution_queue="Quad_V...
Ah thank you I'll try that ASAP
If you're using Helm it would be at the service level in your values.yml, not at the pod level
If you feel you have a specific enough issue you can also open a GitHub issue and link this thread to it
Can reproduce on Pro SaaS deployment on Firefox 105.0.3
That might be an issue with ClearML itself failing to serve the proper resources if you change the path; that kind of path modification can be a hassle. If you have a domain name available, I would suggest pointing a subdomain of it to the IP of your ClearML machine and adding a site-enabled config in nginx for it, rather than doing a proxy pass
SmugDolphin23 But the training.py already has a ClearML task created under the hood through its ClearML integration; besides, isn't initing the task before the execution of the file, like in my snippet, sufficient?
And by extension, is there a way to upsert a dataset by automatically creating an entry with an incremented version, or creating it if it does not exist? Or am I forced to do a get, check whether the latest version is finalized, then increment that version and create my new version?
AgitatedDove14 I have annotation logs from the end-user that I fetch periodically; I process them and I want to add them as a new version of my dataset, where each version corresponds to the data collected during a precise time window. Currently I'm doing it by fetching the latest dataset, incrementing the version and creating a new dataset version
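Roughly, my current get-then-create flow looks like this (project/dataset names and the local folder are placeholders):
```python
from clearml import Dataset

project, name = "VINZ", "annotation-logs"

# fetch the latest finalized version, if any, to chain the new one to it
try:
    parent = Dataset.get(dataset_project=project, dataset_name=name, only_completed=True)
    parents = [parent.id]
except ValueError:
    parents = []  # first version, nothing to chain from

new_ds = Dataset.create(dataset_project=project, dataset_name=name, parent_datasets=parents)
new_ds.add_files("incoming/annotations/")
new_ds.upload()
new_ds.finalize()
```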
The pipeline log indicates the same version of Pandas (1.5.0) is installed, I really don't know what is happening
Nice, thank you for the reactivity ❤
Would gladly try to run it on a remote instance to verify the thesis that some local cache is acting up, but unfortunately I also ran into an issue with the GCP autoscaler https://clearml.slack.com/archives/CTK20V944/p1665664690293529
There is a gap in the GPU offering on GCP: there is no modern middle ground with more than 16GB and less than 40GB of GPU RAM. So sometimes we need to provision an A100 to get the training speed we want, but we don't use all the RAM, and I figured that if we could batch 2 training tasks on the same A100 instance we would still be on the winning side in terms of CUDA cores and get the most out of the GPU time we're paying for.
I doubt there is a direct way to do it since they are stored as archive chunks 😕
I already deleted ~/.clearml/cache but I'll try deleting the entire folder
CostlyOstrich36 Should I start a new issue, since I pinpointed the exact problem and the beginning of this one was clearly confusing for both of us?
Oh wow, would definitely try it out if there were an Autoscaler App integrating it with ClearML