We’re using karpenter (more magic keywords for me), so my understanding is that it will manage the scaling part.
Much much appreciated 🙏
What do you mean? 😄 Using logging.config.dictConfig(...)
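For reference, this is the kind of thing I mean (a minimal sketch; the formatter/handler names and levels are just an example):
import logging.config

logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"}},
    "handlers": {"console": {"class": "logging.StreamHandler", "formatter": "plain"}},
    "root": {"handlers": ["console"], "level": "INFO"},
})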
I'll try it out, but I would not like to rewrite that code myself and maintain it, that's my point 😅
Or are you suggesting I use Task.import_offline_session?
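If so, I guess it would look roughly like this (a sketch, assuming it just takes the path to the offline session zip; the path is a placeholder):
from clearml import Task

# import a previously recorded offline session so it shows up on the server as a regular task
Task.import_offline_session("/path/to/offline_session.zip")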
I'd like to set up both with and without GPUs. I can use any region, preferably some EU one.
CostlyOstrich36 That looks promising, but I don't see any documentation on the returned schema (i.e. workers.worker_stats is not specified anywhere?)
We have an internal mono-repo and some of its packages are required - they’re all available correctly for the controller, and only some are required for the individual tasks, but the “magic” doesn’t happen 😞
That is, the controller does not identify them as a requirement, so they’re not installed in the tasks environment.
It’s just that for the packages argument, ClearML says:
If not provided, packages are automatically added based on the imports used inside the wrapped function.
So… 🤔
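If the auto-detection keeps missing them, I suppose the workaround is listing them explicitly via that argument (a sketch; the step name, function, and package names are placeholders for our mono-repo packages):
pipe.add_function_step(
    name="train",
    function=train_step,
    packages=["our-internal-package==1.0.0"],  # explicit list instead of import-based detection
)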
I can also do this via Mongo directly, but I was hoping to skip the K8S interaction there.
Any follow up thoughts SuccessfulKoala55 or CostlyOstrich36 ?
Right, so where can one find documentation about it? The repo just has the variables without much explanation.
The deferred_init input argument to Task.init is a bool by default, so checking type(deferred_init) == int makes no sense to begin with, and is altering the flow.
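To illustrate (plain Python, not the actual ClearML source):
deferred_init = False                     # the bool default
print(type(deferred_init) == int)         # False - exact type match fails for a bool
print(isinstance(deferred_init, int))     # True - bool is a subclass of int
print(isinstance(deferred_init, bool))    # True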
Last but not least - can I cancel the offline zip creation if I'm not interested in it 🤔
EDIT: I see it's not possible; guess one has to patch ZipFile ...
We have the following, works fine (we also use internal zip packaging for our models):
# register an output model on the current task
model = OutputModel(task=self.task, name=self.job_name, tags=kwargs.get('tags', self.task.get_tags()), framework=framework)
model.connect(task=self.task, name=self.job_name)
# cc_model.save() returns the local path to our zipped model package
model.update_weights(weights_filename=cc_model.save())
@<1523701070390366208:profile|CostlyOstrich36> I added None btw
FWIW running clearml==1.9.1 with WebApp: 1.9.2-317 • Server: 1.9.2-317 • API: 2.23
Happens with the latest version indeed.
I can’t share our code, but the gist of it is:
pipe = PipelineController(name=..., project=..., version=...)
pipe.add_function_step(...) # Many calls
pipe.set_default_execution_queue(...)
pipe.start(queue=..., wait=True)
Nothing I can spot --
ClearML results page:
ClearML pipeline page:
Launching the next 2 steps
Launching step [...]
Launching step [...]
Launching step: ...
Parameters:
{...}
Configurations:
{}
Overrides:
{}
Launching step: ...
Parameters:
{...}
Configurations:
{}
Overrides:
{}
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-21 13:53:48
ClearML Monitor: Could not detect iteration reporting, falling back to itera...
I wouldn't put it past ClearML automation (a lot of stuff depends on certain suffixes), but I don't think that's the case here hmm
So the pipeline runs successfully, I can find all the different tasks, but I cannot see them in the Pipelines tab…
Thanks SuccessfulKoala55 and AgitatedDove14 ! We'll go through the hoops of setting up mongo on AWS then.
We're working to decouple the data from the helm chart; it seems like a dangerous idea to store long-term data on k8s in case of failure 😅
@<1523704157695905792:profile|VivaciousBadger56> It seems like whatever you pickled in the zip file relies on some additional files that are not pickled.
FWIW It’s also listed in other places @<1523704157695905792:profile|VivaciousBadger56> , e.g. None says:
In order to make sure we also automatically upload the model snapshot (instead of saving its local path), we need to pass a storage location for the model files to be uploaded to.
For example, upload all snapshots to an S3 bucket…
I can only say I’ve found ClearML to be very helpful, even given the documentation issue.
I think they’ve been working on upgrading it for a while, hopefully something new comes out soon.
Maybe @<1523701205467926528:profile|AgitatedDove14> has further info 🙂
Well you could start by setting the output_uri to True in Task.init.
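For example (project/task names are placeholders; a bucket URI such as "s3://bucket/folder" works there instead of True):
from clearml import Task

# output_uri=True uploads model snapshots instead of only recording their local paths
task = Task.init(project_name="examples", task_name="training", output_uri=True)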