Hi again CostlyOstrich36,
I just wanted to share what ended up working for me. Basically, I worked it out both for Hydra (thanks CurvedHedgehog15) and for PytorchLightningCLI.
So, for PL-CLI, I used this construct so we don't have to modify our training scripts based on our experiment tracker:
from pytorch_lightning.utilities.cli import LightningCLI
from clearml import Task

class MyCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # init the task
        tas...
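Since the snippet above is truncated, here is a minimal, self-contained sketch of what such a hook can look like. The project/task names and the connect_configuration call are illustrative assumptions, not the original code:

```python
from pytorch_lightning.utilities.cli import LightningCLI
from clearml import Task


class MyCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # Initialize the ClearML task before Lightning instantiates the
        # model/datamodule, so tracking starts without the training script
        # needing to know which experiment tracker is in use.
        task = Task.init(
            project_name="my-project",   # illustrative placeholder
            task_name="vae-baseline",    # illustrative placeholder
        )
        # Optionally log the parsed CLI config to the ClearML UI
        # (assumes jsonargparse's Namespace.as_dict()).
        task.connect_configuration(self.config.as_dict())
```

Because the hook runs before the model and datamodule are instantiated, the same training script works unchanged whether or not ClearML is the tracker in use.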
@<1590514584836378624:profile|AmiableSeaturtle81> that's the service we are using :-)
How much RAM have you assigned to your elastic service?
Perfect! Thanks SuccessfulKoala55, that would be an acceptable workaround until setup_upload also supports Azure 🙂
@<1576381444509405184:profile|ManiacalLizard2> what happens when ES hits the limit? Does it go OOM, or does loading scalars just take a long time in the web UI? And what about tasks putting scalars in the index?
This is an example of the console output of a task aborted via the web UI:
Epoch 1/29 ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 699/16945 0:04:53 • 1:55:25 2.35it/s v_num: 0.000
2024-09-16 12:52:57,263 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - LAUNCHING CALLBACK (timeout 30.0 sec) ###
[2024-09-16 12:52:57,284][core.callbacks.model_checkpoint][INFO] - Marking task as `in_progress`
[2024-09-16 12:52:57,309][core.callbacks.model_checkpoint][INFO] - Saving last checkpoint...
Hi CostlyOstrich36
What I'm seeing is expected behavior:
In my toy example, I have a VAE that is defined by a YAML config file and parsed with the PyTorch Lightning CLI. Part of the config defines the latent dimension (n_latents) and the number of input channels of the decoder (in_channels). These two values need to be the same. When I just use the Lightning CLI, I can use variable interpolation with OmegaConf like this:
class_path: mymodel.VAE
init_args:
  {...}
  bottleneck:
    class_pat...
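As a side note, here is a tiny runnable sketch of the OmegaConf interpolation being described; the keys are illustrative, not the actual config:

```python
from omegaconf import OmegaConf

# decoder.in_channels must match n_latents, so instead of duplicating
# the value we reference it with OmegaConf's ${...} interpolation.
cfg = OmegaConf.create(
    """
    model:
      n_latents: 16
      decoder:
        in_channels: ${model.n_latents}
    """
)

print(cfg.model.decoder.in_channels)  # -> 16, resolved from n_latents
```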
We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run `docker logs clearml-elastic` I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node
.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7s
YM-S9Q/GeoLite2-City.mmdb]...
Hi @<1523701070390366208:profile|CostlyOstrich36> , yeah we figured as much. Is there a setting in the server that limits logging - or disables it completely?
Hi @<1523701087100473344:profile|SuccessfulKoala55>, thanks for responding. I've found out that my first error came from cloning a super old version of the cleanup task in the web UI 🙂
I don't know about the other error. To me it looks like the task gets deleted before errors are handled: since an error occurred (some 404 stuff; maybe the files actually aren't there) when deleting some artifacts on the task, ClearML tries to reload the task and fails with the 400/201 or 400/101. ...
It's actually complementary - the SDK will use the clearml.conf configuration by matching that configuration with the destination you provided
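To illustrate the matching, a sketch under assumptions (the local file, the container path, and the exact Azure URL format are placeholders):

```python
from clearml import StorageManager

# No credentials in code: the azure:// destination is matched against
# the corresponding storage section in clearml.conf.
url = StorageManager.upload_file(
    local_file="model.pkl",  # placeholder local file
    remote_url="azure://account.blob.core.windows.net/container/model.pkl",  # placeholder
)
print(url)
```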
Would you recommend doing both then? :-)
Sure. Really, I'm just using the default client:
# ClearML SDK configuration file
api {
    web_server: http://server.azure.com:8080
    api_server: http://server.azure.com:8008
    files_server: http://server.azure.com:8081
    credentials {
        "access_key" = "..."
        "secret_key" = "..."
    }
}
sdk {
    # ClearML - default SDK configuration
    storage {
        cache {
            # Defaults to system temp folder / cache
            default_base_dir: "~/.clearml/c...
I've tried setting the output_uri on Task.init, but that seems to only affect model checkpoints and artifacts
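A possible explanation, as a hedged sketch: output_uri governs model checkpoints and artifacts, while debug samples (media) follow the logger's own upload destination, which can be set separately. All destinations below are placeholders:

```python
from clearml import Task

task = Task.init(
    project_name="my-project",   # placeholder
    task_name="my-experiment",   # placeholder
    # Affects model checkpoints and artifacts:
    output_uri="azure://account.blob.core.windows.net/container/models",  # placeholder
)

# Debug samples (images/audio) are uploaded by the logger, whose
# destination is configured independently:
task.get_logger().set_default_upload_destination(
    "azure://account.blob.core.windows.net/container/media"  # placeholder
)
```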
SuccessfulKoala55 Thanks for the help. I've setup my client to use my blob storage now, and it works wonderfully.
I've also added a token to my server, so now I can access the audio samples from the server.
Is there a way to add a common token server-side so the other members of the team don't have to create a token?
I also struggle a bit with report_matplotlib_figure(), where the plots do not appear in the web UI. I have implemented the following snippet in my PyTorch Lightning logger:
@...
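Since the snippet above is cut off, here is a minimal hedged example of report_matplotlib_figure() itself (titles, series, and values are illustrative placeholders):

```python
import matplotlib.pyplot as plt
from clearml import Logger

fig = plt.figure()
plt.plot([0, 1, 2], [10, 20, 15])

Logger.current_logger().report_matplotlib_figure(
    title="reconstruction",  # placeholder
    series="val",            # placeholder
    iteration=0,
    figure=fig,
    # With report_image=True the figure is uploaded as a debug sample
    # image instead of being converted to an interactive plot.
    report_image=False,
)
```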
Do you mean to the Web UI?
Yes, that's what I meant, sorry, I'm still coming to terms with ClearML terminology 🙂. Is it possible to store the web app cloud access token server-side so we don't have to input it in the Web UI? 🙂
How does it look in the Web UI?
I just had a look, and they are visible under debug samples, but not under plots, as I had expected.
I thought that by using report_matplotlib_figure it would get grouped under plots? 🙂
Hey SweetBadger76 , thanks for answering. I'll check it out! Does that correspond to filling out azure.storage in the clearml.conf file?
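For reference, the Azure section of clearml.conf looks roughly like this; the account/key/container values are placeholders, so check the reference configuration for your SDK version:

```
sdk {
    azure.storage {
        containers: [
            {
                account_name: "myaccount"       # placeholder
                account_key: "mykey"            # placeholder
                container_name: "mycontainer"   # placeholder
            }
        ]
    }
}
```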
And how do I ensure that the server can access the files from the blob storage?
The server will never access the storage - only the clients (SDK/WebApp etc.) will access it
Oh okay. So that's the reason I can access media when the client and server are running on the same machine?
On the server or the client? :)
@<1590514584836378624:profile|AmiableSeaturtle81> this was last time i tried: https://clearml.slack.com/archives/CTK20V944/p1725534932820309
None for visibility