
CostlyOstrich36 any thought on how we can further debug this? It's making ClearML practically useless for us
but are model files easier to serve?
Hey SweetBadger76 , thanks for answering. I'll check it out! Does that correspond to filling out azure.storage
in the clearml.conf file?
And how do I ensure that the server can access the files from the blob storage?
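For reference, the azure.storage section mentioned above would look roughly like this in clearml.conf - a sketch only, with placeholder account/container names (the real keys come from your Azure storage account):

```
sdk {
    azure.storage {
        containers: [
            {
                account_name: "myaccount"      # placeholder
                account_key: "..."             # storage account access key
                container_name: "mycontainer"  # placeholder
            }
        ]
    }
}
```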
It's running v7.17.18 @<1722061389024989184:profile|ResponsiveKoala38>
Sure. I'll give it a few minor releases and then try again 🙂 Thanks for the responses @<1722061389024989184:profile|ResponsiveKoala38> !
The server will never access the storage - only the clients (SDK/WebApp etc.) will access it
Oh okay. So that's the reason I can access media when the client and server are running on the same machine?
Sure. Really, I'm just using the default client:
` # ClearML SDK configuration file
api {
    web_server: http://server.azure.com:8080
    api_server: http://server.azure.com:8008
    files_server: http://server.azure.com:8081
    credentials {
        "access_key" = "..."
        "secret_key" = "..."
    }
}
sdk {
    # ClearML - default SDK configuration
    storage {
        cache {
            # Defaults to system temp folder / cache
            default_base_dir: "~/.clearml/c...
Hi CurvedHedgehog15 , thanks for replying!
I guess that one could modify the config with variable interpolation (similar to how it's done in YAML, e.g. ${encoder.layers}) - however, it seems to be quite invasive to specify that in our trainer script 🙁
We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run `docker logs clearml-elastic` I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7sYM-S9Q/GeoLite2-City.mmdb]...
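If those geoip INFO lines are just noise, Elasticsearch (7.14+) has a setting to switch the geoip downloader off entirely. A sketch of how that might look as a docker-compose override for the Elasticsearch service - the service name may differ in your docker-compose.yml, so treat this as an assumption to verify:

```
# docker-compose override excerpt - a sketch, not the full file
services:
  elasticsearch:
    environment:
      - ingest.geoip.downloader.enabled=false
```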
Yeah, that makes sense. The only drawback is that you'll get a single point that all lines will go through in the Parallel Coordinates plot when the optimization finishes 🙂
Sorry for the late reply @<1722061389024989184:profile|ResponsiveKoala38> . So this is the diff between my local version (hosted together on a single server with docker-compose). Does anything spring to mind?
Hi CostlyOstrich36
What I'm seeing is expected behavior:
In my toy example, I have a VAE which is defined by a YAML config file and parsed with the PyTorch Lightning CLI. Part of the config defines the latent dimension (n_latents) and the number of input channels of the decoder (in_channels). These two values need to be the same. When I just use the Lightning CLI, I can use variable interpolation with OmegaConf like this:
` class_path: mymodel.VAE
init_args:
{...}
bottleneck:
class_pat...
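To make the interpolation idea concrete outside of OmegaConf, here is a minimal toy resolver for ${dotted.path} references - the config keys mirror the VAE example above, but the resolver itself is only an illustration of the mechanism, not OmegaConf's actual implementation:

```python
import re

def resolve(cfg, root=None):
    """Recursively resolve ${dotted.path} references in a nested dict.

    Toy stand-in for OmegaConf-style interpolation, for illustration only.
    """
    root = cfg if root is None else root
    if isinstance(cfg, dict):
        return {key: resolve(val, root) for key, val in cfg.items()}
    if isinstance(cfg, str):
        match = re.fullmatch(r"\$\{([\w.]+)\}", cfg)
        if match:
            node = root
            for part in match.group(1).split("."):
                node = node[part]  # walk the dotted path from the config root
            return node
    return cfg

config = {
    "n_latents": 8,
    "decoder": {"in_channels": "${n_latents}"},  # must always equal n_latents
}
resolved = resolve(config)
# resolved["decoder"]["in_channels"] is now 8, kept in sync with n_latents
```

With interpolation, the constraint "these two values must be the same" lives in one place in the config instead of being duplicated by hand.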
@<1722061389024989184:profile|ResponsiveKoala38> cool, thanks! I guess it will be straightforward to script then.
What is your gut feeling regarding the size of the index? Is 87G a lot for an Elasticsearch index?
@<1576381444509405184:profile|ManiacalLizard2> what happens when ES hits the limit? Does it go OOM, or does the scalars loading just take a long time in the web-ui? And what about tasks putting scalars in the index?
Hi CostlyOstrich36 , thanks for answering. We are using compute instances through the Machine Learning Studio in Azure. They basically work by spinning up an instance, loading a docker-image and executing a specific script in a folder that you upload along with the docker-image. Nothing is persisted between runs and there is no clear notion of a "user" (when thinking of ~/.clearml.conf at least).
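For ephemeral instances like these, one option is to skip ~/.clearml.conf entirely and supply the connection details through the SDK's CLEARML_* environment variables before the script touches ClearML. A sketch, reusing the server URLs from the config earlier in the thread; the keys are placeholders you'd inject via your job's secret mechanism:

```python
import os

# The ClearML SDK picks these up at startup, so no clearml.conf file
# is needed on the (stateless) compute instance.
os.environ.update({
    "CLEARML_API_HOST": "http://server.azure.com:8008",
    "CLEARML_WEB_HOST": "http://server.azure.com:8080",
    "CLEARML_FILES_HOST": "http://server.azure.com:8081",
    "CLEARML_API_ACCESS_KEY": "...",  # placeholder - inject from secrets
    "CLEARML_API_SECRET_KEY": "...",  # placeholder - inject from secrets
})
```

Setting them in the docker image's entrypoint (or the uploaded launch script) works the same way, since nothing needs to persist between runs.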
SuccessfulKoala55 yeah, sorry, should have mentioned that our storage is also Azure (blob sto...
I think that you are absolutely correct. Thanks for the pointer!
Sorry, I got caught up by other tasks. I might investigate further later, but it's not top of mind right now. Our main issue is to get people to archive their old tasks and models so they can be cleaned up 🙂
Hi @<1523701070390366208:profile|CostlyOstrich36>
Is 87G a lot for an index? Enough that you would consider adding more RAM?
And also, how can I check that we are not storing scalars for deleted tasks? ClearML used to write a lot of errors in the cleanup script, although that seems to have been fixed in recent updates
Hi again CostlyOstrich36 ,
I just wanted to share what ended up working for me. Basically I worked it out both for Hydra (thanks CurvedHedgehog15 ) and for the PyTorch Lightning CLI.
So, for PL-CLI, I used this construct so we don't have to modify our training scripts based on our experiment tracker
` from pytorch_lightning.utilities.cli import LightningCLI
from clearml import Task

class MyCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # init the task
        tas...
It's actually complementary - the SDK will use the clearml.conf configuration by matching that configuration with the destination you provided
Would you recommend doing both then? :-)
No, not at all. I reckon we started seeing errors around mid-last week. We are using default settings for everything except some password stuff on the server.
Well, consider the case where you start the trigger scheduler on commit A, then you do some work that defines a new model and commit it as commit B, train a model, and now you want to export/deploy the model by publishing it and tagging it with some tag that triggers the export, as in your example. The scheduler will then fail, because the model is not implemented at commit A.
Anyways, I think I've solved it, I'll post the workaround when I get around to it 🙂
You can create a task in the t...
SuccessfulKoala55 Thanks for the help. I've setup my client to use my blob storage now, and it works wonderfully.
I've also added a token to my server, so now I can access the audio samples from the server.
Is there a way to add a common token serverside so the other members of the team don't have to create a token?
I also struggle a bit with report_matplotlib_figure(), where plots do not appear in the web UI. I have implemented the following snippet in my PyTorch Lightning logger:
` @...
The Lightning folks won't include new loggers anymore (since mid-2022, see None ) 🙁
On the server or the client? :)
This is an example of the console output of a task aborted via the webUI:
Epoch 1/29 ━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 699/16945 0:04:53 • 1:55:25 2.35it/s v_num: 0.000
2024-09-16 12:52:57,263 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - LAUNCHING CALLBACK (timeout 30.0 sec) ###
[2024-09-16 12:52:57,284][core.callbacks.model_checkpoint][INFO] - Marking task as `in_progress`
[2024-09-16 12:52:57,309][core.callbacks.model_checkpoint][INFO] - Saving last checkpoint...
Hi @<1523701070390366208:profile|CostlyOstrich36> , yeah we figured as much. Is there a setting in the server that limits logging - or disables it completely?
Well, one solution could be to say that models can only be exported from main/master and then have devops start a new trigger on PR completion. That would require some logic for stopping the existing TriggerScheduler, but that shouldn't be too difficult.
However, the most flexible solution would be to have some way of triggering the execution of a script in the parent task environment, something along the lines of clearml-agent build ...
. I just can't wrap my head around triggering that ...