diff --git a/docker-compose.yml b/docker-compose.diff.yml
index c6b49e1..07f7f43 100644
--- a/docker-compose.yml
+++ b/docker-compose.diff.yml
@@ -5,7 +5,7 @@ services:
command:
- apiserver
container_name: clearml-apiserver
- image: allegroai/clearml:1.15.0
+ image: allegroai/clearml:latest
restart: unless-stopped
volumes:
- /opt/clearml/logs:/var/log/clearml
@@ -19,17 +19,18 @@ services:
environment:
CLEARML_ELASTIC_SERVICE_HOST: elastics...
It's running v7.17.18 @ResponsiveKoala38
Sorry for the late reply @ResponsiveKoala38. So this is the diff between my local version (hosted together on a single server with docker-compose). Does anything spring to mind?
@ResponsiveKoala38 cool, thanks! I guess it will be straightforward to script then.
What is your gut feeling regarding the size of the index? Is 87G a lot for an Elasticsearch index?
Any tips on how to check if we are storing data on deleted tasks? Maybe @ResponsiveKoala38 knows? Is there a field on each scalar that I can cross-check with ClearML?
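In case it's useful to anyone later, this is the kind of cross-check I had in mind - a rough sketch only, assuming the scalar events sit in Elasticsearch indices matching events-training_stats_scalar* and carry a task field (both assumptions, not verified against the server schema):
from clearml import Task
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed address of the clearml-elastic container

# Collect the distinct task IDs referenced by scalar events; the index pattern
# and the "task" field name are assumptions about the ClearML schema.
resp = es.search(
    index="events-training_stats_scalar*",
    size=0,
    aggs={"task_ids": {"terms": {"field": "task", "size": 10000}}},
)
task_ids = [b["key"] for b in resp["aggregations"]["task_ids"]["buckets"]]

# Cross-check each ID against the ClearML server; IDs that no longer resolve
# would indicate scalars kept for deleted tasks.
orphaned = []
for tid in task_ids:
    try:
        Task.get_task(task_id=tid)
    except Exception:
        orphaned.append(tid)
print(f"{len(orphaned)} task IDs in the scalar index have no matching ClearML task")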
The Lightning folks won't include new loggers anymore (since mid-2022, see None)
No, not at all. I reckon we started seeing errors around mid-last week. We are using default settings for everything except some password-related settings on the server.
CostlyOstrich36 any thoughts on how we can debug this further? It's making ClearML practically useless for us.
Sure. Really, I'm just using the default client config:
# ClearML SDK configuration file
api {
web_server: http://server.azure.com:8080
api_server: http://server.azure.com:8008
files_server: http://server.azure.com:8081
credentials {
"access_key" = "..."
"secret_key" = "..."
}
}
sdk {
# ClearML - default SDK configuration
storage {
cache {
# Defaults to system temp folder / cache
default_base_dir: "~/.clearml/c...
I just tried and the result is the same. The other method only triggers on exceptions
Perfect! Thanks SuccessfulKoala55, that would be an acceptable workaround until setup_upload also supports Azure
Yeah, that makes sense. The only drawback is that you'll get a single point that all lines will go through in the Parallel Coordinates plot when the optimization finishes
Hi again CostlyOstrich36 ,
I just wanted to share what ended up working for me. Basically I worked it out both for Hydra (thanks CurvedHedgehog15) and for PytorchLightningCLI.
So, for PL-CLI, I used this construct so we don't have to modify our training scripts based on our experiment tracker:
from pytorch_lightning.utilities.cli import LightningCLI
from clearml import Task

class MyCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # init the task
        tas...
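For anyone who finds this later, a fuller sketch of what that construct can look like - the project/task names and the connect call are illustrative, not the exact code from our repo:
from pytorch_lightning.utilities.cli import LightningCLI
from clearml import Task

class MyCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # Create the ClearML task before the model/datamodule are built,
        # so everything that follows gets logged to it.
        task = Task.init(project_name="my-project", task_name="pl-cli-run")  # hypothetical names
        # self.config holds the parsed CLI/YAML configuration at this point;
        # connecting it makes the full config visible in the ClearML UI.
        # (as_dict() assumes self.config is a jsonargparse Namespace)
        task.connect_configuration(self.config.as_dict(), name="lightning-cli")

# Instantiate MyCLI(...) in your entry point exactly as you would a plain LightningCLI.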
I've tried setting the output_uri on Task.init, but that seems to only affect model checkpoints and artifacts
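For reference, this is roughly what I tried - the names and the Azure path are placeholders:
from clearml import Task

# output_uri sets the upload destination for model checkpoints and artifacts;
# as noted above, it doesn't seem to redirect debug media (images/audio) on its own.
task = Task.init(
    project_name="my-project",                 # placeholder
    task_name="output-uri-test",               # placeholder
    output_uri="azure://mycontainer/clearml",  # placeholder Azure destination
)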
The server will never access the storage - only the clients (SDK/WebApp etc.) will access it
Oh okay. So that's the reason I can access media when the client and server are running on the same machine?
It's actually complementary - the SDK will use the clearml.conf configuration by matching that configuration with the destination you provided
Would you recommend doing both then? :-)
Hi @CostlyOstrich36
Is 87G a lot for an index? Enough that you would consider adding more RAM?
And also, how can I check that we are not storing scalars for deleted tasks? ClearML used to write a lot of errors in the cleanup script, although that seems to have been fixed in recent updates
I think that you are absolutely correct. Thanks for the pointer!
@AmiableSeaturtle81 that's the service we are using :-)
How much RAM have you assigned to your elastic service?
Yes, I tried updating recently; it cost me a full day's work of rolling back versions until I found something that worked.
Hi CurvedHedgehog15, thanks for replying!
I guess that one could modify the config with variable interpolation (similar to how it's done in YAML, e.g. ${encoder.layers}) - however, it seems quite invasive to specify that in our trainer script.
None for visibility
Hi CostlyOstrich36
What I'm seeing is expected behavior:
In my toy example, I have a VAE which is defined by a YAML config file and parsed with the PyTorch Lightning CLI. Part of the config defines the latent dimension (n_latents) and the number of input channels of the decoder (in_channels). These two values need to be the same. When I just use the Lightning CLI, I can use variable interpolation with OmegaConf like this:
class_path: mymodel.VAE
init_args:
  {...}
  bottleneck:
    class_pat...
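As a rough illustration of the interpolation mechanism (the keys are made up, and it resolves the reference in plain Python rather than through Lightning):
from omegaconf import OmegaConf

# The decoder's in_channels is tied to the model's n_latents via interpolation,
# so the two values cannot drift apart in the config.
cfg = OmegaConf.create({
    "model": {
        "init_args": {
            "n_latents": 16,
            "bottleneck": {"init_args": {"in_channels": "${model.init_args.n_latents}"}},
        }
    }
})
print(OmegaConf.to_container(cfg, resolve=True)["model"]["init_args"]["bottleneck"])
# -> {'init_args': {'in_channels': 16}}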
Hi @CostlyOstrich36, the task is being aborted via the web UI - I have another method that catches local interrupts (exceptions like keyboard interrupts and crashes). The behavior is the same whether the task runs via an agent or just from the local CLI.
We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run docker logs clearml-elastic I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node
.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7s
YM-S9Q/GeoLite2-City.mmdb]...
Which version of the server are you running?
Sure. I'll give it a few minor releases and then try again. Thanks for the responses @ResponsiveKoala38!
This is an example of the console output of a task aborted via the web UI:
Epoch 1/29 ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 699/16945 0:04:53 • 1:55:25 2.35it/s v_num: 0.000
2024-09-16 12:52:57,263 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - LAUNCHING CALLBACK (timeout 30.0 sec) ###
[2024-09-16 12:52:57,284][core.callbacks.model_checkpoint][INFO] - Marking task as `in_progress`
[2024-09-16 12:52:57,309][core.callbacks.model_checkpoint][INFO] - Saving last checkpoint...
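(For context, a callback like the one firing above can be registered roughly as in this minimal sketch; Task.register_abort_callback and the 30-second timeout are assumptions based on the log line, and the checkpoint logic is a placeholder.)
from clearml import Task

task = Task.init(project_name="my-project", task_name="abort-callback-demo")  # hypothetical names

def on_abort():
    # Placeholder for the real logic: mark the task state and save the last
    # checkpoint before the process is torn down.
    print("User aborted from the web UI - saving last checkpoint...")

# Assumption: register_abort_callback is available in recent clearml SDK versions;
# the callback gets a limited time budget (the "timeout 30.0 sec" in the log) to run.
task.register_abort_callback(callback_function=on_abort, callback_execution_timeout=30.0)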