Hi @<1523701087100473344:profile|SuccessfulKoala55>, thanks for responding. I've found out that my first error came from cloning a super old version of the cleanup task in the web UI 🙂
I don't know about the other error; to me it looks like the task gets deleted before errors are handled, but since an error occurred (some 404 stuff, maybe the files actually aren't there) when deleting some artifacts on the task, ClearML tries to reload the task and fails with the 400/201 or 400/101. ...
Just wanted to share a workaround for using a TriggerScheduler to execute a script using the latest commit of a given branch, without relying on cloning a Task. Don't know if it has been shown before in here 🙂
from clearml import Model, Task
from clearml.automation import TriggerScheduler

def trigger_model_func(model_id: str):
    model = Model(model_id)
    print(f"Triggered model export for model '{model.name}' ({model_id})")
    # NOTE: To execute from the branch of
    # task...
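For context, here's roughly how such a callback can be wired up - a minimal sketch assuming the standard TriggerScheduler API (the trigger name, project name and polling frequency below are placeholders):

from clearml.automation import TriggerScheduler

# Poll the server every few minutes and fire the callback whenever
# a model in the given project gets published.
trigger = TriggerScheduler(pooling_frequency_minutes=3)
trigger.add_model_trigger(
    name="model export trigger",           # placeholder name
    schedule_function=trigger_model_func,  # the callback defined above
    trigger_project="examples",            # placeholder project
    trigger_on_publish=True,
)
trigger.start()  # blocks and keeps polling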
Hey SweetBadger76, thanks for answering. I'll check it out! Does that correspond to filling out azure.storage in the clearml.conf file?
And how do I ensure that the server can access the files from the blob storage?
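For context, this is the azure.storage section I'm referring to - roughly, if I understand the config layout correctly (account and container names below are placeholders):

azure.storage {
    containers: [
        {
            account_name: "my_account"      # placeholder
            account_key: "my_account_key"   # placeholder
            container_name: "my_container"  # placeholder
        }
    ]
}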
How does it look in the Web UI?
I just had a look, and they are visible under debug samples, but not under plots, as I had expected. I thought that by using report_matplotlib_figure it would get grouped under plots? 🙂
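For reference, this is roughly how I'm calling it - a minimal sketch (title/series are placeholders; my understanding is that report_image=False should send the figure to Plots rather than Debug Samples):

import matplotlib.pyplot as plt
from clearml import Logger

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])

# report_image=False -> interactive plot under "Plots"
# report_image=True  -> rasterized image under "Debug Samples"
Logger.current_logger().report_matplotlib_figure(
    title="My figure",  # placeholder
    series="series A",  # placeholder
    iteration=0,
    figure=fig,
    report_image=False,
)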
SuccessfulKoala55 Thanks for the help. I've set up my client to use my blob storage now, and it works wonderfully.
I've also added a token to my server, so now I can access the audio samples from the server.
Is there a way to add a common token serverside so the other members of the team don't have to create a token?
I also struggle a bit with report_matplotlib_figure(), where the plots do not appear in the web UI. I have implemented the following snippet in my PyTorch Lightning logger:
@...
On the server or the client? :)
We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run docker logs clearml-elastic, I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node
.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7s
YM-S9Q/GeoLite2-City.mmdb]...
Hi CostlyOstrich36
What I'm seeing is expected behavior:
In my toy example, I have a VAE which is defined by a YAML config file and parsed with the PyTorch Lightning CLI. Part of the config defines the latent dimension (n_latents) and the number of input channels of the decoder (in_channels). These two values need to be the same. When I just use the Lightning CLI, I can use variable interpolation with OmegaConf like this:
class_path: mymodel.VAE
init_args:
  {...}
  bottleneck:
    class_pat...
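Roughly like this, with OmegaConf's ${...} interpolation syntax (class paths and values below are placeholders):

model:
  class_path: mymodel.VAE
  init_args:
    n_latents: 32                   # single source of truth
    bottleneck:
      class_path: mymodel.Decoder   # placeholder class path
      init_args:
        in_channels: ${model.init_args.n_latents}  # resolved by OmegaConf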
The lightning folks won't include new loggers anymore (since mid-2022, see None) 🙂
No, not at all. I reckon we started seeing errors around the middle of last week. We are using default settings for everything except some password-related settings on the server.
This is an example of the console output of a task aborted via the webUI:
Epoch 1/29 ━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 699/16945 0:04:53 • 1:55:25 2.35it/s v_num: 0.000
2024-09-16 12:52:57,263 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - LAUNCHING CALLBACK (timeout 30.0 sec) ###
[2024-09-16 12:52:57,284][core.callbacks.model_checkpoint][INFO] - Marking task as `in_progress`
[2024-09-16 12:52:57,309][core.callbacks.model_checkpoint][INFO] - Saving last checkpoint...
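For context, that WARNING comes from the user-abort callback mechanism; a minimal sketch of how such a callback can be registered (assuming the register_abort_callback API; the project/task names and callback body are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="train")  # placeholders

def on_abort():
    # Called when the task is aborted from the web UI, e.g. to save
    # a last checkpoint before the process is killed.
    print("Saving last checkpoint...")

# The timeout is how long ClearML waits for the callback to finish
task.register_abort_callback(on_abort, callback_execution_timeout=30.0)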
Hi @<1523701070390366208:profile|CostlyOstrich36>, the task is being aborted via the web UI - I have another method that catches local interrupts (exceptions like keyboard interrupts and crashes). The behavior is the same whether tasks are run via agents or from the local CLI.
I just tried, and the result is the same. The other method only triggers on exceptions.
Hi again CostlyOstrich36,
I just wanted to share what ended up working for me. Basically I worked it out both for Hydra (thanks CurvedHedgehog15) and for the PyTorch Lightning CLI.
So, for the PL CLI, I used this construct so we don't have to modify our training scripts based on our experiment tracker:
from pytorch_lightning.utilities.cli import LightningCLI
from clearml import Task

class MyCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # init the task
        tas...
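A sketch of how that hook can look when completed (project/task names are placeholders; before_instantiate_classes runs before the model and trainer are built, so everything after it gets auto-logged):

from pytorch_lightning.utilities.cli import LightningCLI
from clearml import Task

class MyCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # Create/attach the ClearML task before Lightning instantiates
        # the model and trainer, so auto-logging captures everything.
        Task.init(
            project_name="examples",  # placeholder
            task_name="pl-cli-run",   # placeholder
        )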
Do you mean to the Web UI?
Yes, that's what I meant - sorry, I'm still coming to terms with ClearML terminology 🙂. Is it possible to store the web app cloud access token server-side so we don't have to input it in the Web UI? 🙂
The server will never access the storage - only the clients (SDK/WebApp etc.) will access it
Oh okay. So that's the reason I can access media when the client and server are running on the same machine?
Thanks for responding @<1523701087100473344:profile|SuccessfulKoala55>. Good question! One solution could be to create a new open-source project with lightning + clearml integrations and link it to the Lightning ecosystem-ci; I believe most people use the basic TensorBoard logger with ClearML, but the extended use case of a ClearML model-checkpoint callback might make it valuable.
I guess one would have to disable auto-logging of p...
I've tried setting the output_uri on Task.init, but that seems to only affect model checkpoints and artifacts.
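For illustration, a minimal sketch of what I mean (the URIs are placeholders; as far as I can tell, debug samples/media are controlled separately, e.g. via Logger.set_default_upload_destination):

from clearml import Logger, Task

task = Task.init(
    project_name="examples",  # placeholder
    task_name="train",        # placeholder
    output_uri="azure://my_account/my_container/models",  # models & artifacts
)

# Media / debug samples go to a separately configured destination:
Logger.current_logger().set_default_upload_destination(
    "azure://my_account/my_container/media"  # placeholder
)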
None for visibility
Well, consider the case where you start the trigger scheduler on commit A, then you do some work that defines a new model and commit it as commit B, train some model, and now you want to export/deploy the model by publishing it and tagging it with some tag that triggers the export, as in your example. The scheduler will then fail, because the model is not implemented at commit A.
Anyways, I think I've solved it - I'll post the workaround when I get around to it 🙂
You can create a task in the t...
CostlyOstrich36, any thoughts on how we can further debug this? It's making ClearML practically useless for us.
I don't have issues with setting the hyperparameters - I just would like to link changes to one hyperparameter (e.g. encoder.layers) to another parameter (e.g. decoder.in_layers) when optimizing over encoder.layers.
diff --git a/docker-compose.yml b/docker-compose.diff.yml
index c6b49e1..07f7f43 100644
--- a/docker-compose.yml
+++ b/docker-compose.diff.yml
@@ -5,7 +5,7 @@ services:
command:
- apiserver
container_name: clearml-apiserver
- image: allegroai/clearml:1.15.0
+ image: allegroai/clearml:latest
restart: unless-stopped
volumes:
- /opt/clearml/logs:/var/log/clearml
@@ -19,17 +19,18 @@ services:
environment:
CLEARML_ELASTIC_SERVICE_HOST: elastics...
Sorry for the late reply @<1722061389024989184:profile|ResponsiveKoala38>. So this is the diff for my local version (everything hosted together on a single server with docker-compose). Does anything spring to mind?
It's running v7.17.18 @<1722061389024989184:profile|ResponsiveKoala38>
Hi @<1523701070390366208:profile|CostlyOstrich36>
Is 87G a lot for an index? Enough that you would consider adding more RAM?
And also, how can I check that we are not storing scalars for deleted tasks? ClearML used to write a lot of errors in the cleanup script, although that seems to have been fixed in recent updates
@<1722061389024989184:profile|ResponsiveKoala38> cool, thanks! I guess it will be straightforward to script, then.
What is your gut feeling regarding the size of the index? Is 87G a lot for an Elasticsearch index?
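For the scripting part, something along these lines should list index sizes (assuming the default docker-compose setup where Elasticsearch listens on localhost:9200; _cat/indices is the standard ES endpoint):

# List indices sorted by on-disk size, largest first
curl -s 'http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size&s=store.size:desc'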
@<1590514584836378624:profile|AmiableSeaturtle81> this was last time i tried: https://clearml.slack.com/archives/CTK20V944/p1725534932820309