Just wanted to share a workaround for using a TriggerScheduler to execute a script using the latest commit of a given branch, without relying on cloning a Task. Don't know if it has been shown before in here 🙂
from clearml import Model, Task
from clearml.automation import TriggerScheduler

def trigger_model_func(model_id: str):
    model = Model(model_id)
    print(f"Triggered model export for model '{model.name}' ({model_id})")
    # NOTE: To execute from the branch of
    # task...
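For completeness, wiring it up looks roughly like this (a sketch, not verbatim from my setup; the project name, tag, and polling interval are placeholders):

from clearml.automation import TriggerScheduler

# Poll the server for matching events every few minutes
scheduler = TriggerScheduler(pooling_frequency_minutes=3)

# Fire trigger_model_func whenever a model in the (placeholder) project
# gets the (placeholder) 'export' tag
scheduler.add_model_trigger(
    schedule_function=trigger_model_func,
    name="model-export-trigger",
    trigger_project="my_project",
    trigger_on_tags=["export"],
)

# Blocks and serves triggers; start_remotely() on a services queue also works
scheduler.start()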
Hi CostlyOstrich36
I have created a base task on which I'm optimizing hyperparameters. With clearml-param-search I could use --params-override to set a static parameter that should not be optimized, e.g. changing the number of epochs for all experiments. It seems to me that this capability is not present in HyperParameterOptimizer. Does that make sense?
From the example on https://clear.ml/docs/latest/docs/apps/clearml_param_search/:
clearml-param-search {...} --p...
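The closest workaround I can come up with is pinning the static parameter with a single-value DiscreteParameterRange, roughly like this (a sketch; the task ID, parameter names, and metric are placeholders):

from clearml.automation import (
    DiscreteParameterRange,
    HyperParameterOptimizer,
    RandomSearch,
    UniformIntegerParameterRange,
)

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",  # placeholder
    hyper_parameters=[
        # parameter that is actually optimized
        UniformIntegerParameterRange("General/encoder.layers", min_value=2, max_value=8),
        # static "override": a discrete range with a single value
        DiscreteParameterRange("General/epochs", values=[20]),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=RandomSearch,
    execution_queue="default",
)
optimizer.start()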
Hi CostlyOstrich36 , thanks for answering. We are using compute instances through the Machine Learning Studio in Azure. They basically work by spinning up an instance, loading a Docker image, and executing a specific script from a folder that you upload along with the image. Nothing is persisted between runs and there is no clear notion of a "user" (when thinking of ~/.clearml.conf at least).
SuccessfulKoala55 yeah, sorry, should have mentioned that our storage is also Azure (blob sto...
Sure. I'll give it a few minor releases and then try again 🙂 Thanks for the responses @<1722061389024989184:profile|ResponsiveKoala38> !
Yeah, that makes sense. The only drawback is that you'll get a single point that all lines will go through in the Parallel Coordinates plot when the optimization finishes 🙂
I think that you are absolutely correct. Thanks for the pointer!
CostlyOstrich36 any thoughts on how we can further debug this? It's making ClearML practically useless for us
Thanks for responding @<1523701087100473344:profile|SuccessfulKoala55> . Good question! One solution could be to create a new open-source project with lightning + clearml integrations and link it to the Lightning ecosystem-ci; I believe most people use the basic tensorboard-logger with ClearML, but the extended use case of a ClearML model checkpoint callback might make it valuable.
I guess one would have to disable auto-logging of p...
Specifically, this is what I get in the console log when the agent spins up a task:
Poetry Enabled: Ignoring requested python packages, using repository poetry lock file!
Creating virtualenv latent-features in /data/clearml/venvs-builds/3.9/task_repository/our-repo/.venv
Installing dependencies from lock file
It's running v7.17.18 @<1722061389024989184:profile|ResponsiveKoala38>
I don't have issues with setting the hyperparameters - I just would like to link changes to one hyperparameter (e.g. encoder.layers) to another parameter (e.g. decoder.in_layers) when optimizing over encoder.layers
but are model files easier to serve?
Hi @<1523701070390366208:profile|CostlyOstrich36> , the task is being aborted via the web UI - I have another method that catches local interrupts (exceptions like keyboard interrupts and crashes). The behavior is the same whether the task runs via an agent or just the local CLI.
Sorry for the late reply @<1722061389024989184:profile|ResponsiveKoala38> . So this is the diff between my local version (hosted together on a single server with docker-compose) and the stock docker-compose.yml. Does anything spring to mind?
The lightning folks won't include new loggers anymore (since mid-2022) 🙂
Sorry, I got caught up by other tasks. I might investigate further later, but it's not top of mind right now. Our main issue is to get people to archive their old tasks and models so they can be cleaned up 😄
diff --git a/docker-compose.yml b/docker-compose.diff.yml
index c6b49e1..07f7f43 100644
--- a/docker-compose.yml
+++ b/docker-compose.diff.yml
@@ -5,7 +5,7 @@ services:
command:
- apiserver
container_name: clearml-apiserver
- image: allegroai/clearml:1.15.0
+ image: allegroai/clearml:latest
restart: unless-stopped
volumes:
- /opt/clearml/logs:/var/log/clearml
@@ -19,17 +19,18 @@ services:
environment:
CLEARML_ELASTIC_SERVICE_HOST: elastics...
Hi @<1523701070390366208:profile|CostlyOstrich36>
Is 87G a lot for an index? Enough that you would consider adding more RAM?
And also, how can I check that we are not storing scalars for deleted tasks? ClearML used to write a lot of errors in the cleanup script, although that seems to have been fixed in recent updates
Hi CurvedHedgehog15 , thanks for replying!
I guess that one could modify the config with variable interpolation (similar to how it's done in YAML, e.g. ${encoder.layers}) - however, it seems to be quite invasive to specify that in our trainer script 😞
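For reference, this is roughly the kind of interpolation I mean, sketched with OmegaConf (purely illustrative, our trainer doesn't necessarily use OmegaConf):

from omegaconf import OmegaConf

# decoder.in_layers is linked to encoder.layers via interpolation, so an
# override of encoder.layers automatically propagates to the decoder
cfg = OmegaConf.create(
    {
        "encoder": {"layers": 4},
        "decoder": {"in_layers": "${encoder.layers}"},
    }
)
print(cfg.decoder.in_layers)  # -> 4, resolved from encoder.layers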
@<1523701070390366208:profile|CostlyOstrich36> any thoughts? Are the model files themselves easier to serve?
I just tried and the result is the same. The other method only triggers on exceptions
Well, one solution could be to say that models can only be exported from main/master and then have devops start a new trigger on PR completion. That would require some logic for stopping the existing TriggerScheduler, but that shouldn't be too difficult.
However, the most flexible solution would be to have some way of triggering the execution of a script in the parent task environment, something along the lines of clearml-agent build .... I just can't wrap my head around triggering that ...
Which version of the server are you running?
@<1722061389024989184:profile|ResponsiveKoala38> cool, thanks! I guess it will then be straightforward to script.
What is your gut feeling regarding the size of the index? Is 87G a lot for an Elasticsearch index?
Any tips on how to check if we are storing data on deleted tasks? Maybe @<1722061389024989184:profile|ResponsiveKoala38> knows? Is there a field on each scalar that I can cross check with ClearML?
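Something like this is what I'm imagining for the cross-check (a sketch; it assumes each scalar event document carries a task field and that the index pattern matches our server, neither of which I've verified):

from clearml.backend_api.session.client import APIClient
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: default ES endpoint

# Collect the distinct task ids referenced by scalar events
# (assumption: each event document has a 'task' field)
resp = es.search(
    index="events-training_stats_scalar-*",
    body={"size": 0, "aggs": {"task_ids": {"terms": {"field": "task", "size": 10000}}}},
)
event_task_ids = {b["key"] for b in resp["aggregations"]["task_ids"]["buckets"]}

# Cross-check against the tasks the API server still knows about
# (for very large id sets this should be batched)
client = APIClient()
known_ids = {t.id for t in client.tasks.get_all(id=list(event_task_ids))}
orphaned = event_task_ids - known_ids
print(f"{len(orphaned)} task ids have scalars but no task, e.g. {sorted(orphaned)[:10]}")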
Yes, I tried updating recently, it cost me a full day's work of rolling back versions until I found something that worked 😅
Hi Martin,
It doesn't seem to work with dev.azure though:
Using user/pass credentials - replacing ssh url 'git@ssh.dev.azure.com:v3/ORG/TEAM/PROJECT' with https url '...'
fatal: repository '...' not found
The expected format for the https protocol is https://dev.azure.com/ORG/TEAM/_git/PROJECT .
Thoughts @<1523701205467926528:profile|AgitatedDove14> ?
No, not at all. I reckon we started seeing errors around mid-last week. We are using default settings for everything except some password-stuff on the server.
Well, consider the case where you start the trigger scheduler on commit A, then you do some work that defines a new model and commit it as commit B, train the model, and now you want to export/deploy it by publishing it and tagging it with a tag that triggers the export, as in your example. The scheduler will then fail, because the model is not implemented at commit A.
Anyways, I think I've solved it, I'll post the workaround when I get around to it 🙂
You can create a task in the t...
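Roughly along these lines (a sketch; project, repo URL, branch, script path, and queue are placeholders):

from clearml import Task

# Create a brand-new task that checks out the latest commit of the given
# branch at run time, instead of cloning an existing task (which pins a commit)
task = Task.create(
    project_name="exports",
    task_name="model-export",
    repo="git@github.com:org/repo.git",
    branch="main",  # no commit pinned: the agent pulls the branch head
    script="scripts/export_model.py",
)
Task.enqueue(task, queue_name="default")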