data:image/s3,"s3://crabby-images/ea8fc/ea8fc4a242d3fbf9f124d8906a48b69b89ea53a2" alt="Profile picture"
Reputation
Badges 1
25 × Eureka!Hi UnevenDolphin73
Maybe. When the container spins up, are there any identifiers regarding the Task, etc., available?
You mean at the container level or at the ClearML level?
I create a folder on the bucket per `python train.py` run so that the environment-variable files don't get overwritten if two users execute almost simultaneously.
Nice! I have an idea: how about per user ID? Then they can access their "secrets" based on the owner of the Task (`task.data.user`).
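A minimal sketch of the per-user idea, assuming the secrets live under a hypothetical `secrets/<user_id>/` prefix in the bucket. The pure helper builds the prefix; the ClearML part only reads the Task owner through `Task.current_task()` and `task.data.user`. The folder layout and both helper names are illustrative, not part of ClearML.

```python
import re
from typing import Optional


def user_secrets_prefix(bucket: str, user_id: str) -> str:
    """Build a per-user folder URI such as s3://my-bucket/secrets/<user_id>/.

    The "secrets/<user_id>/" layout is a hypothetical convention.
    """
    # Only allow characters that are safe inside an object-store key.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", user_id):
        raise ValueError(f"unexpected user id: {user_id!r}")
    return f"{bucket.rstrip('/')}/secrets/{user_id}/"


def current_task_owner() -> Optional[str]:
    """Return the ID of the user who owns the running Task, or None."""
    from clearml import Task  # lazy import: the helper above needs no ClearML

    task = Task.current_task()
    return task.data.user if task is not None else None
```

Inside a running Task you would call `user_secrets_prefix("s3://my-bucket", owner)` with `owner = current_task_owner()`, after checking that `owner` is not `None`. Two users launching at the same moment then read and write different prefixes.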
It works. However, it still sometimes takes a strangely long time for the agent to pick up the next task (or process it), even if it is only "Hello World".
The agent checks every 2 to 5 seconds whether there is a new Task to launch; could that be it?
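To see why even a "Hello World" task can wait a few seconds, here is a simplified model of a polling agent. It is not clearml-agent's actual code; `fetch_next` is a hypothetical stand-in for asking the queue for the next Task ID. A task enqueued just after an empty poll waits up to one full interval before it is picked up.

```python
import time
from typing import Callable, List, Optional


def poll_queue(
    fetch_next: Callable[[], Optional[str]],
    interval_sec: float,
    max_polls: int,
) -> List[str]:
    """Simplified polling loop: ask for a task, run it, otherwise sleep.

    A task that arrives right after an empty poll waits up to
    `interval_sec` before the next poll notices it.
    """
    picked: List[str] = []
    for _ in range(max_polls):
        task_id = fetch_next()
        if task_id is not None:
            picked.append(task_id)  # a real agent would launch the task here
        else:
            time.sleep(interval_sec)  # idle: wait before asking again
    return picked
```

With a 5-second interval the worst-case pickup delay is about 5 seconds per queued task, independent of how short the task itself is.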
Hi WackyRabbit7
First, always check the methods on the Task object; they are the most straightforward access to the system.
Then, if you need general-purpose API calls, they are currently documented only in the docstrings of the API schema (that said, they should be quite well documented).
You can check all the endpoints https://github.com/allegroai/trains/tree/master/trains/backend_api/services/v2_8
And finally, if you want to easily use the REST API:
` from trains.backend_api.session.client impo...
Then running by using the ..., am I right?
yep
I have set `--save-period` while running YOLOv5, but ClearML does not save the weights for each epoch I have trained. Why is this happening?
But do you still see it in the ClearML UI? Do you see the models logged in the ClearML UI?
Hmm, let me see if you can somehow "signal" to the subprocess that it should not use the main process Task. (btw: are you forking or spawning a subprocess?)
What's the error/reply?
he said it was something in the nginx config though
That makes sense.
EnviousStarfish54, a fix is already available in the latest RC.
Could you verify it solves your issue as well?
`pip install trains==0.16.2rc0`
Hi @SuccessfulPuppy43
How to make remote ClearML agent do `pip install -e .`?
In theory there is no need to do that: clearml-agent adds the repository root folder to the Python path.
If you insist on actually installing it, try adding a requirements.txt-compatible line to your "installed packages" section:
`-e .`
Just one more question: do you have any idea how I could change the x-axis label from "Iterations" to "Epochs"?
You mean in the UI (i.e. just the title)? Or are you actually reporting iterations instead of epochs? And if so, is this auto-connected to TensorBoard or reported manually?
but I'd prefer to have a new instance deployed for each new experiment, and for it to terminate when no new experiments are queued
I'm not objecting, just wondering about the rationale behind the decision.
Back to the AWS autoscaler:
Basically, if you have the services-agent running on your cluster, it will just run the aws-autoscaler for you.
The idea of the services-agent is to run logic/monitoring Tasks such as the AWS autoscaler. Notice that services mode means multiple jobs per...
Happy new year @ExuberantLion50
- Is this the right place to mention such bugs?
Definitely the right place to discuss them. Usually, if verified, we ask to also open an issue on GitHub for easier traceability/visibility.
... (i.e. there are two plots shown side by side, but they're actually both just the first experiment that was selected). This is happening across all experiments, all my workspaces, and all the browsers I've tried.
Can you share a screenshot? Is this r...
But from the log it seems that:
- you are not running as root inside the Docker container?
- Python 3.8 is installed (and not Python 3.6 as before)
Hi ReassuredTiger98
When ClearML is running inside the Docker container, the installed packages shown in the Web UI get updated.
Yes, this is by design, so the agent can always reproduce the exact Python environment.
(Internally the original requirements are also stored, but they are not available in the UI.)
What exactly is the use case here? Wouldn't it make sense to reproduce the entire working environment when you clone the executed Task?
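To illustrate what "the exact Python environment" means in practice, here is a standard-library sketch that records a pinned `name==version` line for every installed distribution. This is only an illustration of the idea; it is not how ClearML or clearml-agent collect packages internally.

```python
from importlib import metadata
from typing import List


def frozen_requirements() -> List[str]:
    """Return sorted 'name==version' lines for this interpreter's environment."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken/partial installs that have no name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)
```

Pinning every version, rather than keeping the loose original requirements, is what lets a cloned Task be rebuilt with the same packages it actually ran with.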
Yes, that seems to be the case. That said, they should have different worker IDs (agent-0 and agent-1)...
What's your trains-agent version?
` Collecting inplace-abn==1.0.12
Downloading inplace-abn-1.0.12.tar.gz (137 kB)
ERROR: Command errored out with exit status 1:
command: /home/ubuntu/.clearml/venvs-builds/3.8/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xf3qf6et/inplace-abn_15b6998cb4af4199a7692be5d3a3538f/setup.py'"'"'; file='"'"'/tmp/pip-install-xf3qf6et/inplace-abn_15b6998cb4af4199a7692be5d3a3538f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f...
This should work:
from clearml import Task
task = Task.init(project_name="examples", task_name="shap example")
import xgboost
import shap
# train an XGBoost model
X, y = shap.datasets.california()
model = xgboost.XGBRegressor().fit(X, y)
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.Explainer(model)
shap_values = explainer(X)
# visualize the first prediction's explanation
shap.plots...
Yes, actually that might be it. Here is how it works:
It launches a background thread that analyzes the repository and extracts all the packages.
If the process ends (for any reason), it gives the background thread 10 seconds to finish and then gives up. If the repository is big, the analysis can take longer than that, and it will quit before completing.
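The mechanism described above can be sketched with the standard library: start the analysis in a daemon thread, and when the main work is done wait at most a grace period before giving up. This is a simplified model, not ClearML's code; `analyze` is a hypothetical callable that writes its findings into a shared dict, and the grace period is a parameter (10 seconds in the behavior described above).

```python
import threading
from typing import Callable, Dict


def run_analysis_with_grace_period(
    analyze: Callable[[Dict[str, object]], None],
    grace_sec: float,
) -> Dict[str, object]:
    """Start `analyze` in a background thread; when the caller is done,
    wait at most `grace_sec` for it, then give up.

    Returns whatever the analysis managed to write into the result dict.
    """
    result: Dict[str, object] = {}
    # daemon=True: an unfinished analysis does not keep the process alive
    worker = threading.Thread(target=analyze, args=(result,), daemon=True)
    worker.start()
    # ... the main script would run here ...
    worker.join(timeout=grace_sec)  # wait at most grace_sec seconds
    result["completed"] = not worker.is_alive()
    return result
```

A short script on a large repository is the worst case: the main work ends almost immediately, so the analysis only ever gets the grace period.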
When we enqueue the task using the web UI, we get the above error.
ShallowGoldfish8, I think I understand the issue.
Basically, I think the issue is: `task.connect(model_params, 'model_params')`
since this is a nested dict: `model_params = { "loss_function": "Logloss", "eval_metric": "AUC", "class_weights": {0: 1, 1: 60}, "learning_rate": 0.1 }`
The class_weights keys are stored as strings, but CatBoost expects int keys, hence it fails.
One op...
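One possible workaround, sketched below and not necessarily the option referred to above: after `task.connect`, cast the nested keys back to `int` before handing the parameters to CatBoost. The helper name `restore_int_keys` is hypothetical. If the values also come back as strings, cast them the same way.

```python
from typing import Any, Dict


def restore_int_keys(params: Dict[str, Any], nested_key: str) -> Dict[str, Any]:
    """Return a copy of `params` where params[nested_key] has int keys again.

    After a round-trip through the backend, {0: 1, 1: 60} may come back as
    {"0": 1, "1": 60}; CatBoost's class_weights expects int keys.
    """
    fixed = dict(params)  # shallow copy: the caller's dict is left untouched
    fixed[nested_key] = {int(k): v for k, v in params[nested_key].items()}
    return fixed
```

Usage, right after connecting: `model_params = restore_int_keys(model_params, "class_weights")`.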
I saw the documentation, but I can't build the proper dict object for the hyperparams.
I see, this is what you are after (I think)
https://github.com/allegroai/clearml/blob/fb644fe9ec6be36b8f2f70a34256fbdc593d663a/clearml/backend_api/services/v2_20/tasks.py#L3138
You could also use:
https://github.com/allegroai/clearml/blob/ce7e77a00e869a2690f31cbc578636ce88bc4613/docs/clearml.conf#L188
and set up the clearml.conf on the user's machine to automatically log the environment variables at runtime (stored under the Configuration tab).
Then the agent will pull these same variables at execution time and set them.
Hi ThickDove42,
Yes, but by the time you are able to access it, it will be in display form (Plotly), which is not very convenient.
If this is something you need to reuse, I would argue that it is an artifact and should be stored as an artifact (then accessing it is transparent). Obviously you can both report it as a table and upload it as an artifact; no harm in that.
What do you think?
Is there a way to do this all elegantly?
Oh yes, there is. This is how the TaskB code would look:
` from clearml import Task
import torch

task = Task.init(..., 'task b')
param = {'TaskA': 'TaskAs ID HERE'}
task.connect(param)  # TaskA's ID becomes an editable parameter of TaskB
taska_model = Task.get_task(task_id=param['TaskA']).models['output'][-1]
model_a = torch.load(taska_model.get_local_copy())
# train model B here (model_b is a placeholder for the trained model)
torch.save(model_b, 'modelb') `
I might have missed something there, but generally speaking this will let you:
- Select TaskA as a parameter of the TaskB training process
- Will register automagically Tasks'A...
So it makes sense that it installs v8.0.1
(maybe originally you provided no version and it installed the latest one).
This is basically pip doing the package version resolution.
Hi PerplexedCow66
I'm assuming an extension for this:
https://github.com/allegroai/clearml-serving/issues/32
Basically, JWT can be used as a general allow/block for all endpoints, which is most efficiently handled by the k8s load balancer (nginx/envoy),
but if you want a per-endpoint check (or maybe to do something based on the JWT values),
see adding JWT to FastAPI here:
https://fastapi.tiangolo.com/tutorial/security/oauth2-jwt/?h=jwt#oauth2-with-password-and-hashing-bearer-with-jwt-tokens
T...
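For the per-endpoint case, the core check is: verify the token's signature, then decide based on its claims. Below is a standard-library sketch of HS256 JWT verification plus a hypothetical rule that the token carries an `endpoints` claim listing what it may call. The claim name, helper names, and secret handling are assumptions for illustration; in a real FastAPI service you would use a maintained JWT library (such as PyJWT) instead of hand-rolled code.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url_decode(part: str) -> bytes:
    # JWT segments are unpadded base64url; restore padding before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def encode_hs256(claims: Dict[str, Any], secret: bytes) -> str:
    """Create an HS256 JWT (used here only to produce test tokens)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Dict[str, Any]:
    """Verify an HS256 JWT and return its claims, or raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported alg")  # never accept "none" or other algs
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    exp = claims.get("exp")
    current = now if now is not None else time.time()
    if exp is not None and current >= exp:
        raise ValueError("token expired")
    return claims


def allowed_for_endpoint(claims: Dict[str, Any], endpoint: str) -> bool:
    """Hypothetical per-endpoint rule: the token lists the endpoints it may call."""
    return endpoint in claims.get("endpoints", [])
```

A global allow/block only needs `verify_hs256`; the per-endpoint variant adds `allowed_for_endpoint`, which is the part a load balancer cannot easily do for you.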
but we run everything in docker containers. Will it still help?
As long as you are running with clearml-agent (in docker mode), all the cache folders (this one included) are mounted on the host machine for persistency.
Does this mean the model weights are stored on the clearml-server file system?
By default they are only logged (i.e. the local path is stored, but the file is not uploaded). If you want to automatically store the model, pass `output_uri=True` to `Task.init`, or any object store / shared folder (e.g. `output_uri='s3://bucket/folder'`). ClearML will automatically create a subfolder for the Task and upload all models/artifacts to it.
` task = Task.init(project_name='ex...