AgitatedDove14 the cleanup_service.py script in the repository, which contains the snippet I posted: https://github.com/allegroai/clearml/blob/ff7b174bf162347b82226f413040ff6473401e92/examples/services/cleanup/cleanup_service.py#L82
AgitatedDove14 those are all tasks for which I accidentally logged a large number of histograms, on the order of gigabytes. The deletion consistently fails when I retry it on the same task.
AgitatedDove14 CostlyOstrich36 Sorry for pinging again, but is there anything I can do to delete those tasks?
AgitatedDove14 could you maybe have a look? For some reason I am not able to delete some (particularly large) tasks using the cleanup service, i.e. API calls in the form
```python
deleted_task = Task.get_task(task_id=task.id)
deleted_task.delete(
    delete_artifacts_and_models=True,
    skip_models_used_by_other_tasks=True,
    raise_on_error=False,
)
```
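For context, that snippet runs inside a loop roughly like this (a sketch; the real cleanup script filters tasks by status and age, simplified here to a project-name query):

```python
from clearml import Task

# Simplified stand-in for the cleanup service's task query.
old_tasks = Task.get_tasks(project_name="my_project")

for task in old_tasks:
    deleted_task = Task.get_task(task_id=task.id)
    deleted_task.delete(
        delete_artifacts_and_models=True,
        skip_models_used_by_other_tasks=True,
        raise_on_error=False,
    )
```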
SuccessfulKoala55 yes that gives some more information:
```
Deleting 11 tasks
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/reward-learner-ASq25l3C-py3.10/lib/python3.10/site-packages/clearml/backend_interface/task/task.py", line 711, in _delete
    [x for x in filter(None, self._get_image_plot_uris()) if not callback or callback("image_plot", x)]
  File "/root/.cache/pypoetry/virtualenvs/reward-learner-ASq25l3C-py3.10/lib/python3.10/site-packages/clearml/backen...
```
Some run hashes are in the logs I posted; if you have the permissions to access these, feel free to take a look.
They are batched together, so at least in theory, if this is fast, you should not get to 10K so fast. But a very good point.
That's only a back-of-the-napkin calculation; in the actual experiments I mostly had stream logging, hardware monitoring, etc. enabled as well, so maybe that limited the effectiveness of the batching. I just saw that I went through the first 200k API calls rather fast, so that is how I rationalized it.
Basically this is the "auto flush"; it will flush (and batch) al...
Is there some way to configure this without using the CLI to generate a client config? I'm currently using the environment-variables based setup to avoid leaving state on the client.
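For context, the environment-variable based setup I mean looks roughly like this (a sketch; the host values are the ones for the hosted server, and the credentials are placeholders):

```python
import os

# Standard CLEARML_* variables instead of a clearml.conf generated by clearml-init,
# so no config file is left on the client machine.
os.environ["CLEARML_API_HOST"] = "https://api.clear.ml"
os.environ["CLEARML_WEB_HOST"] = "https://app.clear.ml"
os.environ["CLEARML_FILES_HOST"] = "https://files.clear.ml"
os.environ["CLEARML_API_ACCESS_KEY"] = "<access_key>"
os.environ["CLEARML_API_SECRET_KEY"] = "<secret_key>"

from clearml import Task
task = Task.init(project_name="my_project", task_name="my_experiment")
```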
I tried to run clearml_task.get_logger().set_flush_period(600)
after initializing the task, but that doesn't seem to have the desired effect (scalars are updated much more frequently than every 10 minutes).
The snippet I used for monkey patching:
```python
from clearml.config import ConfigSDKWrapper

old_get = ConfigSDKWrapper.get

def new_get(key, *args):
    if key == "development.worker.report_period_sec":
        return 600.0
    return old_get(key, *args)

ConfigSDKWrapper.get = new_get
```
Thanks SmugDolphin23 , that workaround does seem to do the trick 🙂
Great, thanks 🙂 So for now the reporting is not batched at all, i.e. each reported scalar is one API call?
Thanks for the response AgitatedDove14 🙂
I mean to reduce the API calls without reducing the scalars that are logged, e.g. by sending less frequent batched updates.
Yes, I am currently trying the free tier, but I imagine the problem would be the same with the paid tier, since the 100k API calls can be used up quite fast with a few simultaneous experiments.
Let me know if it has any effect
Unfortunately not. I set DevWorker.report_period_sec
to 600 before creating the task. The scalars still show up in the web UI more or less in real time.
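For reference, what I did looks roughly like this (a sketch; I'm assuming DevWorker can be imported from clearml.backend_interface.task.development.worker, the exact path may differ between versions):

```python
from clearml import Task
from clearml.backend_interface.task.development.worker import DevWorker  # assumed import path

# Set the attribute on the class before the task (and its reporting worker) is created,
# hoping to stretch the flush interval to 10 minutes.
DevWorker.report_period_sec = 600
task = Task.init(project_name="my_project", task_name="my_experiment")
```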
The server is the public one hosted at http://app.clear.ml. The client is at version 1.7.2.
AgitatedDove14 I have tried to configure report_period_sec
in clearml.conf
and I get the same result. The configuration does not seem to have any effect; scalars appear in the web UI in close to real time.
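For reference, the relevant section of my clearml.conf looks like this (assuming report_period_sec under sdk.development.worker is the right key, which is exactly what I'm trying to confirm):

```
sdk {
    development {
        worker {
            # how often (in seconds) buffered reports should be flushed to the server
            report_period_sec: 600
        }
    }
}
```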
How can I delete them manually? Is that possible in the UI?
Even monkey-patching the config mechanism (and verifying that this worked by printing the default of DevWorker.report_period
) leads to the same result. Either the other process has already started at that point for some reason or the buffering is not working as expected. I'll try to work with the config file, but I have to call it a day now so unfortunately I won't get to it this week. Thank you for your help so far!
Unfortunately that doesn't seem to have an effect either though
Why would that happen?
I work in a reinforcement learning context using the stable-baselines3 library. If I log 20 scalars every 2000 training steps and train for 1 million steps (which is not that big an experiment), that's already 10k API calls. If I run 10 of these experiments simultaneously (which is also not that many), that's already 100k API calls based on the explicitly logged scalars alone. Implicitly logged things (hardware temperature, captured streams) may come on top of that.
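The back-of-the-envelope numbers, assuming one API call per reported scalar (the pessimistic case without batching):

```python
steps = 1_000_000       # total training steps per experiment
log_interval = 2_000    # training steps between logging events
scalars_per_log = 20    # scalars reported per logging event
experiments = 10        # simultaneous experiments

calls_per_experiment = (steps // log_interval) * scalars_per_log  # 500 * 20 = 10,000
total_calls = calls_per_experiment * experiments                  # 100,000
```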
T...
AgitatedDove14 yes (+sdk): sdk.development.worker.report_period_sec
Ah, I think it should be DevWorker.report_period
(without the sec
) according to the class definition
AgitatedDove14 yes, I'll do that, but since the workers run in Docker containers it will take a couple of minutes to set the config file up within the container, and I have to run now. I'll report back next week.
This works SuccessfulKoala55 ! It's very slow though, it's probably downloading the data before deleting it. But that's okay, at least it works. Thanks a lot 🙂
Do you know when the next update of the usage metrics is scheduled? Do I have to wait until tomorrow before I can use ClearML again?
Great SuccessfulKoala55 🙂 Do you have any ideas on things I could try to work around the issue / further clarify it?