Unfortunately that doesn't seem to have an effect either, though
Ah, I think it should be DevWorker.report_period (without the _sec) according to the class definition
Why would that happen?
I work in a reinforcement learning context using the stable-baselines3 library. If I log 20 scalars every 2000 training steps and train for 1 million steps (which is not that big an experiment), that's already 10k API calls. If I run 10 of these experiments simultaneously (which is also not that many), that's already 100k API calls based on the explicitly logged scalars. Implicitly logged things (hardware temperature, captured streams) may come on top of that.
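To spell the arithmetic out (a quick sketch, assuming one API call per reported scalar, i.e. no batching):
```python
# back-of-the-envelope estimate, assuming one API call per reported scalar
total_steps = 1_000_000
log_every = 2_000
scalars_per_log = 20
parallel_runs = 10

calls_per_run = (total_steps // log_every) * scalars_per_log  # 500 * 20 = 10,000
total_calls = calls_per_run * parallel_runs                   # 100,000
print(calls_per_run, total_calls)
```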
T...
Thanks SmugDolphin23 , that workaround does seem to do the trick 🙂
AgitatedDove14 yes (+sdk): sdk.development.worker.report_period_sec
The snippet I used for monkey patching:
```python
from clearml.config import ConfigSDKWrapper

old_get = ConfigSDKWrapper.get

def new_get(key, *args):
    if key == "development.worker.report_period_sec":
        return 600.0
    return old_get(key, *args)

ConfigSDKWrapper.get = new_get
```
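(The patch has to be in place before Task.init is called, since the reporting settings are read when the task starts — at least that was my working assumption.)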
Great, thanks 🙂 So for now the reporting is not batched at all, i.e. each reported scalar is one API call?
Let me know if it has any effect
Unfortunately not. I set DevWorker.report_period_sec to 600 before creating the task. The scalars still show up in the web UI more or less in real time.
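Roughly like this (the DevWorker import path is from memory and may differ between clearml versions; project/task names are placeholders):
```python
from clearml import Task
# assumed import path; may differ between clearml versions
from clearml.backend_interface.task.development.worker import DevWorker

DevWorker.report_period_sec = 600  # intended: batch reports for 10 minutes
task = Task.init(project_name="rl-experiments", task_name="report-period-test")  # placeholder names
```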
Thanks for the response AgitatedDove14 🙂
I mean to reduce the API calls without reducing the scalars that are logged, e.g. by sending less frequent batched updates.
Yes, I am trying the free tier currently, but I imagine the problem would be the same with the paid tier, since the 100k API calls can be used up quite fast with a few simultaneous experiments.
AgitatedDove14 I have tried to configure report_period_sec in clearml.conf and I get the same result. The configuration does not seem to have any effect; scalars appear in the web UI in close to real time.
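For reference, the section I added (a sketch based on the sdk.development.worker.report_period_sec key from above; clearml.conf uses HOCON syntax):
```
sdk {
    development {
        worker {
            report_period_sec: 600
        }
    }
}
```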
AgitatedDove14 yes, I'll do that, but since the workers run in Docker containers it will take a couple of minutes to set the config file up within the container, and I have to run now. I'll report back next week.
Is there some way to configure this without using the CLI to generate a client config? I'm currently using the environment-variables based setup to avoid leaving state on the client.
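For context, my environment-variable setup only covers the connection settings, roughly like this (standard ClearML variables, credentials elided):
```bash
export CLEARML_WEB_HOST="https://app.clear.ml"
export CLEARML_API_HOST="https://api.clear.ml"
export CLEARML_FILES_HOST="https://files.clear.ml"
export CLEARML_API_ACCESS_KEY="..."
export CLEARML_API_SECRET_KEY="..."
```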
I tried to run clearml_task.get_logger().set_flush_period(600) after initializing the task, but that doesn't seem to have the desired effect (scalars are updated much more frequently than every 10 minutes).
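In full, the attempt looked something like this (project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="rl-experiments", task_name="flush-test")  # placeholder names
logger = task.get_logger()
logger.set_flush_period(600)  # request a 10-minute flush period

# scalars reported afterwards still showed up in the web UI almost immediately
logger.report_scalar(title="reward", series="train", value=0.0, iteration=0)
```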
Even monkey-patching the config mechanism (and verifying that this worked by printing the default of DevWorker.report_period) leads to the same result. Either the other process has already started at that point for some reason, or the buffering is not working as expected. I'll try to work with the config file, but I have to call it a day now, so unfortunately I won't get to it this week. Thank you for your help so far!
Do you know when the next update of the usage metrics is scheduled? Do I have to wait until tomorrow before I can use clearml again?
Great SuccessfulKoala55 🙂 Do you have any ideas on things I could try to work around the issue / further clarify it?
They are batched together, so at least in theory, if this is fast, you should not get to 10K so fast. But a very good point
That's only a back-of-the-napkin calculation; in the actual experiments I mostly had stream logging, hardware monitoring etc. enabled as well, so maybe that limited the effectiveness of the batching. I just saw that I went through the first 200k API calls rather fast, so that is how I rationalized it.
Basically this is the "auto flush": it will flush (and batch) al...
The server is the public one hosted at http://app.clear.ml. The client is at version 1.7.2.
AgitatedDove14 the cleanup_service.py script in the repository, which contains the snippet I posted: https://github.com/allegroai/clearml/blob/ff7b174bf162347b82226f413040ff6473401e92/examples/services/cleanup/cleanup_service.py#L82
AgitatedDove14 CostlyOstrich36 Sorry for pinging again, but is there anything I can do to delete those tasks?
AgitatedDove14 those are all tasks for which I have accidentally logged a large number of histograms, on the order of gigabytes. It consistently fails when I try to delete the same task.
How can I delete them manually? Is that possible in the UI?
This works, SuccessfulKoala55! It's very slow though; it's probably downloading the data before deleting it. But that's okay, at least it works. Thanks a lot 🙂
Some run hashes are in the logs I posted; if you have the permissions to access these, feel free to take a look.
SuccessfulKoala55 yes that gives some more information:
```
Deleting 11 tasks
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/reward-learner-ASq25l3C-py3.10/lib/python3.10/site-packages/clearml/backend_interface/task/task.py", line 711, in _delete
    [x for x in filter(None, self._get_image_plot_uris()) if not callback or callback("image_plot", x)]
  File "/root/.cache/pypoetry/virtualenvs/reward-learner-ASq25l3C-py3.10/lib/python3.10/site-packages/clearml/backen...
```
AgitatedDove14 could you maybe have a look? For some reason I am not able to delete some (particularly large) tasks using the cleanup service, i.e. API calls of the form:
```python
deleted_task = Task.get_task(task_id=task.id)
deleted_task.delete(
    delete_artifacts_and_models=True,
    skip_models_used_by_other_tasks=True,
    raise_on_error=False
)
```
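For context, the service drives this in a loop over candidate tasks, roughly like this (the filter here is a placeholder; the real script selects the tasks differently):
```python
from clearml import Task

# sketch: delete each candidate task with the call above
for task in Task.get_tasks(project_name="rl-experiments"):  # placeholder filter
    deleted_task = Task.get_task(task_id=task.id)
    deleted_task.delete(
        delete_artifacts_and_models=True,
        skip_models_used_by_other_tasks=True,
        raise_on_error=False
    )
```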