AgitatedDove14 yes I'll do that, but since the workers run in docker containers it will take a couple of minutes to set the config file up within the container and I have to run now. I'll report back next week
Thanks for the response AgitatedDove14 🙂
I mean to reduce the API calls without reducing the scalars that are logged, e.g. by sending less frequent batched updates.
Yes, I am currently trying the free tier, but I imagine the problem would be the same with the paid tier, since the 100k API calls can be used up quite fast with a few simultaneous experiments.
Let me know if it has any effect
Unfortunately not. I set DevWorker.report_period_sec to 600 before creating the task. The scalars still show up in the web UI more or less in real time.
Why would that happen?
I work in a reinforcement learning context using the stable-baselines3 library. If I log 20 scalars every 2000 training steps and train for 1 million steps (which is not that big an experiment), that's already 10k API calls. If I run 10 of these experiments simultaneously (which is also not that many), that's already 100k API calls based on the explicitly logged scalars. Implicitly logged things (hardware temperature, captured streams) may come on top of that.
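The back-of-the-napkin estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the helper name `estimate_api_calls` is hypothetical, and it assumes the worst case in which every reported scalar costs one API call (no batching at all).

```python
def estimate_api_calls(total_steps, log_every, scalars_per_log, experiments=1):
    """Worst-case estimate: one API call per reported scalar (no batching)."""
    log_events = total_steps // log_every  # how many times scalars are logged
    return log_events * scalars_per_log * experiments


# Numbers taken from the message above
single = estimate_api_calls(1_000_000, 2_000, 20)
ten = estimate_api_calls(1_000_000, 2_000, 20, experiments=10)
print(single, ten)
```

If the ~20 scalars of each logging event were sent as a single batched request instead, the same run would need roughly 500 calls per experiment, which is why the batching behavior matters so much here.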
The logging is already batched (meaning one API call for a bunch of stuff)
Could it be lots of console lines?
That's good to know. I don't think it's console lines alone, as described above.
BTW you can set the flush period to 30 sec, which would automatically collect and batch API calls
Oh nice! Is that for all logged values? How will that count against the API call budget?
If I log 20 scalars every 2000 training steps and train for 1 million steps (which is not that big an experiment), that's already 10k API calls...
They are batched together, so at least in theory, if this is fast, you should not get to 10K so quickly. But a very good point.
Oh nice! Is that for all logged values? How will that count against the API call budget?
Basically this is the "auto flush": it will flush (and batch) all the logs in a 30 sec period, and yes, this is for all the logs (scalars and console)
How often do you report scalars?
Could it be they are not being batched for some reason?
AgitatedDove14 I have tried to configure restart_period_sec in clearml.conf and I get the same result. The configuration does not seem to have any effect, scalars appear in the web UI in close to real time.
Thanks SmugDolphin23 , that workaround does seem to do the trick 🙂
AgitatedDove14 yes (+sdk): sdk.development.worker.report_period_sec
Ah, I think it should be DevWorker.report_period (without the sec) according to the class definition
FlutteringWorm14 we do batch the reported scalars. The flow is like this: the task object will create a Reporter object which will spawn a daemon in another child process that batches multiple report events. The batching is done after a certain time in the child process, or the parent process can force the batching after a certain number of report events are queued.
You could try this hack to achieve what you want:
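`from clearml import Task
from clearml.backend_interface.metrics.reporter import Reporter
# Pin the flush frequency to 600; the no-op setter makes later assignments have no effect
Reporter._flush_frequency = property(lambda self: 600, lambda self, other: None)
task = Task.init(task_name="task_name", project_name="project_name")
# Number of queued report events after which a flush is forced
task._reporter._report_service._flush_threshold = 100`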
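The flow described in this message (flush after a certain time, or once a certain number of report events are queued) can be illustrated with a toy, single-process sketch. Note that `BatchingReporter` and all of its parameters are hypothetical names made up for illustration; this is not ClearML's actual `Reporter`, which per the message above does its batching in a separate child process.

```python
import time


class BatchingReporter:
    """Toy stand-in for a batching reporter; NOT ClearML's real implementation."""

    def __init__(self, send, flush_period_sec=30.0, flush_threshold=100,
                 clock=time.monotonic):
        self._send = send                      # callable that receives one batch (one "API call")
        self._flush_period_sec = flush_period_sec
        self._flush_threshold = flush_threshold
        self._clock = clock
        self._queue = []
        self._last_flush = clock()

    def report(self, event):
        """Queue an event and flush if the batch is large enough or old enough."""
        self._queue.append(event)
        too_many = len(self._queue) >= self._flush_threshold
        too_old = self._clock() - self._last_flush >= self._flush_period_sec
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Send everything queued so far as a single batch."""
        if self._queue:
            self._send(list(self._queue))
            self._queue.clear()
        self._last_flush = self._clock()


# Each element appended to `calls` represents one batched request
calls = []
reporter = BatchingReporter(calls.append, flush_period_sec=600, flush_threshold=100)
for step in range(250):
    reporter.report(("reward", step, 0.0))
reporter.flush()  # final flush, e.g. at the end of the experiment
print(len(calls), [len(batch) for batch in calls])
```

With a long flush period, the number of requests in this sketch is governed by the threshold alone, which is the effect the hack above is trying to achieve.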
I mean to reduce the API calls without reducing the scalars that are logged, e.g. by sending less frequent batched updates.
Understood,
In my current trials I am using up the API calls very quickly though.
Why would that happen?
The logging is already batched (meaning one API call for a bunch of stuff)
Could it be lots of console lines?
BTW you can set the flush period to 30 sec, which would automatically collect and batch API calls
https://github.com/allegroai/clearml/blob/25df5efe74972624671df2ae97a3c629eb0c5322/docs/clearml.conf#L196
hardware monitoring etc.
This is averaged and being sent only every 30 seconds, not a lot of calls.
I just saw that I went through the first 200k API calls rather fast, so that is how I rationalized it.
Yes, that kind of makes sense
Once every 2000 steps, which is every few seconds. So in theory those ~20 scalars should be batched since they are reported more or less at the same time. It's a bit odd that the API calls added up so quickly anyway.
The default flush is every 2 seconds, so "real time", but the assumption is that most of the time there is nothing to be seen.
I'll try to decrease the flush frequency (once a minute or even every few minutes is plenty for my use case) and see if it reduces the API calls. Thank you for your help!
Sure thing. Please let me know if it helps.
Is there some way to configure this without using the CLI to generate a client config? I'm currently using the environment-variables based setup to avoid leaving state on the client.
I think that's due to the fact that the actual data is being sent in a background process (not a thread) once the Task is created, so these have a smaller effect (we should somehow fix that, but currently there is no way to do that)
You can hack it though:
`from clearml.backend_interface.task.development.worker import DevWorker
DevWorker.report_period_sec = 600`
Let me know if it has any effect
FlutteringWorm14 an RC is out (1.7.3dc1) with the ability to configure from clearml.conf
you can now set sdk.development.worker.report_event_flush_threshold from clearml.conf
They are batched together, so at least in theory, if this is fast, you should not get to 10K so quickly. But a very good point.
That's only a back-of-the-napkin calculation. In the actual experiments I mostly had stream logging, hardware monitoring etc. enabled as well, so maybe that limited the effectiveness of the batching. I just saw that I went through the first 200k API calls rather fast, so that is how I rationalized it.
Basically this is the "auto flush": it will flush (and batch) all the logs in a 30 sec period, and yes, this is for all the logs (scalars and console)
Perfect, sounds like that is exactly what I'm looking for 🙂
How often do you report scalars?
Could it be they are not being batched for some reason?
Once every 2000 steps, which is every few seconds. So in theory those ~20 scalars should be batched since they are reported more or less at the same time. It's a bit odd that the API calls added up so quickly anyway.
I'll try to decrease the flush frequency (once a minute or even every few minutes is plenty for my use case) and see if it reduces the API calls. Thank you for your help!
Is there some way to configure this without using the CLI to generate a client config? I'm currently using the environment-variables based setup to avoid leaving state on the client.
I tried to run clearml_task.get_logger().set_flush_period(600) after initializing the task, but that doesn't seem to have the desired effect (scalars are updated much more frequently than every 10 minutes).
Hi FlutteringWorm14! Looks like we indeed don't wait for report_period_sec when reporting data. We will fix this in a future release. Thank you!
FlutteringWorm14 Can you verify that even with the clearml.conf it has no effect?
Even monkey-patching the config mechanism (and verifying that this worked by printing the default of DevWorker.report_period) leads to the same result. Either the other process has already started at that point for some reason or the buffering is not working as expected. I'll try to work with the config file, but I have to call it a day now so unfortunately I won't get to it this week. Thank you for your help so far!
Great, thanks 🙂 So for now the reporting is not batched at all, i.e. each reported scalar is one API call?
Hi FlutteringWorm14
Is there some way to limit that?
What do you mean by that? Are you referring to the free tier?
Unfortunately that doesn't seem to have an effect either though
restart_period_sec
I'm assuming development.worker.report_period_sec, correct?
The configuration does not seem to have any effect, scalars appear in the web UI in close to real time.
Let me see if we can reproduce this behavior and quickly fix it
The snippet I used for monkey patching:
from clearml.config import ConfigSDKWrapper

old_get = ConfigSDKWrapper.get

def new_get(key, *args):
    if key == "development.worker.report_period_sec":
        return 600.0
    return old_get(key, *args)

ConfigSDKWrapper.get = new_get