Answered

Hi! I'm currently considering switching to ClearML. In my current trials I am using up the API calls very quickly though. Is there some way to limit that? The documentation is a bit sparse on what uses how many API calls. Is it possible to batch them, for example? I wouldn't mind if the updates came in less frequently.

  
  
Posted one year ago

Answers 26


AgitatedDove14 yes I'll do that, but since the workers run in Docker containers it will take a couple of minutes to set up the config file within the container, and I have to run now. I'll report back next week.

  
  
Posted one year ago

Thanks for the response AgitatedDove14 🙂

I mean to reduce the API calls without reducing the scalars that are logged, e.g. by sending less frequent batched updates.

Yes, I am trying the free tier currently, but I imagine the problem would be the same with the paid tier, since the 100k API calls can be used up quite fast with a few simultaneous experiments.

  
  
Posted one year ago

Sure thing, thanks FlutteringWorm14 !

  
  
Posted one year ago

FlutteringWorm14 we do batch the reported scalars. The flow is like this: the task object will create a Reporter object which will spawn a daemon in another child process that batches multiple report events. The batching is done after a certain time in the child process, or the parent process can force the batching after a certain number of report events are queued.
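As an illustration of that flow (not ClearML's actual internals; the class and attribute names here are made up), the time-or-threshold batching pattern looks roughly like this:

```python
import time

class BatchReporter:
    """Toy sketch: queue report events, flush one batch (one "API call")
    when either the count threshold or the time threshold is reached."""

    def __init__(self, flush_threshold=100, flush_frequency=5.0):
        self.flush_threshold = flush_threshold    # force flush after N queued events
        self.flush_frequency = flush_frequency    # or after this many seconds
        self._queue = []
        self._last_flush = time.monotonic()
        self.flushed_batches = []                 # stands in for sent API calls

    def report(self, event):
        self._queue.append(event)
        if (len(self._queue) >= self.flush_threshold
                or time.monotonic() - self._last_flush >= self.flush_frequency):
            self.flush()

    def flush(self):
        if self._queue:
            self.flushed_batches.append(list(self._queue))  # one batched call
            self._queue.clear()
        self._last_flush = time.monotonic()

r = BatchReporter(flush_threshold=3, flush_frequency=60.0)
for i in range(7):
    r.report(i)
r.flush()  # final flush for the leftover events
print(len(r.flushed_batches))  # 3 batches: [0,1,2], [3,4,5], [6]
```

With a larger threshold or period, more events share one call, which is exactly why raising these values reduces the API-call count.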
You could try this hack to achieve what you want:
```
from clearml import Task
from clearml.backend_interface.metrics.reporter import Reporter

# Flush only every 600 seconds, and batch up to 100 report events per flush
Reporter._flush_frequency = property(lambda self: 600, lambda self, other: None)
task = Task.init(task_name="task_name", project_name="project_name")
task._reporter._report_service._flush_threshold = 100
```
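For reference, the class-level `property` trick used in that hack works like this in plain Python (toy class, not ClearML code): reads return the fixed value, and writes from the library's own code are silently ignored.

```python
class Reporter:  # toy stand-in, not the real clearml Reporter
    def __init__(self):
        # the constructor tries to set its own default...
        self._flush_frequency = 5

# ...but a class-level property intercepts both reads and writes:
# the getter always returns 600, the setter discards the value
Reporter._flush_frequency = property(lambda self: 600, lambda self, other: None)

r = Reporter()
print(r._flush_frequency)  # 600: the write in __init__ was ignored
```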

  
  
Posted one year ago

Hi FlutteringWorm14

Is there some way to limit that?

What do you mean by that? Are you referring to the Free tier?

  
  
Posted one year ago

AgitatedDove14 I have tried to configure restart_period_sec in clearml.conf and I get the same result. The configuration does not seem to have any effect, scalars appear in the web UI in close to real time.

  
  
Posted one year ago

hardware monitoring etc.

This is averaged and being sent only every 30 seconds, not a lot of calls.

I just saw that I went through the first 200k API calls rather fast, so that is how I rationalized it.

Yes, that kind of makes sense.

Once every 2000 steps, which is every few seconds. So in theory those ~20 scalars should be batched since they are reported more or less at the same time. It's a bit odd that the API calls added up so quickly anyway.

The default flush is every 2 seconds, so effectively "real time", but the assumption is that most of the time there is nothing to be seen.

I'll try to decrease the flush frequency (once a minute or even every few minutes is plenty for my use case) and see if it reduces the API calls. Thank you for your help!

Sure thing. Please let me know if it helps.

Is there some way to configure this without using the CLI to generate a client config? I'm currently using the environment-variables based setup to avoid leaving state on the client.

I think that's due to the fact that the actual data is being sent in a background process (not thread) once the Task is created, so these have a smaller effect (we should somehow fix that, but currently there is no way to do it).
You can hack it though:
```
from clearml.backend_interface.task.development.worker import DevWorker

DevWorker.report_period_sec = 600
```
Let me know if it has any effect.

  
  
Posted one year ago

Great, thanks 🙂 So for now the reporting is not batched at all, i.e. each reported scalar is one API call?

  
  
Posted one year ago

If I log 20 scalars every 2000 training steps and train for 1 million steps (which is not that big an experiment), that's already 10k API calls...

They are batched together, so at least in theory you should not get to 10K so fast. But a very good point.

Oh nice! Is that for all logged values? How will that count against the API call budget?

Basically this is the "auto flush": it will flush (and batch) all the logs in a 30-second period, and yes, this is for all the logs (scalars and console).

How often do you report scalars ?
Could it be they are Not being batched for some reason?

  
  
Posted one year ago

FlutteringWorm14 an RC is out (1.7.3dc1) with the ability to configure this from clearml.conf.
You can now set sdk.development.worker.report_event_flush_threshold in clearml.conf.
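In the nested style clearml.conf uses, that setting would look something like this (the threshold value here is just an example to tune):

```
sdk {
  development {
    worker {
      # flush a batch once this many report events are queued (example value)
      report_event_flush_threshold: 100
    }
  }
}
```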

  
  
Posted one year ago

Let me know if it has any effect

Unfortunately not. I set DevWorker.report_period_sec to 600 before creating the task. The scalars still show up in the web UI more or less in real time.

  
  
Posted one year ago

FlutteringWorm14 Can you verify that even with the clearml.conf it has no effect?

  
  
Posted one year ago

Is there some way to configure this without using the CLI to generate a client config? I'm currently using the environment-variables based setup to avoid leaving state on the client.

I tried to run clearml_task.get_logger().set_flush_period(600) after initializing the task, but that doesn't seem to have the desired effect (scalars are updated much more frequently than every 10 minutes).

  
  
Posted one year ago

Thanks FlutteringWorm14 , checking 🙂

  
  
Posted one year ago

Hi FlutteringWorm14 ! Looks like we indeed don't wait for report_period_sec when reporting data. We will fix this in a future release. Thank you!

  
  
Posted one year ago

Why would that happen?

I work in a reinforcement learning context using the stable-baselines3 library. If I log 20 scalars every 2000 training steps and train for 1 million steps (which is not that big an experiment), that's already 10k API calls. If I run 10 of these experiments simultaneously (which is also not that many), that's already 100k API calls based on the explicitly logged scalars alone. Implicitly logged things (hardware temperature, captured streams) may come on top of that.
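The back-of-the-envelope arithmetic above, assuming one API call per reported scalar (i.e. no batching at all):

```python
steps = 1_000_000
report_every = 2_000      # scalars reported every 2000 training steps
scalars_per_report = 20
runs = 10                 # simultaneous experiments

calls_per_run = steps // report_every * scalars_per_report
total_calls = runs * calls_per_run
print(calls_per_run, total_calls)  # 10000 100000
```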

The logging is already batched (meaning one API call for a bunch of stuff).
Could it be lots of console lines?

That's good to know. I don't think it's console lines alone, as described above.

BTW you can set the flush period to 30 sec, which would automatically collect and batch API calls.

Oh nice! Is that for all logged values? How will that count against the API call budget?

  
  
Posted one year ago

Even monkey-patching the config mechanism (and verifying that this worked by printing the default of DevWorker.report_period) leads to the same result. Either the other process has already started at that point for some reason, or the buffering is not working as expected. I'll try to work with the config file, but I have to call it a day now, so unfortunately I won't get to it this week. Thank you for your help so far!

  
  
Posted one year ago

Ah, I think it should be DevWorker.report_period (without the sec) according to the class definition.

  
  
Posted one year ago

AgitatedDove14 yes (+sdk): sdk.development.worker.report_period_sec

  
  
Posted one year ago

They are batched together, so at least in theory you should not get to 10K so fast. But a very good point.

That's only a back of the napkin calculation, in the actual experiments I mostly had stream logging, hardware monitoring etc. enabled as well so maybe that limited the effectiveness of the batching. I just saw that I went through the first 200k API calls rather fast, so that is how I rationalized it.

Basically this is the "auto flush": it will flush (and batch) all the logs in a 30-second period, and yes, this is for all the logs (scalars and console).

Perfect, sounds like that is exactly what I'm looking for 🙂

How often do you report scalars ?
Could it be they are Not being batched for some reason?

Once every 2000 steps, which is every few seconds. So in theory those ~20 scalars should be batched since they are reported more or less at the same time. It's a bit odd that the API calls added up so quickly anyway.

I'll try to decrease the flush frequency (once a minute or even every few minutes is plenty for my use case) and see if it reduces the API calls. Thank you for your help!

  
  
Posted one year ago

The snippet I used for monkey patching:

```
from clearml.config import ConfigSDKWrapper

old_get = ConfigSDKWrapper.get

def new_get(key, *args):
    if key == "development.worker.report_period_sec":
        return 600.0
    return old_get(key, *args)

ConfigSDKWrapper.get = new_get
```

  
  
Posted one year ago

Thanks SmugDolphin23 , that workaround does seem to do the trick 🙂

  
  
Posted one year ago

Unfortunately that doesn't seem to have an effect either.

  
  
Posted one year ago

restart_period_sec

I'm assuming development.worker.report_period_sec, correct?

The configuration does not seem to have any effect, scalars appear in the web UI in close to real time.

Let me see if we can reproduce this behavior and quickly fix it.

  
  
Posted one year ago

I mean to reduce the API calls without reducing the scalars that are logged, e.g. by sending less frequent batched updates.

Understood.

In my current trials I am using up the API calls very quickly though.

Why would that happen?
The logging is already batched (meaning 1API for a bunch of stuff)
Could it be lots of console lines?

BTW you can set the flush period to 30 sec, which would automatically collect and batch API calls:
https://github.com/allegroai/clearml/blob/25df5efe74972624671df2ae97a3c629eb0c5322/docs/clearml.conf#L196

  
  
Posted one year ago

Great, thank you!

  
  
Posted one year ago
690 Views
26 Answers