We are running workers as bare metal and clearml-server on Kubernetes. I was trying to find, what are those min and max value for above metrics.
What do you mean by how much is reserved ? Are you running with an agent?
Yeah exactly. Scalar tab have those but I need to add track in the alert if GPU utilization/gpu memory not in use and experiment in progress then alert. Can I get gpu usage over time frame via API also?
. Can I get gpu usage over time frame via API also?
task.get_reported_scalars
But this will get you All the scalars, I think the next version of the server supports asking a specific one as well.
How are you implementing the alert monitoring?
Is is a stateless process starting every X min, or is it a state-full process running and monitoring ?
Hi DrabCockroach54
Do we know if gpu_0_mem_usage and gpu_0_mem_used_gb, both shows current GPU usage?
the first is percentage used (memory % used at any specific moment) and the second is memory used GiB , both for the video memory
How to know from this how much GPU is reserved for the task if this task is in progress?
What do you mean by how much is reserved ? Are you running with an agent?
Is gpu_0_utilization also in % then?
Correct 🙂
I was trying to find, what are those min and max value for above metrics.
Oh that makes sense, notice that you can get the values over time, so you can track the usage over the experiment lifetime (you can of course see it in the Scalar tab of the experiment)
Thanks for the reply. If gpu_0_mem_usage is % of GPU memory in use, what is gpu_0_utilization ?
Is gpu_0_utilization also in % then?