Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Actually I am as well, this is Kubernets doing the resource scheduling and actually Kubernetes decided it is okay to run two pods on the Same GPU, which is cool, but I was not aware Nvidia already added this feature (I know it was in beta for a long time)
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
I also see thety added dynamic slicing and Memory Proteciton:
Notice you can control the number of pods per GPU
This mechanism for enabling “time-sharing” of GPUs in Kubernetes allows a system administrator to define a set of “replicas” for a GPU, each of which can be handed out independently to a pod to run workloads on. Unlike MIG, there is no memory or fault-isolation between replicas, but for some workloads this is better than not being able to share at all. Internally, GPU time-slicing is used to multiplex workloads from replicas of the same underlying GPU.
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html#introduction
https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing
Lastly are you using MIG enabled devices? If so you can limit the memory per shared Pod:
https://github.com/NVIDIA/k8s-device-plugin/blob/e2b4ff39b5b4cebe702c8aa102b914b03f6eb81d/README.md#configuration-option-details
Back to the original remark SarcasticSquirrel56 limiting the Pod allocation can also be done via the general k8s "memory limit" requirement, which will take place on top of the GPU plugin, so essentially let's say we have a node with 100GB RAM we can have a pod limit of 25GB, which means no more than 4 pods will be running on the same node.
Does that make sense ?