In my last two posts, I discussed how to avoid and enforce pod scheduling on specific nodes. You can read them here:
Avoid scheduling pods on certain nodes
In this post, I will talk about how to schedule your pods near some other pod, or in other words, how to schedule a pod on a node that is already running some other pod. I will use some terminology from my earlier post about scheduling pods on certain nodes, so please make sure you have read that post first.
The idea is that when your pod is scheduled, it should land on a node that already runs some other pod (or, in the anti-affinity case, on a node that does not run that pod). First, let's talk about possible use cases where you might need this.
Use cases
- An application can have pods that talk to each other frequently, so you want them placed on the same node to avoid network latency. For example, in an e-commerce application, every time a customer places an order, the order service needs to check inventory before accepting it. Every order placement triggers an inventory-check service call, so you might want every node running an order service pod to also run an inventory service pod.
- Another example: placing a cache pod next to a web application pod for faster access to cache contents.
- Conversely, you may want to ensure that no more than one pod of a kind is scheduled on a node, so that your pods are as distributed as possible.
Pod Affinity: Schedule pods closer to already running pods
In my previous post, I talked about node affinity, which schedules pods on a specific set of nodes. In this post I will talk about another kind of affinity: pod affinity. Here is how pod affinity and anti-affinity work:
Schedule a pod (or don't schedule it, in the case of anti-affinity) on node X if a pod matching Y is already running on X. Here X is picked out by a node label key (the key matters, not its value; nodes sharing the same value for that key form one placement domain) and Y is a label on the already-running pod.
Don't panic if you could not make much sense of the above statement; we will discuss it in detail.
Like node affinity, pod affinity comes in two types:
- requiredDuringSchedulingIgnoredDuringExecution: This setting means that the specified affinity or anti-affinity rule must be met for the pod to be scheduled onto a node. If the criteria cannot be fulfilled, the pod will remain unscheduled, waiting for the conditions to be satisfied. This can be useful for critical applications that require specific placement for performance or regulatory reasons but might lead to unscheduled pods if the cluster does not have the capacity or the right node configuration.
- preferredDuringSchedulingIgnoredDuringExecution: This setting tells the scheduler that it should try to enforce the rules, but if it can't, the pod can still be scheduled onto a node that doesn't meet the criteria. This approach provides more flexibility and is generally recommended for use in larger or more dynamic environments where strict adherence to the rules may not always be possible or optimal.
The choice between required and preferred rules will depend on the specific needs of your applications and the desired balance between strict scheduling requirements and the flexibility of pod placement.
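For comparison, here is a minimal sketch of what a preferred (soft) rule looks like inside a pod spec. The weight, label key, and values below are illustrative placeholders, and kubernetes.io/hostname is the standard node label for node-level placement:
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                 # 1-100; the scheduler favors higher weights
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: name             # placeholder label key
            operator: In
            values:
            - web-cache           # placeholder label value
        topologyKey: kubernetes.io/hostname
Unlike a required rule, a pod using this spec still gets scheduled even when no node satisfies the term.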
To understand pod affinity, let's discuss a scenario from one of the use cases above: you want to place a web-app pod on the same node where a cache pod is already running (or vice versa, depending on your application).
Here is the YAML definition to achieve this:
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: name
            operator: In
            values:
            - web-cache
        topologyKey: my-node-label
  containers:
  - name: web-app
    image: <container-image>
This YAML ensures that the web-app pod is scheduled on a node that already runs a pod carrying the label name=web-cache. The topologyKey tells the scheduler which node label key defines the placement domain, so that matching pods end up together.
Let's talk about some key elements used in the above YAML:
- podAffinity: This contains the scheduling rules for the pod.
- topologyKey: This is a label key assigned to the target nodes. Remember, a label has a key and a value, and here we reference only the key. It can be any node label key (the value does not matter), whether set by you or by your cloud vendor. Nodes sharing the same value for this key form one placement domain; to co-locate pods on the very same node, the well-known kubernetes.io/hostname label is commonly used.
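If you are not sure which label keys your nodes carry, you can inspect and set them with standard kubectl commands (the node name and the my-node-label key/value below are placeholders from this post's examples):
kubectl get nodes --show-labels                          # list every node with its labels
kubectl label nodes <node-name> my-node-label=group-a    # add a custom label to a node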
With the above pod definition file, we are deploying a pod named "web-app", and we want to place it on a node that is already running a cache pod labeled name=web-cache. Under podAffinity, we are telling the scheduler to place this pod in the domain defined by the node label key "my-node-label", and only where a pod with the label name=web-cache is already running.
While finding a suitable node for this pod, the scheduler honors the above requirement; whether the pod gets scheduled at all depends on the podAffinity type used. In our example we used requiredDuringSchedulingIgnoredDuringExecution, so if no node is running a cache pod, your web pod will not be scheduled and will stay in the Pending state until a suitable node appears. It is therefore important to know how your application works and which pods are created before others, and to pick the podAffinity type accordingly.
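For the rule above to ever be satisfied, some pod must actually carry the label it selects on. Here is a minimal sketch of what the cache pod could look like; the image is a placeholder, and only the name: web-cache label matters for the affinity match:
apiVersion: v1
kind: Pod
metadata:
  name: web-cache
  labels:
    name: web-cache     # the label the web-app pod's affinity rule matches
spec:
  containers:
  - name: cache
    image: <cache-image>
Once both pods are running, kubectl get pods -o wide shows the NODE column, which lets you verify they landed together.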
Pod Anti Affinity: Avoid scheduling pods on nodes which are already running certain pods
To understand pod anti-affinity, let's discuss another scenario from the use cases above: we want to make sure that no more than one pod of a given type runs on a node. The YAML definition below achieves this.
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web-app
        topologyKey: my-node-label
  containers:
  - name: web-app
    image: <container-image>
The only new element is podAntiAffinity, which is the opposite of podAffinity: podAffinity attracts pods to nodes, while podAntiAffinity repels them.
In the above YAML definition, we are deploying a pod named "web-app", and the goal is to make sure that no more than one web-app pod runs on a node. Under podAntiAffinity, we are telling the scheduler to place this pod in the domain defined by the node label key "my-node-label" only if a pod labeled app=web-app is not already running there. So, if a matching pod is already running on a node, the new pod will be scheduled on some other node.
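In practice you rarely create such pods one at a time; putting the same anti-affinity rule into a Deployment's pod template spreads all replicas automatically. Here is a sketch, assuming the well-known kubernetes.io/hostname label so that each placement domain is a single node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-app
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: web-app
        image: <container-image>
Note that with a required rule and 3 replicas, the cluster needs at least 3 eligible nodes; otherwise the extra replicas stay Pending.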
A word of caution
Though podAffinity and podAntiAffinity let you control pod placement, they also have a downside: evaluating these rules requires a substantial amount of processing, which can slow down scheduling significantly in large clusters. They are not recommended for clusters larger than several hundred nodes.
Million dollar question: why not place containers in the same pod?
Why should we take on all this complexity just to place two pods together? Why not put multiple containers in a single pod; after all, a pod is little more than a wrapper around containers. So, rather than going through all the complexity above, should we just place multiple containers in one pod? The short answer is: it depends. The longer answer: keep the following things in mind (see the sketch after this list):
- No one can stop you from putting multiple containers in a pod; it all depends on your application's requirements. If you think your business model and the way your application operates would benefit from it, go ahead and do it.
- If a problem in one of your containers crashes the pod, you also lose the other application, which might not be at fault at all.
- When scaling horizontally, you will scale both applications together, even if only one of them needed to scale. That means you will pay for more resources.
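For completeness, here is what the multi-container alternative would look like; both containers share the pod's network namespace, so the application can reach the cache over localhost (both image names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: web-app-with-cache
spec:
  containers:
  - name: web-app
    image: <container-image>
  - name: cache
    image: <cache-image>    # reachable from web-app via localhost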
Conclusion
Pod affinity and anti-affinity are powerful tools for controlling pod placement in Kubernetes. However, in large clusters, be mindful of the performance impact. Always balance the need for specific pod placement with scalability and performance concerns.
Please use the comments section if you have questions or any feedback.