How to expose a headless service for a StatefulSet externally in Kubernetes

We solved this in Kubernetes 1.7 by changing the headless service to type NodePort and setting externalTrafficPolicy: Local. This bypasses the Service's internal load balancing: traffic sent to a specific node on that node port will only connect if a Kafka pod is running on that node.

apiVersion: v1
kind: Service
metadata:
  name: broker
spec:
  externalTrafficPolicy: Local
  ports:
  - nodePort: 30000
    port: 30000
    protocol: TCP
    targetPort: 9092
  selector:
    app: broker
  type: NodePort

For example, say we have two nodes, nodeA and nodeB, and only nodeB is running a Kafka pod. nodeA:30000 will not connect, but nodeB:30000 will connect to the Kafka pod running on nodeB.

https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-typenodeport
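To check which node is actually hosting a broker pod (and therefore which node:nodePort will accept connections), something like the following can be used; the app=broker label and the nodeB name match the example Service above:

kubectl get pods -l app=broker -o wide    # shows which node each broker pod landed on
nc -vz nodeB 30000                        # succeeds only against nodes that host a broker pod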

Note this was also available in 1.5 and 1.6 as a beta annotation; more on feature availability can be found here: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip

Note also that while this ties a Kafka pod to a specific external network identity, it does not guarantee that your storage volume will be tied to that network identity. If you are using volumeClaimTemplates in a StatefulSet, your volumes are tied to the pod, while Kafka expects the volume to be tied to the network identity.

For example, if the kafka-0 pod restarts and comes up on nodeC instead of nodeA, kafka-0's PVC (if using volumeClaimTemplates) still holds data for the identity it advertised from nodeA, so the broker running in kafka-0 starts rejecting requests, thinking it is nodeA rather than nodeC.

To fix this we are looking forward to Local Persistent Volumes, but for now we have a single PVC for our Kafka StatefulSet and data is stored under $NODENAME on that PVC to tie volume data to a particular node.

https://github.com/kubernetes/features/issues/121 https://kubernetes.io/docs/concepts/storage/volumes/#local
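As a sketch of that $NODENAME workaround (the mount path, the NODENAME variable injected via the downward API, and the sed edit are assumptions for illustration, not our exact script):

# Assumed: NODENAME is injected via the downward API (spec.nodeName)
# and the single shared PVC is mounted at /var/lib/kafka
mkdir -p "/var/lib/kafka/${NODENAME}"
# Point this broker's data directory at the per-node subdirectory
sed -i "s|^log.dirs=.*|log.dirs=/var/lib/kafka/${NODENAME}|" /etc/kafka/server.properties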


Note: I completely rewrote this post a year after the initial posting:
1. Some of what I wrote is no longer relevant given updates to Kubernetes, and I figured it should be deleted to avoid confusing people.
2. I now know more about both Kubernetes and Kafka and should be able to give a better explanation.

Background Contextual Understanding of Kafka on Kubernetes:
Let's say a Service of type ClusterIP and a StatefulSet are used to deploy a 5 pod Kafka cluster on a Kubernetes cluster. Because a StatefulSet was used to create the pods, they each automatically get the following 5 inner-cluster DNS names, and the Kafka Service of type ClusterIP adds one more inner-cluster DNS name.

M$*  kafka-0.my-kafka-headless-service.my-namespace.svc.cluster.local 
M$   kafka-1.my-kafka-headless-service.my-namespace.svc.cluster.local 
M *  kafka-2.my-kafka-headless-service.my-namespace.svc.cluster.local 
M *  kafka-3.my-kafka-headless-service.my-namespace.svc.cluster.local 
M$   kafka-4.my-kafka-headless-service.my-namespace.svc.cluster.local
     kafka-service.my-namespace.svc.cluster.local

^ Let's say you have 2 Kafka topics: $ and *
Each Kafka topic is replicated 3 times across the 5 pod Kafka cluster
(the ASCII diagram above shows which pods hold the replicas of the $ and * topics; M represents metadata)
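For reference, a minimal sketch of the two Services behind those DNS names might look like the following (the app=kafka selector and the port are assumptions; the StatefulSet would also need serviceName: my-kafka-headless-service for the per-pod DNS entries to exist):

apiVersion: v1
kind: Service
metadata:
  name: my-kafka-headless-service    # yields kafka-0.my-kafka-headless-service... per-pod DNS
  namespace: my-namespace
spec:
  clusterIP: None                    # headless: DNS returns individual pod IPs, no load balancing
  selector:
    app: kafka
  ports:
  - port: 9092
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-service                # yields kafka-service.my-namespace... round-robin DNS
  namespace: my-namespace
spec:
  type: ClusterIP
  selector:
    app: kafka
  ports:
  - port: 9092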

4 useful bits of background knowledge:
1. .svc.cluster.local is the inner-cluster DNS FQDN suffix, but pods are automatically given DNS search domains that fill it in, so you can omit it when talking over inner-cluster DNS.
2. The kafka-x.my-kafka-headless-service.my-namespace inner-cluster DNS name resolves to a single pod.
3. The kafka-service.my-namespace Kubernetes Service of type ClusterIP acts like an inner-cluster Layer 4 load balancer and will round-robin traffic between the 5 Kafka pods.
4. A critical Kafka-specific concept to realize is that when a Kafka client talks to a Kafka cluster, it does so in 2 phases. Let's say a Kafka client wants to read the $ topic from the Kafka cluster.
Phase 1: The client reads the Kafka cluster's metadata. This is synchronized across all 5 Kafka pods, so it doesn't matter which one the client talks to; it can therefore be useful to do the initial communication using kafka-service.my-namespace (which load balances and forwards to a random healthy Kafka pod). See the metadata-inspection sketch after this list.
Phase 2: The metadata tells the Kafka client which Kafka brokers/nodes/servers/pods have the topic of interest; in this case $ exists on 0, 1, and 4. So for Phase 2 the client will only talk directly to the Kafka brokers that have the data it needs.
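If you want to see that Phase 1 metadata for yourself, a throwaway pod running kcat (formerly kafkacat) inside the cluster can dump it; the image tag and namespace here are assumptions:

kubectl run kcat --rm -it --restart=Never --image=edenhill/kcat:1.7.1 -- \
  -L -b kafka-service.my-namespace:9092
# -L prints the metadata response: the broker list (i.e. the advertised listeners) plus topics/partitions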

How to Externally Expose Pods of a Headless Service/StatefulSet and Kafka-specific Nuance:
Let's say I have a 3 pod HashiCorp Consul cluster spun up on a Kubernetes cluster, I configure it so the web UI is enabled, and I want to expose the web UI to the LAN/externally. There's nothing special about the fact that the pods sit behind a headless service: you can use a Service of type NodePort or LoadBalancer to expose them as you normally would any pod, and the NP or LB will round-robin incoming traffic between the 3 Consul pods.
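A sketch of that "normal" exposure for the Consul web UI, assuming the pods carry an app=consul label and the UI listens on 8500 (both assumptions):

apiVersion: v1
kind: Service
metadata:
  name: consul-ui-external
spec:
  type: NodePort            # or LoadBalancer on a cloud provider
  selector:
    app: consul             # matches all 3 Consul pods, so incoming traffic is round-robined
  ports:
  - port: 8500              # Consul UI/API port (assumed)
    targetPort: 8500
    nodePort: 30850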

Because Kafka communication happens in 2 phases, this introduces some nuance: the normal method of externally exposing the StatefulSet's headless service using a single Service of type LB or NP might not work when you have a Kafka cluster of more than 1 Kafka pod.
1. The Kafka client expects to speak directly to the Kafka broker during Phase 2 communications. So instead of 1 Service of type NodePort, you might want 6 Services of type NodePort/LB: 1 that round-robin load balances traffic for Phase 1, and 5 with a 1:1 mapping to individual pods for Phase 2 communication (see the sketch after this list).
(If you run kubectl get pods --show-labels against the 5 Kafka pods, you'll see that each pod of the StatefulSet has a unique label, statefulset.kubernetes.io/pod-name=kafka-0, which lets you manually create 1 NP/LB Service that maps to 1 pod of a StatefulSet.) (Note this alone isn't enough.)
2. When you install a Kafka cluster on Kubernetes, it's common for its default configuration to only support Kafka clients inside the Kubernetes cluster. Remember the metadata from Phase 1 of a Kafka client talking to a Kafka cluster: the Kafka cluster may have been configured so that its "advertised.listeners" are made up of inner-cluster DNS names. So when a LAN client talks to an externally exposed Kafka cluster via NP/LB, it succeeds in Phase 1 but fails in Phase 2, because the metadata returned in Phase 1 gave inner-cluster DNS names as the means of communicating directly with the pods, and those names are not resolvable by clients outside the cluster, so they only work for Kafka clients inside the cluster. It's therefore important to configure your Kafka cluster so that the "advertised.listeners" returned by the Phase 1 metadata are resolvable by clients both external and internal to the cluster.
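A sketch of one of those 5 per-pod Services, using the built-in pod-name label mentioned in point 1 (the name, namespace, and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: kafka-0-external
  namespace: my-namespace
spec:
  type: NodePort                                   # or LoadBalancer
  selector:
    statefulset.kubernetes.io/pod-name: kafka-0    # matches exactly one pod of the StatefulSet
  ports:
  - port: 9094
    targetPort: 9094
    nodePort: 30094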

Clarity on where the Problem caused by the Kafka Nuance Lies:
For Phase 2 of communication between Kafka client -> broker, you need to configure the "advertised.listeners" to be externally resolvable. This is difficult to pull off using standard Kubernetes logic, because what you need is for kafka-0 ... kafka-4 to each have a unique configuration, i.e. each its own externally reachable "advertised.listeners", while by default StatefulSet pods are meant to have cookie-cutter configurations that are more or less identical.
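To make that concrete, this is roughly the per-broker configuration you are trying to end up with; the listener names and ports are illustrative, and the EXTERNAL address is a placeholder for whatever each broker's own external Service exposes:

# kafka-0 (kafka-1 ... kafka-4 each need a different EXTERNAL advertised address)
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
advertised.listeners=INTERNAL://kafka-0.my-kafka-headless-service.my-namespace:9092,EXTERNAL://<address-of-kafka-0-external-service>:30094
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL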

Solution to the Problem caused by the Kafka Nuances:
The Bitnami Kafka Helm chart has some custom logic that allows each pod in the StatefulSet to have a unique "advertised.listeners" configuration. Bitnami offers hardened containers (according to Quay.io, the 2.5.0 image only has a single High CVE), runs as non-root, has reasonable documentation, and can be externally exposed*: https://quay.io/repository/bitnami/kafka?tab=tags

On the last project I was on I went with Bitnami, because security was the priority and we only had Kafka clients internal to the Kubernetes cluster. I ended up having to figure out how to externally expose it in a dev environment so someone could run some kind of test, and I remember being able to get it to work, though it wasn't super simple. That being said, if I were to do another Kafka on Kubernetes project I'd recommend looking into the Strimzi Kafka Operator, as it's more flexible in terms of options for externally exposing Kafka, and it has a great 5 part deep-dive write-up covering the different options for externally exposing a Kafka cluster running on Kubernetes using Strimzi (via NP, LB, or Ingress). (I'm not sure what Strimzi's security looks like though, so I'd recommend using something like AnchoreCLI to do a shift-left CVE scan of the Strimzi images before trying a PoC.)
https://strimzi.io/blog/2019/04/17/accessing-kafka-part-1/


The solutions so far weren't quite satisfying enough for me, so I'm going to post an answer of my own. My goals:

  1. Pods should still be dynamically managed through a StatefulSet as much as possible.
  2. Create an external service per Pod (i.e. Kafka broker) for producer/consumer clients and avoid load balancing.
  3. Create an internal headless service so that each Broker can communicate with each other.

Starting with Yolean/kubernetes-kafka, the only thing missing is exposing the service externally, and there are two challenges in doing so.

  1. Generating unique labels per Broker pod so that we can create an external service for each of the Broker pods.
  2. Telling the Brokers to communicate to each other using the internal Service while configuring Kafka to tell the producer/consumers to communicate over the external Service.

Per pod labels and external services:

To generate labels per pod, this issue was really helpful. Using it as a guide, we add the following line to the init.sh property in 10broker-config.yml:

kubectl label pods ${HOSTNAME} kafka-set-component=${HOSTNAME}

We keep the existing headless service, but we also generate an external Service per pod using the label (I added them to 20dns.yml):

apiVersion: v1
kind: Service
metadata:
  name: broker-0
  namespace: kafka
spec:
  type: NodePort
  ports:
  - port: 9093
    nodePort: 30093
  selector:
    kafka-set-component: kafka-0
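Since one such Service is needed per broker, a small loop can stamp them out; this is a sketch assuming 3 brokers and the port convention above:

for i in 0 1 2; do
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: broker-$i
  namespace: kafka
spec:
  type: NodePort
  ports:
  - port: 9093
    nodePort: $((30093 + i))
  selector:
    kafka-set-component: kafka-$i
EOF
done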

Configure Kafka with internal/external listeners

I found this issue incredibly useful in trying to understand how to configure Kafka.

This again requires updating the init.sh and server.properties properties in 10broker-config.yml with the following:

Add the following to the server.properties to update the security protocols (currently using PLAINTEXT):

listener.security.protocol.map=INTERNAL_PLAINTEXT:PLAINTEXT,EXTERNAL_PLAINTEXT:PLAINTEXT
inter.broker.listener.name=INTERNAL_PLAINTEXT

Dynamically determine the external IP and the external port for each Pod in the init.sh:

EXTERNAL_LISTENER_IP=<your external addressable cluster ip>
EXTERNAL_LISTENER_PORT=$((30093 + ${HOSTNAME##*-}))
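If you would rather not hard-code the address, one alternative (my assumption, not part of the original setup) is to inject the node's IP into each pod via the downward API, which pairs naturally with the externalTrafficPolicy: Local NodePort approach described earlier:

# In the StatefulSet's container spec:
env:
- name: EXTERNAL_LISTENER_IP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP   # IP of the node this pod is scheduled on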

Then configure listeners and advertised.listeners IPs for EXTERNAL_LISTENER and INTERNAL_LISTENER (also in the init.sh property):

sed -i "s/#listeners=PLAINTEXT:\/\/:9092/listeners=INTERNAL_PLAINTEXT:\/\/0.0.0.0:9092,EXTERNAL_PLAINTEXT:\/\/0.0.0.0:9093/" /etc/kafka/server.properties
sed -i "s/#advertised.listeners=PLAINTEXT:\/\/your.host.name:9092/advertised.listeners=INTERNAL_PLAINTEXT:\/\/$HOSTNAME.broker.kafka.svc.cluster.local:9092,EXTERNAL_PLAINTEXT:\/\/$EXTERNAL_LISTENER_IP:$EXTERNAL_LISTENER_PORT/" /etc/kafka/server.properties

Obviously, this is not a full solution for production (for example, it doesn't address security for the externally exposed brokers), and I'm still refining my understanding of how to also let internal producers/consumers communicate with the brokers.

However, so far this is the best approach given my understanding of Kubernetes and Kafka.