
CFP: Making Sense of Metrics: Crafting and Leveraging Prometheus Metrics for Infrastructure Intelligence

February 29, 2024 9:28 am

Audience

This talk is targeted at System Administrators and Site Reliability Engineers who want to learn how to best make sense of the Prometheus metrics their systems expose. If you know PromQL, but the queries behind your dashboards are still a mystery to you, you are not alone. This talk will show how to get information out of your metrics to maximize insight and make data-based decisions.

Outline

Creating new metrics and collecting them with Prometheus is easier today than ever before. Site Reliability Engineers and System Administrators have all the data at hand they need to make the right, data-based decisions. But how?
Making sense of all that information is still a challenge. Crafting the right PromQL query to answer your question and manifesting it in a Grafana dashboard is a complex and time-consuming task, not to mention understanding that query when you need to change it a few weeks later.

In this session, you will see different approaches to making sense of the Prometheus metrics exposed by a software deployment: starting from the default Prometheus UI, via PromLens, an improved, open source query-building UI, all the way to an experiment on transforming Prometheus metrics into a data warehouse for improved data exploration and visualization. Data analysts have used business intelligence software for decades. What can we learn from these systems to discover knowledge in the ocean of metrics and make better decisions for our infrastructure?

Key Takeaways

During this talk, attendees will learn (1) how to best explore and query the available metrics in their environment, (2) which tools are available today, and (3) how infrastructure intelligence can leverage data warehouse concepts for improved knowledge discovery and decision making.

Pod Autoscaling in OpenShift and Kubernetes

November 24, 2022 11:05 am

To test out horizontal pod autoscaling, I built a simple example application that shares “load” across instances. The load is as simple as allocating a configurable chunk of memory.

When the load increases, the pod autoscaler spins up new instances so the chunks get smaller and the target is reached.

You can find the example application on GitHub.

Run locally

Check out the repository, run make build and start ./pod-autoscaler-example.

It will allocate the default 100 MiB of RAM and wait for additional instances to connect:

$ ./pod-autoscaler-example 
Can't connect to main instance at localhost:8081, assuming there is none and acting as main
Waiting for server to be responding...
Instance ID: main
Allocating 100 Mi
Instance ID: main Variable size: 100 Mi Allocated: 100 Mi
Instance ID: main Variable size: 100 Mi Allocated: 100 Mi
Instance ID: main Variable size: 100 Mi Allocated: 100 Mi

Starting a second instance will share the “load” across instances:

$ ./pod-autoscaler-example 
Instance ID: nXii9k5RhdcdwYcBKw6hRT
Can't run worker server, assuming we run on one host: listen tcp :8081: bind: address already in use
Allocating 50 Mi
Instance ID: nXii9k5RhdcdwYcBKw6hRT Variable size: 50 Mi Allocated: 50 Mi
Instance ID: nXii9k5RhdcdwYcBKw6hRT Variable size: 50 Mi Allocated: 50 Mi
Instance ID: nXii9k5RhdcdwYcBKw6hRT Variable size: 50 Mi Allocated: 50 Mi
Instance ID: nXii9k5RhdcdwYcBKw6hRT Variable size: 50 Mi Allocated: 50 Mi

Increase the workload (this is what should trigger the autoscaler later) using curl:

$ curl localhost:8081/set?mem=300

All instances should get the new workload size now:

Allocating 150 Mi
Instance ID: nXii9k5RhdcdwYcBKw6hRT Variable size: 150 Mi Allocated: 150 Mi
Instance ID: nXii9k5RhdcdwYcBKw6hRT Variable size: 150 Mi Allocated: 150 Mi
Instance ID: nXii9k5RhdcdwYcBKw6hRT Variable size: 150 Mi Allocated: 150 Mi

Stopping the main instance will result in a re-election workflow (not too sophisticated yet, but it should keep the workload size):

Error sending request to getInstanceInfo: Get "http://192.168.178.46:8082/getInstanceInfo": dial tcp 192.168.178.46:8082: connect: connection refused
Assuming main is not responding anymore. Checking who can take over.
New main: nXii9k5RhdcdwYcBKw6hRT My ID: nXii9k5RhdcdwYcBKw6hRT
Taking over...
Waiting for server to be responding...
Instance ID: main
Allocating 300 Mi
Instance ID: main Variable size: 300 Mi Allocated: 300 Mi
Instance ID: main Variable size: 300 Mi Allocated: 300 Mi

With the main instance gone, fewer instances remain to share the work, so the workload per instance increases.

A metrics endpoint is available so you can observe the application using Prometheus:

$ curl localhost:8081/metrics
instances_count 2
workload_mib 300
chunksize_mib 150

Running on OpenShift/Kubernetes

Resources in the deploy folder can be used to run it on a Kubernetes or OpenShift cluster.

The deployment in deploy.yaml creates a single instance; the service and route (svc.yaml and route.yaml) are used to expose it. Note that the command passes the service hostname and port to the containers so they can initiate communication when new instances come up.
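
How exactly that hand-over happens is defined in deploy.yaml in the repository. As a rough, hypothetical sketch (the flag names and image reference below are invented for illustration only), the container section of the pod template could look something like this:

# Hypothetical sketch of the pod template's container section.
# Flag names and the image reference are invented for illustration;
# the real values are defined in the repository's deploy.yaml.
containers:
- name: pod-autoscaler-example
  image: quay.io/mdewald/pod-autoscaler-example  # hypothetical image reference
  command:
  - /pod-autoscaler-example
  - -main-host=pod-autoscaler-example  # hypothetical flag: the service hostname
  - -main-port=8081                    # hypothetical flag: the service port
  ports:
  - containerPort: 8081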

The horizontal pod autoscaler in hpa.yaml will increase and decrease the number of instances as necessary when the load goes up or down.
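
As a minimal sketch, assuming the autoscaler scales on average memory utilization, such an HPA could look like the following (the actual hpa.yaml in the repository may use a different metric, thresholds, and replica bounds):

# Minimal sketch only: metric, thresholds, replica bounds, and names are
# assumptions; see the repository's hpa.yaml for the real definition.
# Utilization-based scaling requires memory requests on the containers.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pod-autoscaler-example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pod-autoscaler-example
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80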

Using the route, you can scale the workload with curl as shown above:

$ curl pod-autoscaler-example-autoscaler-example.apps.mycluster.com/set?mem=200

The ServiceMonitor in servicemonitor.yaml makes the metrics available to Prometheus.
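
As a minimal sketch, assuming the service exposes the metrics port under the name web and carries the label app: pod-autoscaler-example, such a ServiceMonitor could look like this (the actual servicemonitor.yaml in the repository may differ):

# Minimal sketch only: port name, label selector, scrape interval, and
# names are assumptions; see the repository's servicemonitor.yaml for
# the real definition. The port refers to a named port of the service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pod-autoscaler-example
spec:
  selector:
    matchLabels:
      app: pod-autoscaler-example
  endpoints:
  - port: web
    path: /metrics
    interval: 30s

When using OpenShift, you can query the collected metrics in the OpenShift console.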

Querying the chunk size over time, you can see how it first went up when the workload was increased, and then back down when the autoscaler kicked in and scaled up the deployment.

Kubernetes: Distributing Pods of a Deployment across nodes

May 17, 2022 9:14 am

Sometimes you need to ensure that the pods of a deployment are not deployed to the same node. To achieve this, you can use pod anti-affinity and configure it so that a pod does not get scheduled to a node that already runs a pod of the same deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: game
  name: game
  namespace: arcade
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: game
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: game
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - game
            topologyKey: kubernetes.io/hostname
      containers:
      - image: quay.io/mdewald/s3e
        name: s3e

With this pod anti-affinity definition, no two pods of the deployment will be scheduled onto the same node.

During the roll-out, additional pods are created before old pods are removed. If you have the same number of nodes as replicas, that means the roll-out won’t happen: no node is available that satisfies the criteria for scheduling an additional pod. So ideally, you should have more nodes available than the deployment requires replicas.

You can work around this problem by changing from requiredDuringSchedulingIgnoredDuringExecution to preferredDuringSchedulingIgnoredDuringExecution:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - game
      topologyKey: kubernetes.io/hostname
    weight: 100

However, this would allow some of the pods of the deployment to land on the same node during a roll-out of the deployment. After the roll-out, they will usually end up distributed one pod per node again.

If you absolutely never want 2 pods of the same deployment to run on the same node but don’t have more nodes than replicas, it can be an option to migrate from a Deployment to a StatefulSet, which will terminate each pod before creating its replacement:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: game
  name: game
  namespace: arcade
spec:
  replicas: 2
  selector:
    matchLabels:
      app: game
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: game
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - game
            topologyKey: kubernetes.io/hostname
      containers:
      - image: quay.io/mdewald/s3e
        name: s3e

This will ensure that no two pods of the StatefulSet are scheduled to the same node. If you have the same number of nodes as replicas in the StatefulSet, the rollout will proceed as follows: one by one, each pod is removed and its replacement is scheduled to the same node before the next pod is removed.

CFP: Up your Game with OpenShift

October 21, 2021 9:18 am

Automatic build and deployment for your very own browser game

Audience

This talk is targeted at anybody using or planning to use OpenShift to deploy their software. It’s not exactly for game developers, but rather for people who want to learn what it can mean to build a game with open source software and deploy it to an OpenShift cluster so it can be accessed via a web browser, or for people who want to know about the possibilities to automate software builds and deployment with OpenShift.

Outline

OpenShift comes with various built-in tools that can help developers bring their own software to the cloud – or their cluster.

From automatic builds based on Dockerfiles, via source-to-image (S2I) builds, to a full-blown pipeline with Tekton, a solution is there for all use cases one can think of. All use cases? How about the small browser-game developer around the corner?

In this talk we’ll explore the possibilities of providing a custom S2I image for very specific use cases, like building and running a game built with the Godot engine. We will use the image to trigger automatic builds and deployments of a small browser game whenever a change is published to the game’s GitHub repository.

Key Takeaways

During this talk the audience will (1) see the different possibilities OpenShift provides to automate and customize a build process, (2) learn how to build a custom builder image for S2I builds, and (3) get an idea of which option could be a good fit for their own projects.

CFP: The CR that goes around in circles

April 13, 2021 1:18 pm

A tale of common pitfalls in operator and CRD design and how to avoid them

Audience

This talk is targeted at an audience that already has some experience with, or interest in, designing and implementing Kubernetes operators. The terms Custom Resource Definition (CRD), Custom Resource (CR), and operator pattern should be known to them. The examples mentioned during the talk are based on the Operator SDK; however, most patterns are independent of the framework in use.

Outline

When designing and implementing Kubernetes operators, we naturally need to think about CRDs and what a good CRD looks like. While operators often act on existing Kubernetes resources, in most cases we want to build our own objects, which is one of the main advantages of operators: building and operating on custom types that seamlessly integrate into the Kubernetes control plane.

In this talk we will shed light on some common pitfalls, and share some considerations for designing new CRDs and operators. The goal is to enable everyone to build and run operators that work smoothly and avoid CRs that go around in circles.

Key Takeaways

During this talk the audience will (1) see common mistakes in operator design that lead to behaviors like CRs that never leave the reconcile loop, (2) learn how to avoid them, and (3) discover what an ideal CRD that minimizes the risk of such mistakes looks like.

Things I always search for when writing a new BASH script

November 25, 2020 8:16 am

Get the directory of the script

When writing a bash script, especially one that is distributed via a git repository and depends on other files in the same repository, it is often important to know the location of the script so it can use relative paths to other files. In particular, getting the directory of the current script allows users to execute the script from wherever they want, while the script can still reference files relative to its own location. This one-liner, found on Stack Overflow, is what I typically use to get the location of the current script:

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

Building Kubernetes Operators

November 17, 2020 4:33 pm

In the past months at Red Hat, I dove a bit into operator development. While learning how to build Kubernetes operators myself, I published a quick start article on opensource.com. Using the Operator SDK is a good way to kick-start your own Kubernetes operator.

While the post itself is based on an older version of the SDK and usage has changed, you should be able to get such an example operator up and running quickly using this post and a later version of the Operator SDK.

While working on different operators, I collected patterns and best practices I learned or discovered in the different projects and wrote them down in a blog post on openshift.com. The CFP “CFP: How to build a Kubernetes operator that doesn’t break production” is mainly based on this blog post. If you want to build your own operator, those two blog posts may help you build one that is easy to develop, maintain, and run.

CFP: How to build a Kubernetes operator that doesn’t break production

September 21, 2020 8:46 am

Audience

This talk is targeted at software developers and SREs interested in development practices for Kubernetes operators. Are they interested in how the development of an operator differs from other software projects? This talk will give an outline of the operator pattern and what development looks like, focusing on the importance of good engineering practices. Are they writing a Kubernetes operator just to automate a simple task? They should write tests for it, and this talk will tell them why. As Site Reliability Engineers in OpenShift Dedicated, we’re developing and maintaining a number of operators to keep toil on all our operated clusters as low as possible.

Outline

In a recently published blog post, I wrote about how to make sure a Kubernetes operator project is maintainable and follows software development best practices. As SREs, we create and maintain a growing number of operators to keep toil away from us. But a poorly designed, implemented, or tested operator can create toil of its own by not functioning correctly. Adding new features can get hard for SREs, as bugs can slip in undiscovered and the confidence in adding new code can be low if the operator lacks an adequate test suite.

In this talk, I will cover the important concepts you should keep in mind when developing your own Kubernetes operator. Even if you want to start a new project just to automate the setup and configuration of a small application, make sure to give good software development practices all the attention they need, even if you feel this could slow down development and take you more time than just performing that task by hand. Software grows, and in the long run it will pay off if you craft a tested and readable operator from the beginning.

Key Takeaways

During this talk, attendees will learn (1) the importance of treating a Kubernetes operator as production code, (2) that it is very helpful to wrap external dependencies, and (3) how tests help achieve this goal as well as improve the overall structure of the code.

CFP: 5 agile practices and why they are useful to SRE teams

April 21, 2020 2:28 pm

As SRE (Site Reliability Engineering) teams involve a fair portion of software development work and are largely staffed with software developers, it is a natural move to also adopt agile software development practices. The right agile model depends heavily on the ratio of development work to operations, which may be influenced by the team size. For example, in a small team where a high percentage of people is on call during the day, it might not make much sense to plan sprints of 2 weeks if only a few backlog items are expected to get done in that timeframe.

Audience

This talk is targeted at everyone involved in Site Reliability Engineering who is wondering how much agile to adopt – team leads, product owners, software developers, SREs. It applies whether you’re planning to transform your ops team into an SRE team, your SRE team just got started, or you have already been doing SRE for quite some time. As a software engineer who recently joined SRE, I will talk about which practices I found useful to take over from software engineering, which ones are better dropped, and which ones I still sorely miss.

Agile Practices

Retrospective

While it is often the first meeting teams drop, because its relation to actual work items is not easily seen, the retrospective is the tool for teams to iterate on how they work and improve, including deciding which of the agile practices make sense to adopt and which don’t.

Planning: Estimating Backlog Items

Planning meetings help the team understand the priorities of items and the overall direction a project is heading, and build a common understanding of how complex the work is (through estimation). However, the fact that an unknown number of people will be on call or doing incident response makes it hard to set sprint goals or commit to a consistent number of stories.

Standups

Standup meetings are useful, especially in distributed teams, to talk about what you’re working on and where you need help. The meeting does not necessarily have to be daily – and that hit me as a software engineer unexpectedly hard.

Testing

If your SRE team is writing software, that software should be tested. No room for discussion.

That’s what the software engineer might think – but you do need to discuss it. You need to convince your team that testing is helpful. And that’s just as hard in an SRE team as in any software engineering team.

Pair programming

It’s hard to convince people that pair programming is helpful, and it isn’t helpful in every situation – but confidence in code changes as well as operational changes (during an outage, for example) is so much higher when working in a pair.

Key Takeaways

During this talk, attendees will learn (1) that SRE and software engineering alike benefit from agile development practices, (2) that some practices are worth adopting while others may not be too helpful for SRE, and (3) that which ones are and are not helpful is most easily spotted by iterating not only on the work itself but also on how we work (practice retrospectives).

Guest article: Build a Kubernetes Operator in 10 minutes with Operator SDK

April 20, 2020 1:53 pm

It’s been a while since I last submitted an article to opensource.com. This time it is about quickly kick-starting a Kubernetes Operator with Operator SDK.

When you start working on a new software project, a bunch of code often already exists. That’s by no means different when joining the development of a Kubernetes Operator. In the case of the Operator SDK, a good part of the code is additionally generated, so you also want to know which code is hand-written and meant for changes, and which is generated by the SDK.

As I’ve been working on the GCP Project Operator with my team at Red Hat, I wanted to know exactly what the steps are to start an operator from scratch, to better understand what it is that you get from the SDK. I thought it might also be useful for others hopping onto operator development, so I wrote those steps down in a blog article.