Flagger – Canary deployments on Kubernetes
Fabian Piau | Tuesday May 19th, 2020 - 07:56 PM
Updated October 17th, 2020: Use newer versions (Helm 3, Kubernetes 1.18, Istio 1.7, Flagger 1.2).
This article is the second one of the series dedicated to Flagger. In a nutshell, Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. It reduces the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring metrics and running conformance tests.
Make sure you have a local Kubernetes cluster running with the service mesh Istio. If you don’t, read the first article: Flagger – Get Started with Istio and Kubernetes.
In this second guide, we will focus on the installation of Flagger and run multiple canary deployments of the application Mirror HTTP Server (MHS). Remember that this dummy application can simulate valid and invalid responses based on the request. This is exactly what we need to test the capabilities of Flagger. We will cover both happy (rollout) and unhappy (rollback) scenarios.
This is a hands-on guide and can be followed step by step on macOS. It will require some adjustments if you are using a Windows or Linux PC. It is important to note that this article only touches on the underlying concepts & technologies, so if you are not familiar with Docker, Kubernetes, Helm or Istio, I strongly advise you to check some documentation yourself before continuing.
Installing Flagger
Let’s install Flagger by running these commands. We install Flagger in its own namespace flagger-system.
# add the Flagger Helm repository if it is not already there
helm repo add flagger https://flagger.app
# the namespace must exist before installing the chart into it
kubectl create namespace flagger-system
kubectl apply -f https://raw.githubusercontent.com/weaveworks/flagger/master/artifacts/flagger/crd.yaml
helm upgrade -i flagger flagger/flagger \
--namespace=flagger-system \
--set crd.create=false \
--set meshProvider=istio \
--set metricsServer=http://prometheus.istio-system:9090
Reference: Flagger Install on Kubernetes
Flagger depends on Istio telemetry and Prometheus (in this case, we assume Istio is installed in the istio-system namespace).
All parameters are available on the Flagger readme file on GitHub.
We don’t specify a version for Flagger, which means it will use the latest available in the repo (1.2.0 at the time of writing).
After a few seconds, you should get a message confirming that Flagger has been installed. From the Kube dashboard, verify that the new flagger-system namespace has been created and the Flagger pod is running.
Experiment 0 – Initialize Flagger with MHS v1.1.1
Mirror HTTP Server has multiple versions available. To play with Flagger’s canary deployment feature, we will switch between versions 1.1.1, 1.1.2 and 1.1.3 of MHS (the latest version at the time of writing).
Before deploying MHS, let’s create a new namespace called application; we don’t want to use the default one at the root of the cluster (this is good practice). The name is a bit generic but sufficient for this tutorial; in general you would use the name of the team or of a group of features.
Do not forget to activate Istio on this new namespace:
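For example (creating the namespace is only needed if it does not exist yet; the istio-injection label is what tells Istio to inject its sidecar proxy into the pods of this namespace):
kubectl create namespace application
kubectl label namespace application istio-injection=enabled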
To deploy MHS via Flagger, I created a Helm chart.
This “canary flavored” chart was based on the previous chart without Flagger, which was itself created with the helm create mhs-chart command and then adapted. In this “canary flavored” chart, I made some extra adaptations: it uses 2 replicas instead of 1 to make it more realistic and pins the version to 1.1.1. I also added the canary resource, where the magic happens.
Clone the chart repo:
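For example (the repository URL below is an assumption based on the chart directory name; use the link from the article if it differs):
git clone https://github.com/fabianpiau/mhs-canary-chart.git
cd mhs-canary-chart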
And install MHS:
helm install mhs --namespace application ./mhs
After a few moments, if you look at the dashboard, you should see 2 replicas of MHS in the application namespace.
It is important to note that no canary analysis has been performed and the version has been automatically promoted. It was not a “real” canary release.
Why? Because Flagger needs to initialize itself the first time we do a canary deployment of the application. So make sure the version you are deploying with Flagger the first time is fully tested and works well!
You could also guess this auto-promotion happened because there was no initial version of the application in the cluster. Although this is obviously a good reason, it’s important to note that, even if a previous version had been deployed before (e.g. 1.1.0), the canary version 1.1.1 would still have been automatically promoted without analysis.
You can still check the canary events with:
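The Canary resource created by the chart is a regular Kubernetes object, so describing it shows its events:
kubectl -n application describe canary/mhs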
You should have a similar output without a canary analysis:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Synced 2m29s flagger mhs-primary.application not ready: waiting for rollout to finish: observed deployment generation less then desired generation
Normal Synced 92s (x2 over 2m30s) flagger all the metrics providers are available!
Normal Synced 92s flagger Initialization done! mhs.application
Or you can also directly check the log from Flagger:
kubectl -n flagger-system logs $FLAGGER_POD_NAME
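If you don’t know the exact pod name, you can target the deployment instead (assuming the Helm release created a deployment named flagger, which is the default):
kubectl -n flagger-system logs deployment/flagger -f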
If you take a closer look at the Kube dashboard, you should see some mhs and mhs-primary resources:
- mhs-primary are the primary instances (= the non-canary ones). Flagger automatically adds the -primary suffix to differentiate them from the canary instances.
- mhs are the canary instances. They exist only during the canary deployment and will disappear once the canary deployment ends. That’s why, in the screenshot above, you don’t see any mhs canary pods (i.e. 0 / 0 pod).
Why this naming convention? I asked the Flagger team directly, and there is a technical constraint behind it.
Flagger is now initialized properly and MHS is deployed to your cluster. You can use the terminal to confirm MHS is accessible (thanks to the Istio Gateway):
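For example (assuming Docker Desktop exposes the Istio ingress gateway on localhost port 80; adjust the host and port to your setup):
curl -i http://localhost/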
You should receive an HTTP 200 OK response:
x-powered-by: Express
date: Sun, 17 May 2020 16:47:33 GMT
x-envoy-upstream-service-time: 10
server: istio-envoy
transfer-encoding: chunked
And the following request, which asks MHS to simulate a server error, should return an HTTP 500 response:
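Here is a sketch of such a request. I assume the X-Mirror-Code header drives the response status, as covered in the first article; check the MHS documentation if your version uses a different mechanism.
curl -i -H 'X-Mirror-Code: 500' http://localhost/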
x-powered-by: Express
date: Sun, 17 May 2020 16:48:09 GMT
x-envoy-upstream-service-time: 12
server: istio-envoy
transfer-encoding: chunked
Experiment 1 – MHS v1.1.2 canary deployment
We are going to install the newer version 1.1.2. You need to manually edit the file mhs-canary-chart/mhs/values.yaml and replace tag: 1.1.1 with tag: 1.1.2.
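If you prefer doing this edit from the command line, here is one way with macOS/BSD sed (adjust the path depending on the directory you are in):
sed -i '' 's/tag: 1.1.1/tag: 1.1.2/' mhs-canary-chart/mhs/values.yaml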
Then:
helm upgrade mhs --namespace application ./mhs
While the canary deployment is in progress, it’s very important to generate some traffic to MHS. Without traffic, Flagger will consider that something went wrong with the new version and will roll back automatically to the previous one. Obviously, you don’t need this extra step in a production environment that continuously receives real traffic.
Run this loop command in another terminal to generate artificial traffic:
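A minimal example, assuming the gateway is still reachable on localhost (stop it with Ctrl+C once the deployment is over):
while true; do curl -s -o /dev/null http://localhost/; sleep 1; done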
Check the Kube dashboard; at some point you should see the canary pod with the new version 1.1.2:
Check the canary events with the same command as before:
After a while (about 6 minutes) you should have a similar event output:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Synced 30m flagger mhs-primary.application not ready: waiting for rollout to finish: observed deployment generation less then desired generation
Normal Synced 29m (x2 over 30m) flagger all the metrics providers are available!
Normal Synced 29m flagger Initialization done! mhs.application
Normal Synced 10m flagger New revision detected! Scaling up mhs.application
Normal Synced 9m16s flagger Starting canary analysis for mhs.application
Normal Synced 9m16s flagger Advance mhs.application canary weight 10
Normal Synced 8m16s flagger Advance mhs.application canary weight 20
Normal Synced 7m16s flagger Advance mhs.application canary weight 30
Normal Synced 6m16s flagger Advance mhs.application canary weight 40
Normal Synced 5m16s flagger Advance mhs.application canary weight 50
Normal Synced 4m16s flagger Copying mhs.application template spec to mhs-primary.application
Normal Synced 3m16s flagger Routing all traffic to primary
Normal Synced 2m16s flagger (combined from similar events): Promotion completed! Scaling down mhs.application
The canary release completed successfully. You now have version 1.1.2 installed on all the primary pods and the canary pod has been removed.
Why did this deployment take about 6 minutes? Because it includes a 5-minute canary analysis. During this analysis, traffic was routed progressively to the canary pod. The canary traffic increased in steps of 10% every minute until it reached 50% of the global traffic. The analysis is configurable and defined in the canary.yaml file that was added to the chart.
Below is the analysis configuration:
analysis:
  # stepper schedule interval
  interval: 1m
  # max traffic percentage routed to canary - percentage (0-100)
  maxWeight: 50
  # canary increment step - percentage (0-100)
  stepWeight: 10
  # max number of failed metric checks before rollback (global to all metrics)
  threshold: 5
  metrics:
    - name: request-success-rate
      # percentage before the request success rate metric is considered as failed (0-100)
      thresholdRange:
        min: 99
      # interval for the request success rate metric check
      interval: 30s
    - name: request-duration
      # maximum req duration P99 in milliseconds before the request duration metric is considered as failed
      thresholdRange:
        max: 500
      # interval for the request duration metric check
      interval: 30s
The canary analysis relies on the 2 basic metrics provided out of the box by Istio / Prometheus (request success rate and request duration). It is possible to define your own custom metrics. In that case, your application will need to expose a Prometheus endpoint that includes them, and you will be able to update the Flagger analysis configuration to use them with your own PromQL query. Note this goes beyond the scope of this hands-on guide, which uses only the built-in metrics.
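To give an idea, below is a minimal sketch based on Flagger’s MetricTemplate resource. The metric name, namespace and PromQL query are illustrative only and assume your application exports a http_requests_total counter:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(rate(http_requests_total{namespace="{{ namespace }}", status!~"5.*"}[{{ interval }}]))
    / sum(rate(http_requests_total{namespace="{{ namespace }}"}[{{ interval }}])) * 100
The template can then be referenced from the analysis section with a templateRef and its own thresholdRange:
  metrics:
    - name: error-rate
      templateRef:
        name: error-rate
        namespace: flagger-system
      thresholdRange:
        max: 1
      interval: 1m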
Experiment 2 – MHS v1.1.3 faulty deployment
Again, you need to manually edit the file mhs-canary-chart/mhs/values.yaml and replace tag: 1.1.2 with tag: 1.1.3.
Then:
helm upgrade mhs --namespace application ./mhs
We generate some artificial traffic:
This time, we also generate invalid traffic to make sure the request success rate is going down!
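For instance, keep the previous loop running and start a second one that repeats the error-triggering request from earlier (still assuming the X-Mirror-Code header):
while true; do curl -s -o /dev/null -H 'X-Mirror-Code: 500' http://localhost/; sleep 1; done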
Check the canary events with the same command as before:
After a while (about 6 minutes) you should have a similar event output:
Normal Synced 7m23s (x2 over 19m) flagger Advance mhs.application canary weight 10
Normal Synced 7m23s (x2 over 19m) flagger Starting canary analysis for mhs.application
Warning Synced 6m23s flagger Halt mhs.application advancement success rate 57.14% < 99%
Warning Synced 5m24s flagger Halt mhs.application advancement success rate 0.00% < 99%
Warning Synced 3m24s flagger Halt mhs.application advancement success rate 71.43% < 99%
Warning Synced 2m24s flagger Halt mhs.application advancement success rate 50.00% < 99%
Warning Synced 84s flagger Halt mhs.application advancement success rate 63.64% < 99%
Warning Synced 24s flagger Rolling back mhs.application failed checks threshold reached 5
Warning Synced 24s flagger Canary failed! Scaling down mhs.application
And you are still on version 1.1.2.
Flagger decided not to go ahead and promote version 1.1.3 as it could not complete a successful analysis and the error threshold was reached, i.e. 5 failed checks (indeed, each time, about 50% of the requests were ending up in an HTTP 500 response). Flagger simply redirected all traffic back to the primary instances and removed the canary pod.
Congratulations, you’ve come to the end of this second tutorial!
Observations
Before we clean up the resources we’ve created, let’s wrap up with a list of observations:
- Deleting a deployment will delete all pods (canary / primary), so we don’t end up with orphan resources.
- Prometheus is required. Without it, the canary analysis won’t work.
- It is not possible to re-trigger a canary deployment of the same version if it has just failed. It forces you to bump up the version (even if it was a configuration and not a code issue).
- The Flagger off-boarding process is not as simple as removing the canary resource from the chart and deploying a new version. If you delete the canary resource, Flagger won’t trigger the canary process: it will change the version in mhs and remove mhs-primary, but mhs has 0 pods, so it will make your service unavailable! You need to be careful and adopt a proper manual off-boarding process. Recently, the Flagger team added a property revertOnDeletion you can enable to avoid this issue (see the snippet after this list). You can read the documentation to know more about this canary finalizer.
- After multiple deployments, it seems that some events can be missing: the Kubernetes describe command accumulates them (x<int> over <int>m), sometimes the order is not preserved and/or some events are not showing up. You can look at the phase status instead (terminal statuses are Initialized, Succeeded and Failed). The best is to look directly at the logs of the Flagger pod, as these are always accurate and complete.
- The canary analysis should be configured to run for a short period of time (i.e. no more than 30 minutes) to leverage continuous deployment and avoid releasing a new version while a canary deployment for the previous one is still in progress. If you want to perform canary releases over longer periods, Flagger may not be the best tool.
- Finally, it’s important to remember that the first time you deploy with Flagger (like in experiment 0 above), the tool needs to initialize itself (Initialized status) and will not perform any analysis.
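As a reference, the revertOnDeletion switch mentioned in the off-boarding observation is a single field on the Canary spec. A minimal sketch (the names match the chart used in this tutorial):
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: mhs
  namespace: application
spec:
  # revert the target deployment to its original state when the canary
  # resource is deleted, instead of leaving mhs scaled down to 0 pods
  revertOnDeletion: true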
Cleaning up resources
Now that the tutorial is complete, you can remove the MHS application and its namespace.
kubectl delete namespaces application
We recommend that you leave Flagger and Istio in place to save time in the next tutorial. If however you’d like to remove everything now, then you can run the following commands.
Remove Flagger:
kubectl delete namespaces flagger-system
Remove Istio and Prometheus:
istioctl manifest generate --set profile=demo | kubectl delete -f -
kubectl delete namespaces istio-system
What’s next?
The next article will focus on the Grafana dashboard provided out of the box with Flagger, which is a nice addition: you don’t need to manually run any kubectl commands to check the result of your canary deployments. Stay tuned! In the meantime, you can stop the Kubernetes cluster by unchecking the Kubernetes box in the Docker Desktop settings and restarting Docker Desktop. Your computer deserves another break.