4. Experimental Infrastructure

Based on concrete feedback from the system and from customers, changes are continually introduced to the system. Any change, though intended to improve services, can also lead to failures. However, the system must keep changing to evolve and integrate best practices and innovations. The platform should therefore provide capabilities to experiment with new innovations and improvements in a controlled way and accelerate the process of change.

The figure below captures the various elements involved in implementing this practice.

[Figure: Overview]

It brings out five key areas of focus, which are explained in the following sections:

4.1 Data Collection

4.1.1 How to get it?

To understand our customers at Jio, we must gain greater insight into how they behave in real time so that we can drive faster business decisions. Real-time data is needed from the system to predict faults. The platform should seamlessly provide this information to developers to enable quick implementation.

4.1.2 What to record?

Basic questions we can cover:

  • Who is our customer?
  • How long are they engaged with our product?
  • What actions are they taking?
  • What triggers them to speak positively about us?

4.1.3 Where to get it from?

On average, people spend 5 hours a day on social media, which makes it clearly the best place to reach our customers. We can gather information from social media and try to run a compelling, authentic campaign that involves real people plus a “social impact” element.

4.1.4 What is the process to use this data?

As shown in the diagram below, data can come from multiple sources: apps, systems, databases, sensors, etc. The data is gathered and analyzed to find patterns. These patterns drive decision making for various stakeholders, improving systems, shaping new features, and hence delivering value to the customer. Transparency and availability of these systems are key to responding faster to customer needs.

[Figure: InfluxDB]

Refer to the Self-Service Data section for more details on this practice.

4.1.5 Tools

InfluxData provides a Modern Time Series Platform, designed from the ground up to handle metrics and events.

4.1.5.1 TICK: Telegraf, InfluxDB, Chronograf, and Kapacitor

Telegraf: A plugin-driven server agent for collecting and reporting metrics. Telegraf has plugins and integrations to source a variety of metrics, pull metrics from third-party APIs, or even listen for metrics via StatsD and Kafka consumer services. It also has output plugins to send metrics to a variety of other datastores, services, and message queues, including InfluxDB, Graphite, OpenTSDB, Datadog, Librato, Kafka, MQTT, NSQ, and many others.

Kafka: Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

InfluxDB: InfluxDB is used as a data store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics.

Chronograf: Chronograf is an open-source web application written in Go and React.js that provides the tools to visualize your monitoring data and easily create alerting and automation rules.

Kapacitor: An open-source framework for processing, monitoring, and alerting on time series data.
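To make the pipeline concrete, here is a minimal sketch that writes one timestamped metric to InfluxDB and reads it back. It assumes a local InfluxDB 1.x instance and the `influxdb` Python client; the host, database, measurement, and field names are illustrative assumptions, not part of any prescribed setup.

```python
# Minimal sketch: write one timestamped metric to InfluxDB (1.x Python client).
# Host, port, database, measurement, and field names are illustrative assumptions.
from datetime import datetime, timezone

from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086, database="app_metrics")
client.create_database("app_metrics")  # no-op if the database already exists

point = {
    "measurement": "api_response_time",                # hypothetical measurement
    "tags": {"service": "orders", "region": "in-west"},
    "fields": {"value_ms": 123.4},
    "time": datetime.now(timezone.utc).isoformat(),
}

# write_points() takes a list of points and returns True on success.
if client.write_points([point]):
    print("metric written")

# Query recent values back (e.g. to power a Chronograf or Grafana panel).
result = client.query("SELECT value_ms FROM api_response_time LIMIT 5")
print(list(result.get_points()))
```

In a full TICK deployment, Telegraf would collect such metrics automatically, Chronograf would visualize them, and Kapacitor would alert on them; the snippet only shows the shape of the data InfluxDB stores.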

[Figure: TICK]

4.1.6 References

4.1.6.1 Websites

4.1.6.2 Videos and Tutorials

4.2 Canary Releasing Process

[Figure: Canary]

4.2.1 Canary strategy

Implementing canary releases means adjusting our deployment strategy. Our new process might look like this:

  • Development: We work on a new feature using a local copy of the app.
  • Continuous: The latest version of our app is hosted on a private server. This allows us to review it in an environment closer to production and to get peer review from our colleagues. If we discover a bug here that our test suite didn’t catch, we go back to the first step and fix it.
  • Staging: At this point we’re happy with the state of our app. The feature is working as expected and any bugs we found are fixed. The latest version of the website is pushed to another private website where customers can review it.
  • Canary Deploy: A canary production release is spun up straight after staging, but not shown to any real users yet. This gives us an opportunity to check that everything is okay first.
  • Canary Trial: A percentage of traffic to our live website is directed to this canary instance, so we can test with a subset of users.

Following a trial, we have two options.

  • Canary Release: Everything went well; new users on the canary website are not experiencing any new errors. The canary app scales up and replaces the old production app, and all users are now on the latest version of production. This is analogous to the ‘flip’ step from before.
  • Canary Rollback: Users on the canary website are experiencing errors, so we turn off the canary instance, and they are (invisibly) sent back to the old production site. All users are now on the existing production site and we can fix the bugs we caught before pushing again.

Canary release is a technique that helps reduce the impact of negative changes by gradually rolling out the changes. If a problem with the new release is detected during the rollout then it can be rolled back, and only a subset of the traffic will have been impacted.
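To make the trial/release/rollback decision concrete, the sketch below models the kind of logic a deployment script might apply: ramp a percentage of traffic to the canary, watch its error rate, and either promote or roll back. The ramp steps, the error threshold, and the `get_error_rate`/`set_canary_weight` hooks are illustrative assumptions, and both hooks are simulated here so the sketch runs on its own.

```python
# Hedged sketch of a canary trial loop: ramp traffic, watch the error rate,
# then promote or roll back. Thresholds, ramp steps, and the metric source
# are assumptions; the two hook functions stand in for real monitoring and
# routing integrations and are simulated so the example runs end to end.
import random
import time

RAMP_STEPS = [1, 5, 25, 50]     # % of traffic sent to the canary (assumed)
MAX_ERROR_RATE = 0.01           # tolerate up to 1% errors (assumed threshold)
OBSERVATION_SECONDS = 1         # use e.g. 300 in a real rollout


def get_error_rate(instance: str) -> float:
    """Simulated metric; in practice, query your metrics store (e.g. InfluxDB)."""
    return random.uniform(0.0, 0.02)


def set_canary_weight(percent: int) -> None:
    """Simulated router update; in practice, reconfigure the load balancer."""
    print(f"canary now receives {percent}% of traffic")


def run_canary_trial() -> bool:
    """Return True if the canary was promoted, False if it was rolled back."""
    for percent in RAMP_STEPS:
        set_canary_weight(percent)
        time.sleep(OBSERVATION_SECONDS)
        if get_error_rate("canary") > MAX_ERROR_RATE:
            set_canary_weight(0)    # Canary Rollback: users silently return to production
            return False
    set_canary_weight(100)          # Canary Release: canary replaces the old production app
    return True


if __name__ == "__main__":
    print("promoted" if run_canary_trial() else "rolled back")
```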

[Figure: Canary]

4.2.2 Canary deployment with NGINX

Let us now describe an example traffic-control configuration that we want to achieve (as shown in the figure below):

  • We want to deploy three separate instances of the application. We call these versions “alpha” (early-adopter, least-tested version), “beta” (believed to be ready for general availability), and “ga” (hardened, general-availability version).

  • We identify clients coming from the “employees” and “tester” groups based on their public IP addresses. We want to send 100% of traffic from “employees” and 30% of traffic from the “tester” group to the “alpha” instance. The “beta” instance will get the rest of the “tester” traffic and also 1% of the public traffic. Finally, 99% of the public traffic should go to the “ga” instance.

Using NGINX, a high-performance reverse-proxy server, we can split the incoming requests according to these rules.
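The NGINX configuration itself (typically built from directives such as `geo` and `split_clients`) is covered in the referenced article; as a rough, language-neutral illustration, the sketch below simulates the routing policy described above so the percentages can be sanity-checked before they are encoded in the proxy. The group names and instance names are taken from the example; everything else is an assumption.

```python
# Sketch of the canary routing policy described above (not an NGINX config):
#   employees -> 100% alpha
#   testers   -> 30% alpha, 70% beta
#   public    -> 1% beta, 99% ga
import random
from collections import Counter


def route(group: str) -> str:
    """Return the instance ('alpha', 'beta', or 'ga') chosen for one request."""
    r = random.random()
    if group == "employees":
        return "alpha"
    if group == "testers":
        return "alpha" if r < 0.30 else "beta"
    # everyone else is public traffic
    return "beta" if r < 0.01 else "ga"


# Sanity check: simulate 100,000 public requests and inspect the split.
counts = Counter(route("public") for _ in range(100_000))
print(counts)  # roughly 99% 'ga', 1% 'beta'
```

In NGINX itself, `split_clients` typically performs a comparable percentage split deterministically by hashing a request variable, and `geo` can map client IP ranges to the “employees” and “tester” groups.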

For more information, see the links in the references below.

[Figure: Canary deployment with NGINX]

4.2.3 References

4.2.3.1 Websites

Canary Release

Canary release in AWS

Feature toggle tools

Canary deployment with nginx

How do Facebook and Google manage software releases without causing major problems

4.2.3.2 Videos and Tutorials

4.2.3.3 Gurus and Blogs

Guru: Martin Fowler

Blogs:

Canary Release

Parallel Change

4.3 Toggle Architecture

[Figure: Feature toggle]

Feature toggles are a powerful technique, allowing teams to modify system behavior without changing code. They fall into various usage categories, and it’s important to take that categorization into account when implementing and managing toggles. Toggles introduce complexity. We can keep that complexity in check by using smart toggle implementation practices and appropriate tools to manage our toggle configuration, but we should also aim to constrain the number of toggles in our system.

“Feature Toggling” is a set of patterns which can help a team deliver new functionality to users rapidly but safely. The rest of this section digs into the details, covering specific patterns and practices which will help a team succeed with feature toggles.

Feature toggle architecture is used for releasing software updates (new features) that can be rolled back when something goes wrong with the new release. New features are introduced to users and their behaviour is observed.

A feature flag changes the runtime behavior of your application depending on a configuration. The configuration can be:

  • per user or user attributes (country, name, group membership etc.),
  • per environment (machines, scaling units, tenants, networks etc.),
  • randomly (X percent of all users)

or a mix of the above. This is really powerful, because you can develop long-term features inside the master branch and release them when you are ready. But it is also dangerous, because you have to maintain compatibility between your features on all levels (persistence, UI, etc.), and the complexity of testing all runtime alternatives may increase dramatically.
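As a minimal sketch of these three configuration dimensions, the snippet below evaluates a flag per user attribute, per environment, and as a percentage rollout. The flag names, rule format, and hashing scheme are illustrative assumptions rather than any specific framework’s API.

```python
# Minimal feature-flag evaluation sketch covering the three cases above:
# per user attribute, per environment, and a random/percentage rollout.
# Flag names, rules, and user attributes are illustrative assumptions.
import hashlib
from dataclasses import dataclass


@dataclass
class User:
    name: str
    country: str
    group: str


FLAGS = {
    "new-checkout-ui": {"countries": {"IN"}},          # per user attribute
    "verbose-logging": {"environments": {"staging"}},  # per environment
    "dark-mode": {"percentage": 10},                   # 10% of all users
}


def _bucket(flag: str, user: User) -> int:
    """Stable 0-99 bucket per (flag, user) so a user keeps the same variant."""
    digest = hashlib.sha1(f"{flag}:{user.name}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user: User, environment: str) -> bool:
    rule = FLAGS.get(flag)
    if rule is None:
        return False                      # unknown flag: feature stays off
    if "countries" in rule:
        return user.country in rule["countries"]
    if "environments" in rule:
        return environment in rule["environments"]
    if "percentage" in rule:
        return _bucket(flag, user) < rule["percentage"]
    return False


u = User(name="asha", country="IN", group="beta-testers")
print(is_enabled("new-checkout-ui", u, environment="production"))  # True
print(is_enabled("dark-mode", u, environment="production"))        # stable per user
```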

4.3.1 Feature Flag Frameworks

The first instinct is to just use the configuration system of your application and write your own framework for feature flags. But when you think more about it, this has some disadvantages. A framework for feature flags should:

  • Allow the management of the flags outside of your application
  • Allow you to change the configuration during runtime without any downtime
  • Switch the configuration at once (on all servers and in all components)
  • Have a minimal fingerprint / a very high performance
  • Be failsafe (return a default value when the service is not available)
  • Allow you to change the configuration per user, machine, percentage

Implementing a framework that meets these requirements is pretty complex.
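To illustrate just the failsafe requirement from the list above, here is a hedged sketch of a client-side wrapper that serves a cached or default value when the flag service cannot be reached; the fetch call, the cache TTL, and the flag name are assumptions.

```python
# Sketch of the "failsafe" requirement: if the flag service is unreachable,
# return a safe default instead of failing the request. The fetch function,
# cache TTL, and flag name are illustrative assumptions.
import time

_CACHE: dict[str, tuple[bool, float]] = {}   # flag -> (value, fetched_at)
CACHE_TTL_SECONDS = 30


def fetch_flag_from_service(flag: str) -> bool:
    """Placeholder for a network call to a central flag service."""
    raise ConnectionError("flag service unavailable")


def flag_value(flag: str, default: bool = False) -> bool:
    cached = _CACHE.get(flag)
    if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]                          # fast path: serve from cache
    try:
        value = fetch_flag_from_service(flag)
        _CACHE[flag] = (value, time.time())
        return value
    except Exception:
        return cached[0] if cached else default   # failsafe default


print(flag_value("new-checkout-ui"))  # False: service down, no cache, so the default wins
```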

There are a lot of open-source frameworks for different languages. For Java there are Togglz, FF4J, Fitchy, and Flip. For .NET there are FeatureSwitcher, NFeature, FlipIt, FeatureToggle, and FeatureBee. Some use strings, some enums, and some classes, but none has a highly scalable backend and a portal to manage your flags (at least none that we know of).

4.3.2 Feature Flags and Technical Debt

If you start with feature flags, the chances are high that things get really complex after some time. So when Jim Bird writes that Feature Toggles are one of the Worst Kinds of Technical Debt, it is for a reason. So how do you use feature flags “the right way”?

The first thing is that not all feature flags are the same, and you should not treat them that way. There are short-lived feature flags that are used to roll out new features or conduct experiments; they live for some time and then go away. But there are also feature flags that are intended to stay, like flags that handle licensing (advanced features etc.). And there are mid-term flags for major features that take a long time to develop. So the first thing to do is to create a naming convention for the flags. You may prefix your flag names with short-, temp-, mid-, or something like that, so everyone knows how the flag is intended to be used. Make sure to use meaningful names, especially for the long-lived flags, and manage them together with a long description in a central place.

Mid- and long-term flags should be applied at a pretty high level, such as when bootstrapping your application or switching between microservices. If you find a mid- or long-term flag in a low-level component, you can bet this is technical debt.

Short-term flags are different. They may need to reside at different levels and are therefore more complex to handle. It is a good idea to use special branches to manage the cleanup of flags: right when you introduce a new feature flag, you create a cleanup branch that removes the flag and submit a pull request for it.
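One lightweight way to apply this naming convention is a central registry that records each flag’s prefix, owner, description, and (for short-lived flags) its cleanup reference. The entries and field names below are purely illustrative.

```python
# Sketch of a central flag registry following the naming convention above:
# the prefix signals intended lifetime, and every flag carries an owner,
# a description, and (for short-lived flags) a cleanup-branch reference.
# All names and fields here are illustrative assumptions.
FLAG_REGISTRY = {
    "temp-new-onboarding-flow": {
        "owner": "growth-team",
        "description": "Short-lived rollout flag for the revamped onboarding flow.",
        "cleanup_branch": "cleanup/remove-new-onboarding-flow",
    },
    "mid-orders-microservice-switch": {
        "owner": "platform-team",
        "description": "Mid-term flag switching between the monolith and the orders microservice.",
    },
    "long-advanced-analytics-license": {
        "owner": "product",
        "description": "Long-lived licensing flag gating advanced analytics features.",
    },
}
```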

4.3.3 A/B Testing

A/B testing (also known as split testing or bucket testing) is a method of comparing two versions of a webpage or app against each other to determine which one performs better. A/B testing is essentially an experiment where two or more variants of a page are shown to users at random, and statistical analysis is used to determine which variation performs better for a given conversion goal.

Running an A/B test that directly compares a variation against the current experience lets you ask focused questions about changes to your website or app, and then collect data about the impact of those changes. Testing takes the guesswork out of website optimization and enables data-informed decisions that shift business conversations from “we think” to “we know.” By measuring the impact that changes have on your metrics, you can ensure that every change produces positive results.
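The statistical-analysis step can be as simple as a two-proportion z-test on conversion counts. The sketch below, with made-up example numbers, compares the conversion rates of variants A and B and reports a two-sided p-value; it illustrates the idea only and leaves out concerns such as sample-size planning and sequential or multiple testing.

```python
# Two-proportion z-test sketch for an A/B test on conversion rate.
# The visitor and conversion counts below are made-up example numbers.
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) for H0: the rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Variant A: 4,000 visitors, 400 conversions (10%);
# Variant B: 4,000 visitors, 460 conversions (11.5%).
z, p = two_proportion_z_test(400, 4000, 460, 4000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these example numbers the p-value comes out around 0.03, so at a 5% significance level we would conclude that variant B’s lift is unlikely to be due to chance alone.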

4.3.3.1 Why A/B Testing?

  • A/B testing allows individuals, teams, and companies to make careful changes to their user experiences while collecting data on the results.
  • A/B testing can be used consistently to continually improve a given experience, improving a single goal like conversion rate over time.

4.3.3.2 A/B Testing Process

[Figure: A/B testing process]

4.3.4 Use Cases

(1) Deploy latent code

(2) Develop new/impactful features on the same trunk/master without affecting release-priority features

(3) Ship alternate code paths within one deployable unit and choose between them at runtime

4.3.5 References

4.3.5.1 Websites

THERE IS NO DEVOPS WITHOUT FEATURE FLAGS!

Feature toggle benefits and risk

Different A/B testing tools

A/B Testing

4.3.5.2 Videos and Tutorials

This provides more insight about feature toggles. Basic concepts are explained.

This video explains more details about feature toggles.

4.3.5.3 Gurus and Blogs

Guru: Martin Fowler

Blog: Feature toggle

4.4 Routing Technology

4.4.1 Blue-green deployments

Advanced deployment strategies such as blue-green deployments and rolling deployments are critical for managing multinode installations of applications that must be updated without interruption. Blue-green deployments fit these requirements because they provide smooth transitions between versions, zero-downtime deployments, and quick rollback to a known working version.

Blue-green deployment is a release technique that reduces downtime and risk by running two identical production environments called Blue and Green. At any time, only one of the environments is live, with the live environment serving all production traffic. For this example, Blue is currently live and Green is idle. As you prepare a new release of your software, deployment and the final stage of testing takes place in the environment that is not live: in this example, Green. Once you have deployed and fully tested the software in Green, you switch the router so all incoming requests now go to Green instead of Blue. Green is now live, and Blue is idle.

This technique can eliminate downtime due to application deployment. In addition, blue-green deployment reduces risk: if something unexpected happens with your new release on Green, you can immediately roll back to the last version by switching back to Blue.
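As a hedged illustration of the switch-and-rollback idea, the sketch below models the router as a single pointer that flips between the two environments; in practice the router would be a load balancer, DNS record, or platform route, and `smoke_test` here is a placeholder for the final stage of testing.

```python
# Sketch of the blue-green switch described above: the "router" is modelled
# as a single pointer to the live environment. Version labels and the
# smoke_test() placeholder are illustrative assumptions.
environments = {"blue": "v1.4", "green": None}   # Blue is live, Green is idle
live = "blue"


def smoke_test(env: str) -> bool:
    """Placeholder verification of the idle environment (always passes here)."""
    return True


def deploy_and_switch(new_version: str) -> str:
    """Deploy to the idle environment, then flip the router if tests pass."""
    global live
    idle = "green" if live == "blue" else "blue"
    environments[idle] = new_version     # deploy the new release to the idle env
    if smoke_test(idle):
        live = idle                      # switch the router: idle becomes live
    return live


print(deploy_and_switch("v1.5"))  # -> 'green': Green is now live, Blue is idle
# Rollback is the same flip in reverse: point the router back at Blue.
```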

[Figure: Blue-green deployment]

4.4.2 Blue-green deployment with Pivotal Cloud Foundry implementations

  • Autopilot: Autopilot is a Cloud Foundry Go plugin that provides a subcommand, zero-downtime-push, for hands-off, zero-downtime application deploys.
  • BlueGreenDeploy: cf-blue-green-deploy is a plugin, written in Go, for the Cloud Foundry Command Line Interface (cf CLI) that automates a few steps involved in zero-downtime deploys.

4.4.3 References

4.4.3.1 Websites

Blue-Green Deployment with Pivotal Cloud Foundry

How does blue-green deployment work with AWS

AWS services used in Blue/Green deployments?

4.4.3.2 Videos and Tutorials

4.4.3.3 Gurus and Blogs

Guru: Martin Fowler

Blog: Blue Green Deployment

4.5 Visualization and Instrumentation

[Figure: Data Visualization]

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.


Industry-renowned data visualization expert Edward Tufte once said: “The world is complex, dynamic, multidimensional; the paper is static, flat. How are we to represent the rich visual world of experience and measurement on mere flatland?” He’s right: There’s too much information out there for knowledge workers to effectively analyze — be they hands-on analysts, data scientists, or senior execs. More often than not, traditional tabular reports fail to paint the whole picture or, even worse, lead you to the wrong conclusion. AD&D pros should be aware that data visualization can help for a variety of reasons:

  • Visual information is more powerful than any other type of sensory input. Dr. John Medina asserts that vision trumps all other senses when it comes to processing information; we are incredible at remembering pictures. Pictures are also more efficient than text alone because our brain considers each word to be a very small picture and thus takes more time to process text. When we hear a piece of information, we remember 10% of it three days later; if we add a picture, we remember 65% of it. There are multiple explanations for these phenomena, including the fact that 80% to 90% of information received by the brain comes through the eyes, and about half of your brain function is dedicated directly or indirectly to processing vision.
  • We can’t see patterns in numbers alone . . . Simply seeing numbers on a grid doesn’t always give us the whole story — and it can even lead us to draw the wrong conclusion. Anscombe’s quartet demonstrates this effectively; four groups of seemingly similar x/y coordinates reveal very different patterns when represented in a graph.
  • . . . and a single visualization may not tell the whole story. Even after using data visualization to understand a pattern or to see a correlation, the numbers alone don’t tell the whole story. Arranging data visualizations in a sequence to demonstrate cause and effect, journey time, strategy (such as current state, target state, or road map), and other stories can create powerful presentations that make an impact.
  • It’s often the only way to fit all the data points onto a single screen. Even with the smallest reasonably readable font, single line spacing, and no grid, you can’t fit more than a few hundred numbers onto a screen. However, data visualization techniques allow you to fit tens of thousands of data points — a difference of several orders of magnitude — into a single figure that fits on a screen. Edward Tufte gives an example that displays more than 21,000 data points on a US map that fits onto a single screen.
  • It’s hard to analyze broad data sets without it. While fitting deep data sets that contain billions of rows of data onto a single screen can be challenging, various data aggregation and grouping techniques can solve this challenge. But analyzing broad data sets that contain hundreds, and often thousands, of columns is an entirely different challenge. At most, basic data visualizations can show five attributes (two axes plus the size, color, and shape of the metric) — or six if you visualize the data on a map to show location. But a pharma company analyzing typical patient characteristics in a drug trial uses thousands of attributes — physical, psychological, genetic, behavioral, etc. — and so requires different data visualization techniques.
  • It “democratizes” data consumption for a wider audience. Data visualization is no longer just the realm of analysts; it helps executives, operations departments, and nontechnical business professionals like brand marketers consume insights from data analytics and make better business decisions.

A few open-source tools available in the market for data visualization are Metabase, Dashing, Grafana, etc.

4.5.2 References

4.5.2.1 Websites

Data Visualization

Build More Effective Data Visualizations

4.5.2.2 Videos and Tutorials