1. Delivery Infrastructure

Applications are the engines that drive competitive advantage and new business opportunities. When modern-day organizations say they need to become more agile from a business standpoint, they are talking about delivering new applications more quickly. This means delivering complex multi-tier applications across development, test, staging and production.1

A new application delivery typically involves the following:

  1. Infrastructure Identification, Provisioning and Infosec Assessment – Provides the fundamental building blocks of computing resources – servers (compute), storage, network, load balancers, etc.
  2. Software Platform and Database Provisioning – Provides programming-language runtimes and access to services such as databases, web servers and file storage, without the developer having to deal with lower-level requirements like how much space a database needs, whether the data must be protected by replicating it across three servers, or how the workload is distributed across servers spread throughout the world. Typically, applications must be written for a specific PaaS offering to take full advantage of the service, and most platforms support only a limited set of programming languages.
  3. Application Software Installation and Provisioning – The actual application software created by developers to solve a particular business problem.

In a more mature organization, Infra and Applications are loosely coupled, so Infra provisioning happens in parallel with Application delivery and does not impact the overall timeline for application deployment. In cloud-like environments, once set up properly, Infra provisioning is a matter of a few clicks on a portal.

In evolving organizations (like us), Infra provisioning and Software Platform provisioning take about 5-6 weeks of effort and are then handed over to the Application Team for application deployment, which usually takes another 1-2 weeks to get a running Alpha release into production. Unless it is planned properly, most of the effort happens sequentially. It involves a lot of coordination and discussion among multiple stakeholders to ensure that Infra provisioning happens while the application is being architected and developed.

The same holds true for upgrade and update cycles, whereby either the base OS is updated or the machine is replaced.

The following activities are involved in application development and deployment:

  1. Infrastructure Identification, Provisioning and Infosec assessment
    • IGF (Information Gathering Form) Submission – For server hall, rack and server identification, disk partitioning, zoning
    • Cabling and Switch Configuration
    • OS and ILO Setup
    • Infosec assessment – BAVA
    • Monitoring Tools setup
    • User Account Creation (PIM and HPSM)
    • Architecture Review and App Assessment
    • Self-Signed Certification
    • Network Configuration (VIP, NATing, Port Opening, DNS and Firewall configuration)
  2. Software Platform and Database Provisioning
    • Programming Language Environment and Runtime setup
    • Database installation and configuration
    • Software Load Balancer installation and configuration
  3. Application Software Installation and Provisioning
    • Deployment Scripts and System configuration
    • Application Software Installation and configuration
    • Application Monitoring Tools configuration

The figure below depicts the typical lifecycle of the above activities and the time each consumes.

Through the use of automation (particularly in Infra provisioning), many of the activities above can be reduced to days or even minutes, leaving developers enough bandwidth to focus on application development and validation. Availability of infrastructure and environments at the click of a button is the magic that is needed. In turn, the journey of a release from code change to its finally deployed state in the production environment can also be automated and accelerated.

The need of the hour for us is to adopt elastic infrastructure (explained in the following sections) so that the above challenges in Infra provisioning are abstracted away from the application development lifecycle, and both activities can happen independently, without any linkage.

Broadly, delivery infrastructure can be covered under five areas:

1.1 Elastic Infrastructure

Forrester defines an elastic application platform as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.

Elastic Infrastructure helps developers create elastic applications by reducing the art of elastic architectures to the science of a platform. It should provide tools, frameworks, and services that automate many of the more complex aspects of elasticity. These include all the runtime services needed to manage elastic applications, full instrumentation for monitoring workloads and maintaining agreed-upon service levels, cloud provisioning, and, as appropriate, metering and billing systems. It should make it normal for enterprise developers to deliver elastic applications — something that is decidedly not the norm today in our context.1

“If a service is not scalable and elastic, then it may not be shareable enough. If it is not metered by use, then it may not allow for flexible pricing. Support for more of the attributes opens the door to a great value proposition to the consumer, and greater flexibility and potential cost reduction for the provider.”2

Typically, elastic infrastructure involves providing the elastic resources for Infra, Platform and Software as depicted below:

IaaS - Infrastructure-as-a-Service - A cloud service that enables users to get access to their own infrastructure - computers, networking resources, storage. It's worth noting that these are typically virtual resources, but they could be real, physical resources.

PaaS - Platform-as-a-Service - A cloud service that abstracts away the infrastructure (users don’t get to see the computers, load balancers, etc.) and instead provides a software development platform. It is possible to code and run an application on a PaaS, and the system makes sure that the app has the necessary infrastructure to run and scale.

SaaS - Software-as-a-Service - A cloud service that provides users access to software in a self-service, on-demand fashion. This could be a single application or a catalog of software a user might choose from.

1.1.1 Characteristics

A programmable elastic infrastructure exhibits the following characteristics:

  • It should adapt to changing business needs based on automated practices that use data to analyze and then program instances that auto-scale with expected increases or decreases in demand.
  • It is “cost aware architecture,” meaning that the infrastructure drives application development, as opposed to the other way around. Embodied in this is the increasing requirement for the applications to be controllable, resilient, adaptive and data driven.
  • It should be a collection of small, decomposable Lego-like blocks that can be decoupled from the infrastructure.
  • Categorize infrastructure blocks by their manner of allocation. For example: Spot Instances, Reserved Instances and Standard Instances, based on the need for elasticity.
  • Discover Servers. Infrastructure should maintain a host information database that can be used to build automation and monitoring services for the infrastructure. It should solve the problem of finding hosts (physical and virtual) and their purposes in a large environment. For example: HostDB3
  • Maintain State. Infrastructure should be able to maintain the state of resources and enable developers to manage them. Automation tools should help deploy faster, with greater reliability. For example: Puppet4
  • Elastic Infrastructure should allow workloads to span different underlying cloud platforms as may be required, to utilize unique services available from different platforms or for purposes of scaling and availability.5
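The auto-scaling characteristic above can be sketched as a simple control decision. The sketch below is illustrative only; the target utilization, proportional rule and instance limits are assumptions, not real scheduler defaults:

```python
import math

def desired_instances(current, cpu_samples, target=0.60, min_n=2, max_n=20):
    """Return the instance count that brings average CPU utilization near target."""
    avg = sum(cpu_samples) / len(cpu_samples)
    # Proportional rule: running at 90% on 3 instances with a 60% target
    # needs ceil(3 * 0.90 / 0.60) = 5 instances.
    needed = math.ceil(current * avg / target)
    # Clamp to the allowed fleet size.
    return max(min_n, min(max_n, needed))

print(desired_instances(3, [0.90, 0.80, 1.00]))   # demand is high: scale out
print(desired_instances(10, [0.20, 0.25, 0.15]))  # demand is low: scale in
```

A real platform would feed this decision from monitoring data and apply it through the resource management APIs, but the data-driven "analyze, then program instances" loop is the same.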

1.1.2 Components of Elastic Infrastructure

The components of elastic infrastructure are:

  1. Self Service API - A set of APIs used to interface with the elastic infrastructure for provisioning elastic resources (VMs, storage, network).
  2. Metering and Billing - Accounts for the usage of elastic resources in terms of duration and capacity utilized.
  3. Resource Management - Manages (add/delete/modify) the resources available to the elastic infrastructure and their allocation/de-allocation as elastic resources.
  4. Image Storage - Provides various OS + software packages bundled together as images, based on what developers generally use to host their applications.
  5. Virtual Servers - Has the capability to spin up compute instances based on the requested configuration (RAM, cores).
  6. Network Functions - Provides the functionalities of DNS, load balancing and NATing.
  7. Monitoring and Operations - Provides monitoring of the underlying hardware/storage/network capabilities and the virtual instances created on them.
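To show how these components meet in a single Self Service API call, the sketch below builds the request body for a hypothetical `POST /v1/servers` endpoint. The endpoint, field names and image name are illustrative assumptions, not any real provider's API:

```python
import json

def provision_request(name, cores, ram_gb, image, vip):
    """Build the body of a hypothetical POST /v1/servers provisioning call."""
    return {
        "name": name,
        "flavor": {"cores": cores, "ram_gb": ram_gb},  # Virtual Servers
        "image": image,                                # Image Storage
        "network": {"vip": vip},                       # Network Functions
        "metered": True,                               # Metering and Billing hook
    }

body = provision_request("app-01", cores=4, ram_gb=8,
                         image="base-os-with-webserver", vip="10.0.1.15")
print(json.dumps(body, indent=2))
```

Resource Management would validate the request against available capacity, and Monitoring and Operations would begin tracking the instance once created.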

1.1.3 References

1.1.3.1 Videos

  • A talk describing current best practices and software for creating an auto-scaling infrastructure. It showcases an elastic infrastructure based on proven methods and open source software (Flipkart HostDB and Puppet), enabling a platform that in turn allows application engineers to create massively scalable web apps without losing sleep.
  • Netflix Containers

1.2 Continuous Delivery (CD) Pipeline

1.2.1 What is Continuous Delivery6

Continuous delivery is a DevOps software development practice where code changes are automatically built, tested, and prepared for release to production. It expands upon continuous integration by deploying all code changes to a testing environment and/or a production environment after the build stage. When continuous delivery is implemented properly, developers always have a deployment-ready build artifact that has passed through a standardized test process.

With continuous delivery, every code change is built, tested, and then pushed to a non-production testing or staging environment. There can be multiple, parallel test stages before a production deployment. In the last step, the developer approves the update to production when they are ready. This is different from continuous deployment, where the push to production happens automatically without explicit approval.

Continuous delivery lets developers automate testing beyond just unit tests, so they can verify application updates across multiple dimensions before deploying to customers. These tests may include UI testing, load testing, integration testing, API reliability testing, etc. This helps developers more thoroughly validate updates and pre-emptively discover issues. Elastic Infrastructure shall make it easy and cost-effective to automate the creation and replication of multiple environments for testing, which is otherwise difficult to do on-premises. A diagram is shown below to capture the linkages between the three process work-flows.
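The flow above can be sketched as a toy pipeline. The stage names and the explicit production approval gate (which distinguishes continuous delivery from continuous deployment) are illustrative:

```python
def build(change):
    return {"artifact": change + ".tar.gz"}

def unit_tests(artifact):
    return True   # placeholder for a real unit test suite

def integration_tests(artifact):
    return True   # placeholder for UI / load / API reliability tests

def deploy(artifact, env):
    return env + "://" + artifact["artifact"]

def pipeline(change, approved_for_prod=False):
    artifact = build(change)
    for stage in (unit_tests, integration_tests):  # possibly parallel stages
        if not stage(artifact):
            return "failed"
    deploy(artifact, "staging")
    # Continuous delivery stops here until a human approves the release;
    # continuous deployment would push to production automatically.
    if approved_for_prod:
        deploy(artifact, "production")
        return "deployed"
    return "ready-for-release"

print(pipeline("commit-abc123"))                          # ready-for-release
print(pipeline("commit-abc123", approved_for_prod=True))  # deployed
```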

1.2.2 Open source tools7:

  • Ansible – A radically simple IT automation engine that automates cloud provisioning, configuration management, application deployment, and intra-service orchestration. It models the IT infrastructure by describing how all systems inter-relate, rather than just managing one system at a time.
  • Chef – An open source software agent that automates infrastructure by turning it into code. The infrastructure becomes dynamic, versionable, human-readable, and testable, regardless of whether it’s deployed in the cloud, on-premises, or in a hybrid environment.
  • Puppet – An open source configuration management tool that provides a standard way of delivering and operating software, no matter where it runs. It defines apps and infrastructure using a common, easy-to-read language.
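All three tools share one core idea: describe the desired state declaratively and apply it idempotently, so a second run changes nothing. A minimal sketch of that idea for a single "file with given content" resource (not how any of these tools is implemented, just the principle):

```python
import os
import tempfile

def ensure_file(path, content):
    """Converge a file to the desired content; report whether a change was made."""
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return False        # already in the desired state: no-op
    with open(path, "w") as f:
        f.write(content)
    return True                     # state changed on this run

cfg = os.path.join(tempfile.mkdtemp(), "demo.conf")
print(ensure_file(cfg, "port=8080\n"))  # True: first run applies the change
print(ensure_file(cfg, "port=8080\n"))  # False: second run is a no-op
```

Because each run reports whether it changed anything, the same description can be applied repeatedly and safely across a whole fleet.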

1.2.3 Benefits

1.2.4 References

1.2.4.1 Websites

1.2.4.2 White papers

1.2.4.3 Videos

  • [Continuous Delivery Pipeline]

1.3 Security

Security is an important function of an organization and success of the business critically depends on it. The reactive approach towards managing security is gradually being replaced by a proactive approach. IT infrastructure is increasingly becoming more complex and diverse, while the exposure of an organization to the security threats is expanding. This necessitates planning and deployment of security countermeasures across all layers of the infrastructure, including network, server systems, endpoints, application infrastructure, messaging, database, etc. The capability of the countermeasures depends on various factors such as the architectural positioning in the IT ecosystem and manageability of solutions. Maturity of the security operations around these countermeasures, monitoring and testing efforts deployed for assessing their capabilities and their integration with the incident management mechanism contribute to the overall effectiveness of the countermeasures.

Moreover, the governance culture of an organization becomes an important element to ensure that an individual countermeasure is derived from a well-devised plan, that the operations around it are executed as per the intended purpose, and that the performance of the countermeasure is continuously monitored. The strategy for security is a proactive initiative to devise an organization's defense plan against evolving security threats, addressing multiple dimensions for structured, effective and efficient defense. Security strategy thus brings structure to security initiatives that strive to position the countermeasures for effective protection. It establishes a structured process to understand the threat landscape of an organization and to untangle the complexity of IT infrastructure, providing better options for improving the security posture. It provides different ways to better deploy and manage the security countermeasures, and it allows building operational strategies for optimizing resources and efforts. Security architecture remains an important instrument for the assured and intended functioning of the countermeasures.

Security policy is also an instrument for the executive management to articulate their commitment and intent for protecting the information assets. It serves as a tool to ensure compliance, provide assurance to stakeholders, and provide direction to the security initiatives of an organization. The security policy should be an aggregated reflection of the policy requirements that are necessary for the assured and intended functioning of the countermeasures.

1.3.1 Framework8

Given below is a framework for secure cloud computing. It consists of three essential security components, each of which includes important challenges related to cloud security and privacy. These components are:

  1. Security and privacy requirements: identifies security and privacy requirements for the cloud, such as authentication, authorization, integrity, etc.
  2. Attacks and threats: warns of the different types of attacks and threats to which clouds are vulnerable.
  3. Concerns and risks: pays attention to risks and concerns about cloud computing.

Each of these components is explained briefly in the following sub-sections.

1.3.1.1 Security and Privacy requirements

Authentication and user identification are usually accomplished by employing usernames and passwords when using a web browser or mobile device to access services. A more secure approach is to use an additional authentication factor such as a dynamic token.
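A common form of dynamic token is a time-based one-time password (TOTP, RFC 6238), as generated by authenticator apps. A minimal sketch using only the standard library; the secret below is the RFC test key, shown for illustration only:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, at=None, step=30, digits=6):
    """Compute an RFC 6238 time-based one-time password (HMAC-SHA1)."""
    key = base64.b32decode(secret_b32)
    moment = int(at if at is not None else time.time())
    counter = struct.pack(">Q", moment // step)        # 30-second time window
    mac = hmac.new(key, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                            # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: ASCII secret "12345678901234567890", T=59s -> 94287082
print(totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", at=59, digits=8))
```

The server and the user's token device share the secret and compute the same short-lived code, so a stolen password alone is not sufficient for access.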

Authorization and access control: The challenge here is how to control the access priorities, permissions and resource ownership of authenticated users on the cloud. It should be addressed through separation of duties, ensuring that the activities of privileged customers are monitored by staff, and gathering enough information on the administrators who are allowed to access customers’ information.

Confidentiality: Users’ data confidentiality can be threatened on the cloud due to multitenancy, data remanence, weak user authentication and software applications. It should be ensured that software applications interacting with customers’ data do not introduce additional confidentiality breaches and threats, and that they securely handle and maintain this data. Encryption algorithms and advanced electronic authentication methods such as 2FA are common techniques to achieve confidentiality.

Integrity: Integrity covers data accuracy and completeness and ensures Atomicity, Consistency, Isolation and Durability (ACID). The platform should maintain data integrity by preventing unauthorized access. Hash function algorithms are widely used to achieve data integrity.
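A minimal sketch of the hash-based integrity check mentioned above: a digest is stored alongside the data at write time, then recomputed on read and compared to detect tampering. SHA-256 is chosen here for illustration:

```python
import hashlib

def digest(data):
    """Return a hex SHA-256 digest of the data."""
    return hashlib.sha256(data).hexdigest()

record = b"account=42;balance=100"
stored_digest = digest(record)          # kept with the record at write time

print(digest(record) == stored_digest)                     # True: intact
print(digest(b"account=42;balance=999") == stored_digest)  # False: tampered
```

Note that a plain hash detects accidental or opportunistic modification; if the attacker can also rewrite the stored digest, a keyed construction (e.g. HMAC) or a digital signature is needed.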

Non-repudiation: ensures that the sender of a message cannot deny the message was sent and that the recipient cannot deny the message was received. This can be achieved using techniques such as digital signatures, timestamps and confirmation receipt services.

Availability: refers to cloud data, software and also hardware being available, usable and accessible to authorized users upon demand. Availability is an important factor in choosing among various CSPs. It is also essential to ensure safety of enterprise data, minimal downtime, Business Continuity (BC) and Disaster Recovery (DR). Availability can be achieved using backup and recovery schemes, fault tolerance, and replication techniques.

Compliance and Audit: Organizations must comply with regulations and laws, using a set of audits and different standards such as SAS 70, ISO 27001, the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI DSS).

Transparency: the operation of the cloud should be sufficiently clear to users and CSPs. Users must be able to get a clear overview of where and how their data will be handled. The SLA is considered one of the most important instruments to ensure transparency, since it is the only legal agreement between CSPs and customers; it contains guidance for customers on the service to be delivered, tracking and reporting, legal compliance, and security responsibility.

Governance: ensures the protection of data against various malicious activities and helps control cloud operations.

Accountability: implies that security and privacy gaps are correctly addressed.

1.3.1.2 Attacks and Threats

Wrapping attacks: these attacks occur between the web browser and the server, with the attacker altering the Simple Object Access Protocol (SOAP) messages exchanged between the user and the server. When XML signatures are used for authentication or integrity, the best-known attack is XML Signature Element Wrapping.

Browser-based attacks: a browser attack alters the signature and encryption of SOAP messages. Web browsers must be defended against attacks such as phishing, SSL certificate spoofing, and attacks on browser caches.

Metadata spoofing attacks: involve re-engineering Web Services’ metadata descriptions. To defend against this threat, verification techniques should be used.

Cloud injection attacks: attempt to inject malicious service implementation modules or virtual machine instances so that the attacker's code is executed against the user's intention. Examples of such attacks are SQL injection, OS command injection and cross-site scripting. To detect this kind of tampering, a hashing algorithm can be used to verify service integrity.

Denial of service attacks: occur when an attacker sends a lot of malicious requests to the server and consumes its available resources, CPU and memory. When the server reaches its maximum capacity, it offloads the received requests to another server. The cloud system works against the attacker by providing more computational power.

Buffer overflow attacks: when a buffer overflow occurs, the attacker is able to overwrite data that controls program execution in order to run malicious code.

Privilege escalation: exploits a vulnerability arising from programming errors and aims to access protected resources without permission.

Abuse and Nefarious Use of Cloud Computing: a cloud is a relatively open environment; consumers from everywhere can easily register to utilize its cost-effective services with simply a valid credit card. Examples of such attacks are password and key cracking, DDoS, launching dynamic attack points, hosting malicious data, and botnet command and control. Suggested remedies to mitigate this threat are: improving credit card fraud detection, applying strict registration and validation rules, and performing extensive examination of network traffic.

Insecure Interfaces and APIs: cloud consumers interact with cloud through a set of user interfaces and APIs provided by service providers. Suggested remedies to mitigate this threat are: analysing the security model of the API, employing strong authentication, access control and encryption techniques and understanding the dependency chain of the API.

Malicious Insiders: this threat is caused by employees hired by service providers. Those employees are granted a level of access that may allow them to acquire confidential data and fully control cloud services without being detected. Suggested remedies to mitigate this threat are: requiring transparency in all information security issues, defining security breach notification processes and enforcing strict hiring requirements and human resource assessment.

1.3.1.3 Concerns and Risks

Access control: how can cloud users govern access control risks when the levels and types of access control used by cloud providers are unknown?

Monitoring: how can accurate, timely and effective monitoring of security and privacy levels be achieved in business-critical infrastructure when providers are not prepared to share such information in the SLA?

Application development: how can application development and maintenance be accomplished in the cloud when the CSPs are responsible for the underlying platform?

Encryption: how can the cloud user manage encryption and assign responsibilities across the borders between the cloud service providers and his organization?

Data retention: how can the cloud user gain appropriate confidence that data has actually been securely removed from the system by the cloud provider, and not merely made inaccessible?

Testing: how can consumers test the effectiveness of security controls when such tests may not be made available by CSPs?

1.3.2 References:

1.3.2.1 Websites:

1.4 Deployment Runtime 9

Runtime is when a program is running (executing). That is, when we start a program in a computer, it is runtime for that program. In some programming languages, certain reusable programs or “routines” are built and packaged as a “runtime library.” These routines can be linked to and used by any program while it is running.

Today, every developer spends time setting up his/her environment to run the software that he/she writes. In this process, developers may end up with differing library versions, compilers/linkers, and pre/post-processors, which can lead to integration issues later. There is also a good chance of moving to the latest releases of third-party software packages that have not been evaluated for stability, security and compatibility. This leads to integration conflicts at the deployment stage. For example, a standardized monitoring program supports version ‘x’ of a programming language, whereas the developer uses version ‘x+1’ to leverage one of its advanced features. The released software then cannot integrate with the monitoring program, breaking one of the important elements of continuous deployment. The same version incompatibility can happen with test automation software. This wastes the precious time of developers and deployment engineers.

Deployment runtimes are the environments that an application requests from the platform in order to run. These typically map to compilers (Go, C/C++, etc.), interpreters (Node.js, Python, etc.), and runtime libraries/packages/modules (HTTP, parsers) that are needed to deliver the service. These runtimes represent collections of specific versions of software components that are tried, tested and certified to work in enterprise applications. This can save repeated application assessment effort.
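One way to make such certified runtimes enforceable is to have the application verify the platform's published runtime manifest at startup, rather than discovering mismatches at deployment. A minimal sketch; the manifest contents and minimum version are illustrative assumptions:

```python
import sys

# Hypothetical certified-runtime manifest published by the platform.
CERTIFIED = {"python": (3, 8)}

def runtime_certified(manifest=CERTIFIED):
    """Check the running interpreter against the certified minimum version."""
    return sys.version_info[:2] >= manifest["python"]

if runtime_certified():
    print("runtime certified")
else:
    sys.exit("uncertified runtime: expected Python >= 3.8")
```

The same check generalizes to library and tool versions, catching the monitoring-agent style incompatibility described above before the application goes live.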

With runtimes readily available, a developer is able to get the app up and running quickly, with no need for repetitive tasks to set up and manage VMs and operating systems.

1.4.1 References

1.4.1.1 Whitepapers

1.5 Monitoring 10

1.5.1 Why monitoring?

To achieve satisfactory quality of service, IT infrastructure and operations (I&O) professionals usually monitor what they believe to be the weakest service delivery infrastructure links. Over time, the focal point has moved from networks to systems to application code. But the current complexity of business services is such that issues can spring from anywhere in the service delivery chain.

1.5.1.1 Monitoring landscape

From Forrester surveys and research on this subject in North American firms, the findings are as follows:

  • Service outage events can cost companies millions per year. In evolving organizations, the time taken by IT to restore an outage is often measured in hours, causing customer attrition and negative feedback on public forums. This cost is significant for businesses, depending on their scale.
  • Identifying the failed service delivery component is the most important issue in performance management. As applications and business services increase in complexity, reducing the time to resolution of a problem hinges critically on proactively detecting service degradation and rapidly triaging its origin.
  • Organizations tend to take a bottom-up approach to service management. Tools like network performance management (NPM), application performance management (APM) and log data analytics detect, alert, and help resolve performance and availability issues. But these tools are acquired in an ad hoc fashion to address problems that occurred in the past, not as the result of strategic planning. Solving specific issues in this way results in a bottom-up approach, generating disparate data that cannot triage complex service delivery issues. MTTR remains high, with poor insight into the interrelationships and dependencies across service delivery components.
  • Infrastructure and operations decision-makers see value in a top-down approach. With end-to-end monitoring and analysis solutions, metrics like time to issue identification and time to issue resolution show significant improvements. This top-down service triage methodology relies on a consistent and cohesive set of data that provides a meaningful and contextual view of all interrelationships and dependencies across service delivery components.

1.5.1.2 Availability and Performance issues beyond Source

There are several sources of error that can cause service outages. The diagram below shows that hardware issues form the major share of these errors.

1.5.1.3 Major sources of poor response time

  • Insufficient cooperation between different teams. A major issue is the lack of cooperation between the various teams that manage different aspects of infrastructure, applications, or business services. As these teams use different, specialized, silo-specific tools, they don’t collect information from the same perspective, and their data sometimes contradicts each other, thereby hindering cooperation.
  • The difficulty of being proactive. It is extremely difficult to proactively manage service performance through traditional means, since there is no global visibility into the composite n-tier service delivery infrastructure. This difficulty is compounded by IT’s use of virtualization and internal cloud technologies, whose virtual internal clocks complicate the use of traditional monitoring agents.
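Proactive detection amounts to watching trends rather than waiting for a hard failure. A toy sketch that flags degradation when a rolling average of response times crosses a threshold; the window size and threshold are illustrative assumptions:

```python
from collections import deque

class LatencyMonitor:
    """Flag service degradation from a rolling average of response times."""

    def __init__(self, window=5, threshold_ms=500):
        self.samples = deque(maxlen=window)   # keep only the last N samples
        self.threshold_ms = threshold_ms

    def observe(self, latency_ms):
        self.samples.append(latency_ms)
        avg = sum(self.samples) / len(self.samples)
        return "ALERT" if avg > self.threshold_ms else "OK"

monitor = LatencyMonitor()
for latency in (120, 150, 400, 900, 1400):   # latency creeping upward
    status = monitor.observe(latency)
print(status)   # the rolling average has crossed the 500 ms threshold
```

A real end-to-end solution would correlate such signals across network, application and database tiers, but the principle of detecting degradation before an outage is the same.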

1.5.2 Monitoring areas

Broadly, monitoring should be integrated into the following three verticals of the delivery infrastructure:

  • Application monitoring
    • Runtime environment monitoring
    • Messages monitoring
    • Portal/Webserver monitoring
    • Application monitoring
  • Database monitoring
    • Database performance analytics
    • Database performance monitoring
  • Infrastructure monitoring
    • Network and server monitoring
    • Virtual world monitoring

1.5.3 Improvement areas

It is important to integrate monitoring capabilities to achieve effective monitoring. The improvement areas are identified below:

  • Problem Identification capability
  • End to end monitoring
  • Alerting capability
  • Analysis of problem identification
  • Employees skill set
  • Component analysis (Application code monitoring)
  • Incident management process
  • Integration of alerts from disparate monitoring tools

All Inline references