3. Self Service Data

What data workers and business analysts need is a solution that balances ease of use with the capabilities afforded by traditional analytic tools. Making data available as a self-service to different stakeholders is therefore key to identifying patterns and problems, and hence to driving solutions. An analytics stack helps achieve this data intelligence; however, such a stack is resource hungry, requires special skills to find patterns, and takes time. Self-service data is one of those organizational capabilities that, once built, keeps the organization future ready. The platform should connect with various data sources, establish data pipelines with real-time data flow, and process and archive the relevant and derived information. Developers should be able to connect to these data streams to build insights, and the intelligence extracted from this data can then be offered as a product to various stakeholders in the system. Because this captures a lot of business-sensitive information and can reveal a lot about the customers themselves, the platform should also provide easy-to-use access control.

Fig: Overview of Self Service Data

Self-service data can therefore be broken down into the following five key areas, which are explained in the sections below.

3.1 Data Pipeline

A data pipeline is a unified system for capturing events for analysis and for building products. The key components of a data pipeline are listed below; a minimal event-capture sketch follows the list.

  • Event Framework
  • Message Bus
  • Data persistence
  • Workflow management
  • Batch processing and ad-hoc analysis
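
As a minimal illustration of the event framework and message bus components, the sketch below publishes an application event onto a Kafka topic with the kafka-python client; the broker address, topic name, and event fields are illustrative assumptions.

```python
# Minimal event-capture sketch: an application publishes events onto the
# message bus (Kafka). Broker, topic, and event fields are assumptions.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "type": "page_view",                              # illustrative event type
    "user_id": "user-42",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Downstream, the routing and stream/batch processing layers consume this topic.
producer.send("app-events", value=event)
producer.flush()
```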

The Lambda architecture is one of the most widely used deployment methods for enterprise data pipelines today. Another architecture that is often used is the Kappa architecture.

3.1.1 Lambda Architecture

Fig: Lambda Architecture

3.1.1.1 Lambda Architecture Stack -

  1. Unified Log – Apache Kafka
  2. Batch Layer – Hadoop for Storage
  3. Serving Layer – MySQL, Cassandra, NoSQL or other KV Stores
  4. Real-Time Layer – Spark Streaming
  5. Visualization Layer

Apache Kafka is an open-source stream processing platform. Kafka acts as a message queue: all the data from server logs is queued up and routed either to the stream-processing or to the batch-processing path. Apache Spark is a popular data processing engine that fits the Lambda architecture well thanks to its strengths in both batch and streaming computation. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
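
To make the stack concrete, here is a minimal PySpark sketch of the two computation paths in a Lambda architecture: a speed layer that consumes the Kafka log with Spark Structured Streaming, and a batch layer that recomputes the same aggregate over history stored in Hadoop. The broker address, topic, paths, and field names are illustrative assumptions, and the Spark Kafka connector package is assumed to be on the classpath.

```python
# Sketch of the two computation paths in a Lambda architecture, assuming a
# local Kafka broker, a "server-logs" topic, and event JSON with a "timestamp"
# field. The Spark Kafka connector must be available on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, window

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Speed layer: consume the unified log (Kafka) and keep a near-real-time view.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
          .option("subscribe", "server-logs")                    # assumed topic
          .load())

speed_view = (events
              .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
              .groupBy(window(col("timestamp"), "1 minute"))
              .count())

(speed_view.writeStream
 .outputMode("complete")
 .format("memory")                 # stand-in for the real serving layer
 .queryName("realtime_counts")
 .start())

# Batch layer: recompute the same aggregate over the full history in Hadoop.
history = (spark.read.json("hdfs:///data/server-logs/")          # assumed path
           .withColumn("ts", to_timestamp(col("timestamp"))))
batch_view = history.groupBy(window(col("ts"), "1 minute")).count()
batch_view.write.mode("overwrite").parquet("hdfs:///views/batch_counts")
```

In a production deployment the serving layer would be a key-value store such as Cassandra, with the visualization layer querying the merged batch and real-time views.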

3.1.2 Kappa Architecture

Fig: Kappa Architecture

Unlike the Lambda architecture, the Kappa architecture contains only two layers: the speed (real-time) layer and the serving layer. There is no batch layer; historical data is reprocessed by replaying the stream from the log through the same real-time code path. Apache Flink fits this architecture best.
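
A minimal PyFlink sketch of the Kappa idea follows: a single stream-processing code path that serves both live traffic and replayed history. The in-memory source below merely stands in for a Kafka topic, and all names and values are illustrative assumptions.

```python
# A Kappa-style job sketch: one streaming code path handles both live and
# replayed (historical) events. Names and values are illustrative assumptions.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# In a real deployment this would be a Kafka source; replaying the topic from
# offset 0 reprocesses history through the very same pipeline.
events = env.from_collection([
    ("user1", "play"),
    ("user2", "pause"),
    ("user1", "play"),
])

# The single stream-processing path: count events per user.
counts = (events
          .map(lambda e: (e[0], 1))
          .key_by(lambda e: e[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

counts.print()
env.execute("kappa_sketch")
```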

So when should we use one architecture or the other? As is often the case, it depends on some characteristics of the application that is to be implemented. Let’s go through a few common examples:

A very simple case to consider is when the algorithms applied to the real-time data and to the historical data are identical. Then it is clearly very beneficial to use the same code base to process historical and real-time data, and therefore to implement the use-case using the Kappa architecture.

Now, the algorithms used to process historical data and real-time data are not always identical. In some cases, the batch algorithm can be optimized because it has access to the complete historical dataset, and can then outperform the real-time implementation. Here, choosing between Lambda and Kappa becomes a trade-off between batch execution performance and code base simplicity.

Finally, there are even more complex use-cases, in which even the outputs of the real-time and batch algorithms differ. For example, consider a machine learning application where generating the batch model requires so much time and so many resources that the best result achievable in real time is computing approximate updates to that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used.
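
As a toy, library-free illustration of this last case, the sketch below recomputes an exact statistic in the batch layer while the speed layer applies only cheap, approximate incremental updates between batch runs.

```python
# Toy sketch: exact batch recomputation vs. approximate real-time updates.
# The "model" here is just a mean, standing in for an expensive batch model.

def batch_recompute(history):
    """Batch layer: exact but expensive, runs over the full dataset."""
    return sum(history) / len(history)

def realtime_update(current_estimate, new_value, alpha=0.05):
    """Speed layer: cheap approximate update applied per incoming event."""
    return (1 - alpha) * current_estimate + alpha * new_value

history = [10.0, 12.0, 11.0, 13.0]          # illustrative historical data
model = batch_recompute(history)             # e.g. refreshed nightly

for event in [15.0, 14.0, 16.0]:             # events arriving before the next batch run
    model = realtime_update(model, event)    # approximate until batch catches up

print(round(model, 2))
```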

3.1.3 Netflix's Data Pipeline - KEYSTONE

One example of a data pipeline design is Keystone, used by Netflix. Its architecture is similar to the Lambda architecture.

Fig: Keystone - Netflix's data pipeline

It is also a Kafka-fronted data pipeline. It contains two Kafka tiers: the fronting Kafka and the secondary Kafka. The fronting Kafka forwards events to the routing service, which is responsible for moving data from the fronting Kafka to various sinks: S3, Elasticsearch, and the secondary Kafka, which forwards data on for further processing. It can handle up to:

  • 500 billion events and 1.3 PB per day
  • 8 million events and 24 GB per second during peak hours.

3.1.4 References for further study

3.1.4.1 Websites

3.1.4.2 Videos

The speaker, Joe Croback, explains how a data pipeline is created and describes the different components of a data pipeline.

How Netflix implemented their data pipeline (Keystone) -

3.1.4.3 Gurus and Blogs

3.1.4.4 Books and Papers

3.1.4.5 Software and Tools

  • Apache Kafka - an open-source stream processing platform.
  • Apache Spark - a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
  • Apache Flink - an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

Top Streaming Technologies

3.2 Realtime Architectures

The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. The reason is that the Hadoop framework is based on a simple programming model (MapReduce) and enables a computing solution that is scalable, flexible, fault-tolerant and cost effective. The main concern here is the speed of processing large datasets, in terms of both the waiting time between queries and the waiting time to run a program.

As discussed in the section above, the two popularly used architectures are the Lambda and Kappa architectures. The Lambda architecture typically uses Apache Spark for data processing, whereas the Kappa architecture typically uses Apache Flink. Both tools offer high throughput. Apache Flink can process data in real time, whereas Apache Spark processes data in near real time.

3.2.1 Spark Architecture

Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.

Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.

Fig: Spark Architecture

Spark uses a micro-batching computational model and has medium latency; it processes data in near real time.
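
The micro-batch interval is what sets Spark's latency floor. Below is a hedged Structured Streaming sketch that makes the interval explicit; the socket source and the 10-second trigger are illustrative assumptions.

```python
# Sketch: Spark Structured Streaming processes data in micro-batches; the
# trigger interval below bounds how fresh results can be (near real time).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch-sketch").getOrCreate()

# Illustrative source: lines arriving on a local socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

counts = lines.groupBy("value").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="10 seconds")   # run a micro-batch every 10s
         .start())

query.awaitTermination()
```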

3.2.2 Flink Architecture

Fig: Flink Architecture

An alternative to Apache Spark is Apache Flink. It uses a stream processing model and has lower latency; it processes data in real time because of its true stream processing engine.

Flink supports true streaming, both at the API and at the runtime level.

3.2.3 References for further study

3.2.3.1 Websites

3.2.3.2 Videos

3.2.3.3 Software and Tools

  • Apache Spark - a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
  • Apache Flink - an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

3.3 Data Lake Design

3.3.1 What is Data Lake

A Data Lake is an object-based repository that stores large amounts of raw data, whether structured, semi-structured or unstructured, in its native format. Data can flow into the Data Lake through either batch processing or real-time processing of streaming data. As opposed to a data warehouse that stores data in files or folders, a Data Lake is marked by its flat, horizontal architecture in which each data element is assigned a unique identifier and is tagged with extended meta tags.

Data lakes usually align with an “ELT” strategy, which means we can Extract and Load data into the data lake in its original format, then Transform it later if a need presents itself.

Fig: Data Lake
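
To make the ELT idea above concrete, here is a minimal sketch (using boto3 against Amazon S3 purely as an example object store) that loads a raw record into the lake in its native format, assigns it a unique identifier, and tags it with metadata for later discovery; the bucket name, key layout, and tags are illustrative assumptions.

```python
# ELT sketch: Extract and Load raw data into the lake as-is, tag it with
# metadata for discovery, and Transform only later when a need arises.
import json
import uuid
from datetime import date

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"                       # assumed bucket name

raw_record = {"user_id": "user-42", "action": "play", "track": "song-1"}

# Load in native format, keyed by a unique identifier and partitioned by date.
object_id = str(uuid.uuid4())
key = f"raw/events/dt={date.today().isoformat()}/{object_id}.json"

s3.put_object(
    Bucket=BUCKET,
    Key=key,
    Body=json.dumps(raw_record).encode("utf-8"),
    Metadata={                                     # extended meta tags for search
        "source": "music-app",
        "schema": "event-v1",
        "sensitivity": "internal",
    },
)
# Transformation (cleansing, conforming, aggregating) happens later, reading
# back from the raw zone only when a concrete need presents itself.
```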

Some of the benefits of a data lake include-

  • Ability to derive value from unlimited types of data.
  • Ability to store all types of structured and unstructured data in a data lake, from CRM data to social media posts.
  • More flexibility—you don't have to have all the answers up front.
  • Ability to store raw data—you can refine it as your understanding and insight improves.
  • Unlimited ways to query the data.
  • Application of a variety of tools to gain insight into what the data means.
  • Elimination of data silos.
  • Democratized access to data via a single, unified view of data across the organization when using an effective data management platform.

But without the right storage infrastructure, data management, architecture, and personnel, a data lake can become a data swamp.

All the transactional data, logs and events in their raw form constitute the data lake. It supports the concept of real-time archiving.

3.3.2 Components of a data lake

  • API & UI - An API and user interface that expose these features to internal and external users.
  • Entitlements - A robust set of security controls - governance through technology, not policy.
  • Catalogue & Search - A search index and workflow which enables data discovery.
  • Collect & Store - A foundation of highly durable data storage and streaming of any type of data.

3.3.3 Data Lake - Hadoop As The Storage

While a Data Lake can certainly be built on relational database architectures, it gained popularity with Hadoop primarily because Hadoop is an open-source platform and, according to Hadoop adopters, provides a less expensive repository for analytics data. A Hadoop Data Lake architecture can also be used to complement an existing data warehouse rather than replace it.

Fig: Data Lake in Hadoop
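
As a brief sketch of using Hadoop (HDFS) as the lake's storage layer, the PySpark snippet below lands data in a raw zone partitioned by ingestion date and derives a curated zone later; the zone layout, paths, and column names are illustrative assumptions.

```python
# Sketch: landing raw data in an HDFS-backed data lake, organised into zones
# (raw vs. curated) and partitioned by ingestion date. Paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("hdfs-lake-sketch").getOrCreate()

events = spark.read.json("hdfs:///landing/events/")      # assumed landing area

# Raw zone: keep data in its (near) native form, partitioned for later pruning.
(events
 .withColumn("ingest_dt", current_date())
 .write.mode("append")
 .partitionBy("ingest_dt")
 .parquet("hdfs:///datalake/raw/events"))

# Curated zone: a later, optional transformation for analysts' convenience.
curated = events.select("user_id", "action").where("action IS NOT NULL")
curated.write.mode("overwrite").parquet("hdfs:///datalake/curated/user_actions")
```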

3.3.4 Data Lake - Amazon S3 As The Storage

Fig: Data Lake in AWS

A Data Lake solution on AWS, at its core, leverages Amazon Simple Storage Service (Amazon S3) for secure, cost-effective, durable, and scalable storage. We can quickly and easily collect data into Amazon S3 from a wide variety of sources by using services like AWS Import/Export Snowball or Amazon Kinesis Firehose delivery streams. Amazon S3 also offers an extensive set of features to help us provide strong security for our Data Lake, including access controls and policies, data transfer over SSL, encryption at rest, logging and monitoring, and more.

For the management of the data, we can leverage services such as Amazon DynamoDB and Amazon Elasticsearch Service to catalog and index the data in Amazon S3. Using AWS Lambda functions that are triggered directly by Amazon S3 in response to events such as new data being uploaded, we can easily keep our catalog up to date. With Amazon API Gateway, we can create an API that acts as a “front door” for applications to access data quickly and securely, authorizing access via AWS Identity and Access Management (IAM) and Amazon Cognito.

For analyzing and accessing the data stored in Amazon S3, AWS provides fast access to flexible and low-cost services, like Amazon Elastic MapReduce (Amazon EMR), Amazon Redshift, and Amazon Machine Learning, so we can rapidly scale any analytical solution. Example solutions include data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, and internet-of-things processing. By leveraging AWS, we can easily provision exactly the resources and scale we need to power any Big Data application, meet demand, and improve innovation.
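
As one small, hedged example of the catalog-maintenance piece described above, the sketch below shows an AWS Lambda handler, triggered by Amazon S3 object-created events, that records each new object in a DynamoDB table; the table name and item attributes are illustrative assumptions.

```python
# Sketch of an AWS Lambda function, triggered by S3 "object created" events,
# that keeps a simple data-lake catalog up to date in DynamoDB.
# Table name and item attributes are illustrative assumptions.
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("data-lake-catalog")        # assumed table name


def handler(event, context):
    for record in event.get("Records", []):
        s3_info = record["s3"]
        catalog.put_item(
            Item={
                "object_key": s3_info["object"]["key"],    # assumed partition key
                "bucket": s3_info["bucket"]["name"],
                "size_bytes": s3_info["object"].get("size", 0),
                "ingested_at": datetime.now(timezone.utc).isoformat(),
            }
        )
    return {"indexed": len(event.get("Records", []))}
```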

3.3.5 References for further study

3.3.5.1 Website

About Data Lake

About Data Lake Design

3.3.5.2 Whitepapers

3.3.5.3 Videos -

3.3.5.4 Gurus and Blogs

3.4 Data As Product

A data product is digital information that can be purchased. Data products turn the data assets a company already owns or can collect into a product designed to help a user solve a specific problem.

3.4.1 Revenue From Data

What many companies have failed to realize is that the raw data they possess could be cleansed, sliced and diced to meet the needs of data buyers. Companies benefit from data products in two ways:

  • Direct revenue: Charging consumers for access to the data and analysis.

  • Indirect revenue: Augmenting existing products or services, driving customer loyalty, generating cost savings or creating revenue through alternate channels. For example, if a user listens to music that features a particular actor, they might also be interested in that actor's movies, so advertisements for those movies can be shown in the music app.

3.4.2 References for further study

3.4.2.1 Websites

3.4.2.2 Whitepapers

3.5 Granular Authorization

Granularity in authorization refers to the level of detail used in the rules that decide whether to grant or deny access. If the authorization rule for a resource is based on a single check as per the business need (such as an associated role), it is coarse-grained. If the business requires additional details about the end user or actor, the current environment conditions (time, date), and so on before granting access, the authorization is more granular, or fine-grained.
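
A small, library-free sketch of the difference follows: the coarse-grained check looks only at an associated role, while the fine-grained check also weighs user attributes and environment conditions. The roles, attributes, and the business rule itself are illustrative assumptions.

```python
# Sketch: coarse-grained vs. fine-grained authorization checks.
# Roles, attributes, and the business rule are illustrative assumptions.
from datetime import time


def coarse_grained_allow(user):
    # Coarse: a single check on an associated role.
    return "analyst" in user["roles"]


def fine_grained_allow(user, resource, context):
    # Fine-grained: combine user attributes, resource attributes, and
    # environment conditions (here: department match and office hours).
    same_department = user["department"] == resource["owning_department"]
    office_hours = time(9, 0) <= context["time_of_day"] <= time(18, 0)
    return "analyst" in user["roles"] and same_department and office_hours


user = {"roles": ["analyst"], "department": "marketing"}
resource = {"owning_department": "marketing", "classification": "internal"}
context = {"time_of_day": time(14, 30)}

print(coarse_grained_allow(user))                      # True
print(fine_grained_allow(user, resource, context))     # True only during office hours
```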

3.5.1 Attribute Based Access Control

Attribute-based access control (ABAC) defines an access control paradigm whereby access rights are granted to users through the use of policies that combine attributes. It uses attributes as building blocks in a structured language that defines access control rules and describes access requests. Attributes are sets of labels or properties that can be used to describe all the entities that must be considered for authorization purposes. The ABAC project has designed and implemented tools for using Attribute-Based Access Control, a scalable authorization system based on formal logic. It maps principals to attributes and uses those attributes to make authorization decisions; for example, if user1 has the login attribute, the login program will allow them to log in. The project's library, libabac, is a base on which to build such tools.

3.5.2 Role Based Access Control

Role Based Access Control (RBAC) model provides access control based on the position an individual fills in an organization. When properly implemented, RBAC enables users to carry out a wide range of authorized tasks by dynamically regulating their actions according to flexible functions, relationships, and constraints. This is in contrast to conventional methods of access control, which grant or revoke user access on a rigid, object-by-object basis. In RBAC, roles can be easily created, changed, or discontinued as the needs of the enterprise evolve, without having to individually update the privileges for every user.
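
A tiny sketch of the indirection RBAC introduces follows: permissions attach to roles and users attach to roles, so changing what a role may do never requires touching individual users. All role and permission names are illustrative assumptions.

```python
# Sketch: RBAC maps users to roles and roles to permissions, so privileges are
# managed per role rather than per user. Names are illustrative assumptions.
ROLE_PERMISSIONS = {
    "data_analyst": {"dataset:read", "dashboard:create"},
    "data_engineer": {"dataset:read", "dataset:write", "pipeline:deploy"},
}

USER_ROLES = {
    "alice": {"data_analyst"},
    "bob": {"data_engineer"},
}


def is_authorized(user, permission):
    """Grant access if any of the user's roles carries the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))


print(is_authorized("alice", "dataset:write"))   # False
# Evolving the enterprise: granting analysts write access touches one role,
# not every individual analyst user.
ROLE_PERMISSIONS["data_analyst"].add("dataset:write")
print(is_authorized("alice", "dataset:write"))   # True
```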

3.5.3 Extensible Access Control Markup Language (XACML)

XACML (eXtensible Access Control Markup Language) is an open standard that is robust and flexible enough to express authorization policies independently of the application. The XACML policy language is as expressive as a natural language. XACML is an implementation of the ABAC model, and the most widely adopted one. OpenAZ, currently in the Apache Incubator, provides tools and libraries for ABAC and implements XACML 3.0.
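
To give a feel for how an application asks a XACML Policy Decision Point (PDP) for a decision, here is a hedged sketch of a request using the JSON Profile of XACML 3.0; the PDP endpoint URL, the custom department attribute, and the attribute values are illustrative assumptions.

```python
# Sketch: sending an authorization request to a XACML 3.0 PDP using the
# JSON Profile of XACML. The endpoint URL and attribute values are assumptions.
import requests

xacml_request = {
    "Request": {
        "AccessSubject": {"Attribute": [
            {"AttributeId": "urn:oasis:names:tc:xacml:1.0:subject:subject-id",
             "Value": "alice"},
            {"AttributeId": "urn:example:attributes:department",   # custom attribute (assumed)
             "Value": "marketing"},
        ]},
        "Action": {"Attribute": [
            {"AttributeId": "urn:oasis:names:tc:xacml:1.0:action:action-id",
             "Value": "read"},
        ]},
        "Resource": {"Attribute": [
            {"AttributeId": "urn:oasis:names:tc:xacml:1.0:resource:resource-id",
             "Value": "sales-report-2016"},
        ]},
    }
}

response = requests.post(
    "https://pdp.example.com/authorize",        # assumed PDP endpoint
    json=xacml_request,
    headers={"Content-Type": "application/xacml+json"},
)
# The PDP replies with Permit, Deny, NotApplicable, or Indeterminate.
print(response.json())
```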

3.5.4 References for further study

3.5.4.1 Website

3.5.4.2 White papers

3.5.4.3 Videos

3.5.4.4 Books and Papers

  • http://csrc.nist.gov/groups/SNS/rbac/documents/coyne-weil-13.pdf