3. Self Service Data

3.1 Introduction

Today’s businesses rely on real-time analytics to power decisions. Data and application architecture are increasingly merging as we create more real-time, data-enabled business capabilities while supporting advanced, data science-driven analytics. Application developers need to focus on contributing to the organizational data stream and selectively tapping into it. The data platform is therefore becoming ubiquitous. It needs to reduce friction by streamlining the use, management, and operations of complex data technologies like Kafka, Hadoop, and Spark. It should also expose a self-service ecosystem of curated data assets, addressing data strategy concerns such as data lake design, data ownership, and authorization.


Designed and implemented correctly, self-service data analytics delivers the following capabilities:

Fig: Self-service data key outcomes

3.1.1 What Self-Service Data Is Not

The role of self-service data analytics is often misunderstood. It is:

  • Not designed to replace enterprise ETL platforms, but rather to complement them, delivering analytic insights at the speed the business needs

  • Not visual analytics, but a conduit for accessing, cleansing, and blending data from multiple sources; performing advanced analytics; and sharing insights at scale

  • Not a niche tool for predictive or spatial analytics requiring deep knowledge of statistics, R, or GIS tools, but an intuitive platform that empowers users of all skill levels

  • Not a contributor to data chaos or shadow IT projects, but a platform that respects data governance and ensures that only the right users have permission to access the right data.

3.1.2 Self-Service Data Preparation

Analytics users spend the majority of their time either preparing their data for analysis or waiting for data to be prepared for them. Self-service data preparation is an iterative agile process for exploring, combining, cleaning and transforming raw data into curated datasets for data science, data discovery, and BI and analytics.
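As a concrete illustration, a minimal sketch of one such iterative preparation pass in Python with pandas might look like the following; the file names and columns are hypothetical, and a real self-service tool would wrap these steps in a visual, interactive workflow.

    import pandas as pd

    # Explore: load a raw extract and profile it quickly (hypothetical files and columns).
    orders = pd.read_csv("raw/orders.csv", parse_dates=["order_date"])
    print(orders.describe(include="all"))   # quick profile of every column
    print(orders.isna().mean())             # share of missing values per column

    # Combine: blend the transactions with a second source (customer master data).
    customers = pd.read_csv("raw/customers.csv")
    blended = orders.merge(customers, on="customer_id", how="left")

    # Clean: drop duplicates and normalise a messy categorical column.
    blended = blended.drop_duplicates(subset="order_id")
    blended["country"] = blended["country"].str.strip().str.upper()

    # Transform: derive a curated, analysis-ready dataset for BI and data science.
    curated = (
        blended.groupby(["country", pd.Grouper(key="order_date", freq="M")])["amount"]
        .sum()
        .reset_index(name="monthly_revenue")
    )
    curated.to_parquet("curated/monthly_revenue.parquet", index=False)

Each pass through steps like these typically raises new questions about the data, which is why the process is described as iterative rather than a one-off ETL job.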

Overview of Self-Service Data Preparation

Business demands for faster and deeper insights from a broader range of data sources have driven the rapid growth of modern BI and analytics (BI&A) platforms, including more-pervasive self-service, and expanded adoption of advanced analytics platforms used by both specialist and citizen data scientists. However, as more and more users want to rapidly analyze complex combinations of data (for example, internal transaction and clickstream data, with weather or other open data or premium datasets), existing data integration approaches are often too rigid and time-consuming to keep up with demand. The escalating challenges associated with larger and more-diverse data are further contributing to the demand for adaptive and easy-to-use data preparation tools. Some of these challenges include:

  • Inability to derive value from data due to a lack of understanding of the data.
  • A lack of trust in the data due to data quality issues and shortage of necessary metadata.
  • The variety of data formats, which impedes timely exploration.
  • Inability to share data due to the potential risk of identifying personal and sensitive information buried in the data assets.
  • An increasing amount of data from unknown origin.
  • More data coming from a broader set of sources inside and outside of the organization, including open data and third-party sources from data brokers.
  • The growing challenge of governing trusted, accurate results as distributed analytics content authoring by analysts, data scientists, and citizen data scientists expands across the enterprise.

As organizations accelerate their plans to become more agile and flexible, the need to prepare, explore, and garner operational insights faster has become a key imperative. These challenges have made data preparation one of the biggest roadblocks to pervasive and trusted modern analytics.

Self-service data analytics is typically divided into the following five key areas:

  1. 3.2 Data Pipeline
  2. 3.3 Realtime Architecture
  3. 3.4 Data Lake Design
  4. 3.5 Data as a Product
  5. 3.6 Granular Authorization

3.2 Data Pipeline

A data pipeline is a unified system for capturing events for analysis and for building data products. The following capabilities are needed as part of a data pipeline for a self-service data analytics system (a minimal cataloging sketch follows the list):

  1. Data exploration and profiling: A visual environment that enables users to interactively prepare, search, sample, profile, catalog and inventory data assets, as well as tag and annotate data for future exploration. Advanced features include autoinference, discovering and suggesting sensitive attributes, identifying commonly used attributes (for example, geodata, product ID), doing semantic reconciliation, discovering and recording data lineage of transformations, and autorecommending sources to enrich the data.
  2. Collaboration: Facilitates the sharing of queries and datasets, including publishing, sharing and promoting models with governance features, such as dataset user ratings or official watermarking.
  3. Data transformation, blending and modeling: Supports data enrichment, data mashup and blending, data cleansing, filtering, and user-defined calculations, groups and hierarchies. This includes agile data modeling/structuring that allows users to specify data types and relationships. More-advanced capabilities automatically deduce or infer the structure from the data source, and generate semantic models and ontologies, such as logical data models and Hive schemas.
  4. Data curation and governance: Supports workflow for data stewardship and capabilities for data encryption, user permissions and data lineage. This also includes security features that enable governance, such as data masking, platform authentication and security filtering at the user/group/role level, as well as through integration with corporate LDAP and/or Active Directory systems, SSO, source system security inheritance, row- and column-level security, and logging and monitoring of data usage and assets.
  5. Metadata repository and cataloging: Supports creating and searching metadata, cataloging of data sources, transformations, user activity against the data source, data source attributes, data lineage and relationships, and APIs to enable access to the metadata catalog for auditing or other uses. Through the use of analytics on the raw data, the models are derived and generated bottom up instead of designed top down. It is a continuous process of accumulating metadata based on the actual use of data. It is a living construct. This is the key difference from the ontologies and enterprise data models of the 1980s and 1990s, which were too holistic and complex to centrally design upfront. This is also a difference from data warehouse automation tools, such as Kalido and WhereScape, which automate the data warehouse development process and life cycle.
  6. Machine learning: Use of machine learning and artificial intelligence (AI) to automate and improve the self-service data preparation process.
  7. Deployment models: Platforms can be deployed either in the cloud, on-premises, or across both cloud and on-premises. This latter hybrid approach allows users to leave data on-premises in place for processing, rather than moving it to the self-service data preparation platform either in the cloud or on-premises.
  8. Domain- or vertical-specific offerings or templates: Packaged templates or offerings for domain- or vertical-specific data and models that can further accelerate time to data preparation and insight. This is particularly helpful for a number of difficult-to-use syndicated datasets.
  9. Data source access and connectivity: APIs and standards-based connectivity, including native access to cloud application and data sources, enterprise on-premises data sources, relational and unstructured data, NoSQL, Hadoop, and various file formats (XML, JSON, .csv), as well as native access to open, premium or curated data.
  10. Integration with BI&A and advanced analytics platforms: The ability to integrate harmonized, curated datasets with BI&A and advanced analytics platforms through APIs or native support for partner file formats (for example, .tde for Tableau Software, .qvd for Qlik and .pbix for Microsoft Power BI).
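To make the metadata repository and cataloging capability more concrete, the following is a minimal sketch, in Python, of a toy in-memory catalog that records a curated dataset together with its owner, tags, and lineage, and makes it searchable. The structure and field names are illustrative assumptions, not a reference to any particular product.

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class CatalogEntry:
        """Illustrative metadata record for one data asset (field names are assumptions)."""
        name: str
        location: str
        owner: str
        tags: list = field(default_factory=list)
        lineage: list = field(default_factory=list)   # upstream assets it was derived from
        registered_at: datetime = field(default_factory=datetime.utcnow)

    class MetadataCatalog:
        """A toy catalog supporting registration, tagging, and keyword search."""

        def __init__(self):
            self._entries = {}

        def register(self, entry: CatalogEntry):
            self._entries[entry.name] = entry

        def search(self, keyword: str):
            keyword = keyword.lower()
            return [e for e in self._entries.values()
                    if keyword in e.name.lower()
                    or any(keyword in t.lower() for t in e.tags)]

    catalog = MetadataCatalog()
    catalog.register(CatalogEntry(
        name="curated.monthly_revenue",
        location="s3://analytics/curated/monthly_revenue.parquet",
        owner="finance-analytics",
        tags=["revenue", "monthly", "curated"],
        lineage=["raw.orders", "raw.customers"],
    ))
    print([e.name for e in catalog.search("revenue")])

In a real platform this metadata would be accumulated continuously from actual usage of the data, as described above, rather than entered by hand.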

3.2.1 Netflix's Data Pipeline - Keystone

One example of a data pipeline design is Keystone, used by Netflix. Its architecture is similar to the Lambda architecture.


Fig: Keystone - Netflix's data pipeline

It is a Kafka-fronted data pipeline with two Kafka tiers: a fronting Kafka cluster and a secondary Kafka cluster. The fronting Kafka cluster forwards events to the routing service, which is responsible for moving data from fronting Kafka to the various sinks: S3, Elasticsearch, and secondary Kafka, which in turn feeds downstream data processing. The pipeline can handle up to:

  • 500 billion events and 1.3 PB per day
  • 8 million events and 24 GB per second during peak hours.
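Conceptually, the routing service can be pictured as a consumer that reads from the fronting Kafka topics and fans each event out to the configured sinks. The sketch below is a rough simplification in Python using the kafka-python client; the topic names, broker addresses, and sink placeholders are assumptions, not Netflix's actual implementation.

    import json
    from kafka import KafkaConsumer, KafkaProducer   # kafka-python client

    consumer = KafkaConsumer(
        "events",                                     # hypothetical fronting topic
        bootstrap_servers=["fronting-kafka:9092"],
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    secondary = KafkaProducer(
        bootstrap_servers=["secondary-kafka:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def write_to_s3(event):
        """Placeholder for an S3 sink (a real router would batch object writes)."""

    def write_to_elasticsearch(event):
        """Placeholder for an Elasticsearch sink (a real router would bulk-index)."""

    # Fan each event out to the configured sinks.
    for message in consumer:
        event = message.value
        write_to_s3(event)                            # long-term storage / batch analytics
        write_to_elasticsearch(event)                 # search and operational dashboards
        secondary.send("events-for-processing", value=event)   # hand-off to stream processing

At Keystone's scale the routing layer also has to deal with batching, back pressure, and failure isolation between sinks, which this sketch leaves out.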

3.2.2 References for further study

3.2.2.1 Websites

3.2.2.2 Videos

Joe Croback speaks about how a data pipeline is created and explains the different components of a data pipeline.

How Netflix implemented their data pipeline (Keystone) -


3.2.2.3 Books and Papers

References:

  1. Gartner, "Market Guide for Self-Service Data Preparation," 25 August 2016, ID G00304870
  2. "The Unified Logging Infrastructure for Data Analytics at Twitter"
  3. "Big Data: Principles and Best Practices of Scalable Real-time Data Systems" by Nathan Marz and James Warren
  4. "Building Real-Time Data Pipelines: Unifying Applications and Analytics with In-Memory Architectures"
  5. "Modern Data Pipelines"
  6. "The Data Pipeline"

3.3 Realtime Architectures

A real-time data platform is designed to anticipate change; new technologies underpin the architecture to enable the agile development of dynamically changing data requirements. The next-generation real-time data management platform needs:

An in-memory data platform to deliver data at the speed of thought. Firms need in-memory data to support the new generation of business applications; it’s critical to enabling real-time data access, processing big data quickly, offering new customer experiences, and serving customers in all of their mobile moments. Data stored in-memory can be accessed orders of magnitude faster than data stored on traditional disks. Top vendors that support in-memory technologies for customer data management platforms include GigaSpaces, IBM, Oracle, SAP, SAS, Software AG, and VMware.

Data virtualization to enable real-time integration of disparate sources. Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data. It integrates internal and external data sources such as Hadoop, NoSQL, and enterprise DW platforms; packaged, custom, mainframe, and legacy apps; and social platforms.
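A data virtualization layer can be thought of as a thin query layer that presents disparate sources as one logical view without first copying the data into a warehouse. The toy sketch below, in Python with pandas and sqlite3, federates a relational table and a file in the lake behind a single "virtual view" function; the table, path, and column names are hypothetical.

    import sqlite3
    import pandas as pd

    def customers_from_warehouse():
        """Source 1: a relational system (hypothetical connection and table)."""
        conn = sqlite3.connect("warehouse.db")
        return pd.read_sql_query("SELECT customer_id, segment FROM customers", conn)

    def clickstream_from_lake():
        """Source 2: a file-based source in the data lake (hypothetical path)."""
        return pd.read_parquet("lake/clickstream/2024-01.parquet")

    def customer_activity_view():
        """The virtual view: joined on demand, nothing persisted or pre-copied."""
        return clickstream_from_lake().merge(customers_from_warehouse(),
                                             on="customer_id", how="left")

    # Consumers query the virtual view as if it were a single table.
    print(customer_activity_view().groupby("segment")["page_views"].sum())

Commercial data virtualization products layer query pushdown, caching, and security on top of this basic idea.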

Hadoop to store and process large data sets. Hadoop, an open source initiative under version 2.0 of the Apache license, delivers a distributed and scalable data processing platform to support big data. It supports the batch processing of analytics by parallel-processing very large sets of data, which can run into the hundreds of terabytes or even petabytes, using clusters of commodity servers.
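As an illustration of this batch style of processing, a minimal PySpark job that aggregates a large, partitioned set of event logs stored in HDFS might look like the sketch below; the paths and column names are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

    # Read a large, partitioned raw dataset from HDFS (hypothetical path and schema).
    events = spark.read.json("hdfs:///data/raw/events/2024/*/*")

    # Batch aggregation, executed in parallel across the cluster.
    daily_counts = (
        events.groupBy(F.to_date("timestamp").alias("day"), "event_type")
              .count()
    )

    daily_counts.write.mode("overwrite").parquet("hdfs:///data/curated/daily_event_counts")
    spark.stop()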

The integration of disparate big data sources. Big data integration delivers a comprehensive, unified view of the business and its customers, employees, and products. Apache projects such as Camel, Flume, Forrest, HCatalog, Pig, and Sqoop provide open source building blocks for big data integration. In addition, traditional data integration vendors such as IBM, Informatica, Pentaho, SAP, SAS, and Talend extend their existing data integration platforms to support big data sources.

A semantic layer to maintain the context and business language of data. Semantic technologies shape and orchestrate data, continuously remodeling master data and metadata for relevant customer and business views. Graph and triple-store technologies from MarkLogic and Neo Technology (Neo4j) and open source projects such as Apache Giraph and GitHub Titan maintain data relationships. Data profiling tools such as Cambridge Semantics help explore and model data semantically.

Fig: Self-service data realtime architectures

3.3.1 Strike The Right Balance Between Agile Development And Architecture

The Agile development trend has reached data management, but not without stumbles. For example, a leading manufacturer of agricultural equipment transitioned from waterfall development to Agile development for BI. For too long, tech management’s response to business data needs was “we’ll slot you in.” The CIO’s organization favored large architectural investments over time-sensitive needs. However, the initial transition to Agile pushed off much-needed architectural building blocks and data governance requirements, which became a drag on the value delivered. In the end, the tech management team had to adapt its Agile development processes to account for both ad hoc and architectural requirements, lowering time-to-value for tactical projects again while delivering faster on foundational platform investments.

3.3.2 Avoid “demographic” road maps that are an all-or-nothing proposition.

Typical data management road maps emphasize who does what to which system. When the main goal of the project is the creation of a data platform, resources are overwhelmed and technical debt accrues. The hidden cost of thinking of the new solution as the final road map milestone is that data remains in silos and business insights remain inconsistent.


3.3.3 Adopt value-based road maps to satisfy immediate and future business gains.

One global airline is taking a different approach to data investment. The head of customer intelligence and optimization established a framework that assessed how the type of analysis influenced the value realized and evaluated the impact of the analytics capabilities’ level of complexity on the insights received. The goal: to develop technical data capabilities that deliver more valuable insights faster while reducing the complexity needed for the team to gain those insights. This gives the customer data management team a better idea of how to prioritize investment and development and align operationally and strategically. If the strategy and road map don’t meet the customer intelligence and optimization team’s expectations, the project is further evaluated or scrapped before it starts.


3.3.4 Be Sure Your Real-Time Architecture Road Map Fulfills Five Design Principles

Data management frameworks have existed for decades but need to change to enable real-time data, real-time integration, on-demand scale, predictive analytics, and self-service data platforms. To define a data management road map, enterprise architects must:

Understand their business data. Most organizations struggle to know what data they have, where it’s located, where it came from, how it’s managed, what dependencies it has, and how it integrates with other systems. Start by tackling slow-moving data management frameworks to understand their metadata, interfaces, and application requirements, and modernize them with new data management technologies such as in-memory, Hadoop, NoSQL, and data virtualization.

Separate applications from the data management tier. Applications should focus only on the application logic and user interface, not on data management or data integration functions. Decoupling the two helps move to a real-time data management platform faster and with less effort. Applications should focus on making generic data access calls to retrieve data from the real-time data management platform rather than hard-coding data access to diverse data sets. Consider deploying all new applications with such decoupled architecture.
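One way to read this principle: the application depends only on an abstract data access contract, while the concrete implementation (cache, virtualized view, Hadoop, and so on) lives in the data management tier and can change independently. The following Python sketch is a hypothetical illustration of that decoupling.

    from abc import ABC, abstractmethod

    class CustomerRepository(ABC):
        """The only contract the application knows about."""

        @abstractmethod
        def get_customer(self, customer_id: str) -> dict:
            ...

    class RealTimePlatformRepository(CustomerRepository):
        """Lives in the data management tier; could be backed by a cache, a
        virtualized view, or Hadoop without the application changing."""

        def __init__(self, platform_client):
            self._client = platform_client   # injected data platform client (assumption)

        def get_customer(self, customer_id: str) -> dict:
            return self._client.fetch("customer", customer_id)

    def render_profile_page(repo: CustomerRepository, customer_id: str) -> str:
        """Application logic: only UI and business concerns, no data plumbing."""
        customer = repo.get_customer(customer_id)
        return f"{customer['name']} ({customer['segment']})"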

Use distributed in-memory technology for performance and scale. Look at using distributed in-memory to achieve extreme high performance and scale for applications that need real-time data or faster access to critical data. Use memory cache across physical servers to support distributed scale-out. Focus on distributed in-memory that supports all kinds of data — structured, semistructured, and unstructured — and offers a unified scale-out cache. Supplement the cache with disk-based protection for persistence to support data recoverability and long-term retention.
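A common way to apply this principle is the cache-aside pattern: reads are served from the distributed in-memory store when possible and fall back to the durable, disk-based store otherwise. In the sketch below, Redis is used purely as a stand-in for a distributed in-memory store; the hostname, key scheme, and TTL are assumptions.

    import json
    import redis   # Redis as a stand-in for a distributed in-memory store

    cache = redis.Redis(host="cache-cluster", port=6379)

    def load_from_disk_store(customer_id: str) -> dict:
        """Placeholder for the durable, disk-based system of record (assumption)."""
        return {"customer_id": customer_id, "segment": "gold"}

    def get_customer(customer_id: str) -> dict:
        """Cache-aside read: serve from memory when possible, fall back to disk,
        then repopulate the cache so subsequent reads stay fast."""
        key = f"customer:{customer_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        record = load_from_disk_store(customer_id)
        cache.setex(key, 3600, json.dumps(record))   # keep in memory for an hour
        return record

The disk-based store remains the source of truth for recoverability and long-term retention, as the principle above recommends.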

Consider vendor solutions that help achieve faster time-to-value. Data management solutions can help reduce time-to-value by automating and simplifying various data management functions and implementation steps. Look at those that support broader solutions and can support your business data and applications. Ask your vendor how it plans to provide the real-time data management vision; review the various components that the vendor has integrated and ask how it plans to fill any gaps.

Leverage as-a-service offerings to lower platform/tools costs and scale dimensionally. Cloud is increasingly becoming a strategic launch pad for data capabilities, not just a lower-cost storage environment. Cloud is already a backbone for retailers, which use it for commerce and advertising to create best-in-class customer experiences and engagement, but now, companies that traditionally have lagged in direct buyer engagement are using this model as a template to jump-start their data management competencies. The automotive, durable goods, pharmaceutical, and consumer packaged goods verticals can buy data management and platform-as-a-service to scale out data services and customer intelligence across an omnichannel landscape at a fraction of the time and cost of building the capability on-premises.

3.4 Data Lake Design

3.4.1 What Is a Data Lake?

A data lake is an object-based repository that stores large amounts of raw data, whether structured, semi-structured, or unstructured, in its native format. Data can flow into the data lake through either batch processing or real-time processing of streaming data. As opposed to a data warehouse, which stores data in files or folders, a data lake is marked by its flat, horizontal architecture, in which each data element is assigned a unique identifier and tagged with extended metadata.

Data lakes usually align with an “ELT” strategy: data is Extracted and Loaded into the data lake in its original format, then Transformed later if a need presents itself.


Fig: Data Lake
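A minimal sketch of the Extract-and-Load half of that strategy, using Python and boto3 against an S3-style object store, is shown below. The bucket name, key scheme, and metadata tags are illustrative assumptions; the Transform step is deliberately deferred until a consumer actually needs the data.

    import uuid
    import boto3

    s3 = boto3.client("s3")
    RAW_BUCKET = "acme-data-lake-raw"   # hypothetical bucket

    def land_raw_object(payload: bytes, source_system: str, content_type: str) -> str:
        """Extract & Load: store the data exactly as received, with a unique
        identifier and extended metadata tags. No transformation happens here."""
        object_id = str(uuid.uuid4())
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=f"{source_system}/{object_id}",
            Body=payload,
            Metadata={                       # extended meta tags for later discovery
                "source-system": source_system,
                "content-type": content_type,
            },
        )
        return object_id

    # Transform later, only if a need presents itself, for example with a Spark
    # or SQL job that reads the raw objects and writes a curated dataset.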

Some of the benefits of a data lake include:

  • Ability to derive value from unlimited types of data.
  • Ability to store all types of structured and unstructured data in a data lake, from CRM data to social media posts.
  • More flexibility—you don't have to have all the answers up front.
  • Ability to store raw data—you can refine it as your understanding and insight improves.
  • Unlimited ways to query the data.
  • Application of a variety of tools to gain insight into what the data means.
  • Elimination of data silos.
  • Democratized access to data via a single, unified view of data across the organization when using an effective data management platform.

But without the right storage infrastructure, data management, architecture, and personnel, a data lake can become a data swamp.

All the transactional data, logs, and events in their raw form constitute the data lake. It supports the concept of real-time archiving.

3.4.2 Components of a data lake

  • API & UI - An API and user interface that expose these features to internal and external users.
  • Entitlements - A robust set of security controls - governance through technology, not policy.
  • Catalogue & Search - A search index and workflow which enables data discovery.
  • Collect & Store - A foundation of highly durable data storage and streaming of any type of data.

3.4.3 Data Lake - Hadoop As The Storage

While a data lake can certainly be built on other architectures, including relational databases, it has gained popularity with Hadoop primarily because Hadoop is an open source platform and, according to Hadoop adopters, provides a less expensive repository for analytics data. A Hadoop data lake architecture can also be used to complement an enterprise data warehouse rather than replace it.

Data Lake Hadoop

Fig: Data Lake in Hadoop

3.4.5 References for further study

3.4.5.1 Website

About Data Lake

About Data Lake Design

3.4.5.2 Whitepapers

3.4.5.3 Videos

3.4.5.4 Gurus and Blogs

3.5 Data As A Product

3.5.1 What makes a data product?

Data products are the reason data scientists are lately treated like rockstars. They incorporate data science into the operation of a product or service, using data in smart ways to provide value. It’s more than just analysis: it’s putting insight into production. Every day we use the archetypal data product, Google search, and our every interaction with the service makes it better. Another famous early data product is LinkedIn’s “people you may know” feature, helping you locate people in your social networks.

Data products can feed their own usage data back in to improve

By observing how users interact with your product, you can learn a lot. Through instrumenting the user interface, analyzing logs, or other ways of deriving data from users, you can gain extra signals that help improve your data modeling.
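At its simplest, that instrumentation can be a hook in the product that records each interaction as a usage signal for later modeling, as in the small hypothetical Python sketch below.

    import json
    import time

    def track_interaction(user_id: str, item_id: str, action: str,
                          log_path: str = "events.log") -> None:
        """Instrumentation hook called by the UI; appends one usage signal per interaction."""
        event = {"ts": time.time(), "user_id": user_id, "item_id": item_id, "action": action}
        with open(log_path, "a") as f:
            f.write(json.dumps(event) + "\n")

    # A click on a recommended item becomes an implicit label that the
    # recommendation model can train on in its next iteration.
    track_interaction(user_id="u123", item_id="movie_42", action="clicked_recommendation")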

Data products are bootstrapped and then evolve

Good data products are rarely “done”—through usage and continued investigation, you start to understand better the problem that you’re trying to solve. One of the characteristics of working with data is that it’s best to work in an agile way: often you don’t even know the right question to ask until you’ve explored the problem space. Get a product in use early, then learn, adapt, and evolve the product.

Data products are best built with nimble, multifunctional teams

The rapid cycle of product evolution is best served by a multifunctional team of data scientists, engineers, product managers, and architects. To move fast with data, data scientists need to get data from engineers, and the insights and discoveries from data science inform product direction. If these people are in disconnected departments, product development moves slowly and can be defeated by poor communication.

Multi-source data, because “GIGO” still applies

Every student learns that “garbage-in, garbage-out” is true of computer systems, and data products aren’t any different. If you don’t have good data going in, you won’t get a good result. However, that doesn’t mean you throw weak data away. Instead, by using as many diverse data sources as possible, you can create models that are robust in the face of any of the individual sources failing or being erroneous.

Data products can learn things from a system that’s otherwise closed

One of the most exciting aspects of data science is that we can use observed data signals to predict the behavior of a system that we can’t directly access or comprehend. Without understanding the semantic import of every web page, search engines can still figure out which is most useful. If you can use data, you can crack pretty much any problem area you want. That’s why Silicon Valley companies are challenging the grocery, taxi, and entertainment industries, to name just a few.

Data products solve a real problem that people have

Technology is important, for sure. It can often make new things possible, and transform whole industries. But for successful products and companies, it’s always the problem that comes first. A great data product focuses relentlessly on solving the problem that the user has, using whatever data and techniques will help. With today’s proliferating options of platforms, tools, and languages, the only practical way to navigate these options is with a laser focus on how they can help solve the human and business problems at hand.

3.6 Granular Authorization

Granularity in authorization refers to the level of detail used in the authorization rules that determine whether to grant or deny access. If the authorization rule for accessing a resource is based on just a single check (such as the user’s associated roles), the authorization is coarse-grained. If the business needs require more detail about the end user or actor, the current environment conditions (time, date), and so on before granting access, then the authorization is more granular, or fine-grained.
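The difference can be sketched in Python as follows; the roles, attributes, and policy are illustrative assumptions. The coarse-grained check looks only at the caller's roles, while the fine-grained check also evaluates attributes of the user, the resource, and the environment, which is the essence of the ABAC model described next.

    from datetime import time as clock

    def coarse_grained_allow(user_roles: set) -> bool:
        """Coarse: a single check on the caller's associated roles."""
        return "analyst" in user_roles

    def fine_grained_allow(user: dict, resource: dict, env: dict) -> bool:
        """Fine-grained: combines user, resource, and environment attributes."""
        return (
            "analyst" in user["roles"]
            and user["department"] == resource["owning_department"]
            and resource["classification"] != "restricted"
            and clock(8, 0) <= env["time_of_day"] <= clock(18, 0)   # office hours only
        )

    user = {"roles": {"analyst"}, "department": "finance"}
    resource = {"owning_department": "finance", "classification": "internal"}
    env = {"time_of_day": clock(14, 30)}

    print(coarse_grained_allow(user["roles"]))       # True: the role alone is enough
    print(fine_grained_allow(user, resource, env))   # True only when all attributes match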

3.6.1 Attribute Based Access Control

Attribute-based access control (ABAC) is an access control paradigm whereby access rights are granted to users through policies that combine attributes. It uses attributes as building blocks in a structured language that defines access control rules and describes access requests. Attributes are sets of labels or properties that can be used to describe all the entities that must be considered for authorization purposes. The ABAC project has designed and implemented tools for using attribute-based access control as a scalable authorization system based on formal logic. It maps principals to attributes and uses those attributes to make an authorization decision; for example, if user1 has the login attribute, the login program will allow them to log in. Its library, libabac, is a base on which to build such tools.

3.6.2 Role Based Access Control

The Role Based Access Control (RBAC) model provides access control based on the position an individual fills in an organization. When properly implemented, RBAC enables users to carry out a wide range of authorized tasks by dynamically regulating their actions according to flexible functions, relationships, and constraints. This is in contrast to conventional methods of access control, which grant or revoke user access on a rigid, object-by-object basis. In RBAC, roles can be easily created, changed, or discontinued as the needs of the enterprise evolve, without having to individually update the privileges for every user.

3.6.3 Extensible Access Control Markup Language (XACML)

XACML (eXtensible Access Control Markup Language) is an open standard that is robust and flexible enough to define authorization policies independently of the applications they protect. The XACML policy language is expressive enough to state policies that would otherwise be written in natural language. XACML is an implementation of the ABAC model, and it is the most widespread implementation of ABAC. OpenAZ, currently in the Apache Incubator, provides tools and libraries for ABAC and is implemented on XACML 3.0.

3.6.4 References for further study

3.6.4.1 Website

3.6.4.2 White papers

3.6.4.3 Videos


3.6.4.4 Books and Papers