Today’s businesses rely on real-time analytics to power decisions. Data and application architecture are increasingly merging as we create more real-time, data-enabled business capabilities while supporting advanced, data science-driven analytics. Application developers need to focus on contributing to the organizational data stream and selectively tapping into it. The data platform is therefore becoming ubiquitous. It needs to reduce friction by streamlining the use, management, and operations of complex data technologies like Kafka, Hadoop, and Spark. It should also expose a self-service ecosystem of curated data assets, addressing data strategy concerns such as data lake design, data ownership, and authorization.
Designed and implemented correctly, self-service data analytics delivers a distinct set of capabilities.
THE ROLE OF SELF-SERVICE DATA ANALYTICS IS OFTEN MISUNDERSTOOD. IT IS:
Not designed to replace enterprise ETL platforms, but rather to complement them with analytic insights delivered at the speed the business needs
Not visual analytics, but a conduit for accessing, cleansing, and blending data from multiple sources; performing advanced analytics; and sharing insights at scale
Not a niche tool for predictive or spatial analytics requiring deep knowledge of statistics, R, or GIS tools, but an intuitive platform that empowers users of all skill levels
Not a contributor to data chaos or the rise of shadow IT projects, but a platform that respects data governance and ensures that only the right users have permission to access the right data.
Analytics users spend the majority of their time either preparing their data for analysis or waiting for data to be prepared for them. Self-service data preparation is an iterative agile process for exploring, combining, cleaning and transforming raw data into curated datasets for data science, data discovery, and BI and analytics.
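As a minimal sketch of that iterative preparation loop, the Python example below explores, cleans, combines, and transforms two hypothetical raw extracts into a curated dataset with pandas; the file and column names are placeholders rather than part of any specific platform.

```python
# A minimal self-service data preparation sketch using pandas.
# File names and column names (transactions.csv, customers.csv, etc.)
# are hypothetical placeholders.
import pandas as pd

# Explore: load the raw extracts.
transactions = pd.read_csv("transactions.csv", parse_dates=["order_date"])
customers = pd.read_csv("customers.csv")

# Clean: drop records with missing keys and normalize a text column.
transactions = transactions.dropna(subset=["customer_id", "amount"])
customers["region"] = customers["region"].str.strip().str.title()

# Combine: blend the two sources into a single view.
curated = transactions.merge(customers, on="customer_id", how="left")

# Transform: derive a field that downstream BI or data science can reuse.
curated["is_large_order"] = curated["amount"] > curated["amount"].median()

# Publish the curated dataset for discovery and analytics.
curated.to_csv("curated_orders.csv", index=False)
```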
Business demands for faster and deeper insights from a broader range of data sources have driven the rapid growth of modern BI and analytics (BI&A) platforms, including more-pervasive self-service, and expanded adoption of advanced analytics platforms used by both specialist and citizen data scientists. However, as more and more users want to rapidly analyze complex combinations of data (for example, internal transaction and clickstream data, with weather or other open data or premium datasets), existing data integration approaches are often too rigid and time-consuming to keep up with demand. The escalating challenges associated with larger and more-diverse data are further contributing to the demand for adaptive and easy-to-use data preparation tools. Some of these challenges include:
As organizations accelerate their plans to become more agile and flexible, the need to quickly prepare and explore data and garner operational insights has become a key imperative. These challenges have made data preparation one of the biggest roadblocks to pervasive and trusted modern analytics.
Self-service data analytics is typically divided into the following 5 key areas:
A data pipeline is a unified system for capturing events for analysis and building products. The following capabilities are needed as part of the data pipeline for a self-service data analytics system.
One example of a data pipeline design is Keystone, used by Netflix. Its architecture is similar to the Lambda architecture.
Keystone is a Kafka-fronted data pipeline containing two Kafka clusters: a fronting Kafka and a secondary Kafka. The fronting Kafka forwards events to the routing service, which is responsible for moving data from the fronting Kafka to various sinks: S3, Elasticsearch, and the secondary Kafka, which forwards the data on for further processing. It can handle up to -
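As a rough illustration of the routing idea (not Netflix’s actual Keystone code), the sketch below consumes events from a fronting topic and fans them out to destination topics with the kafka-python client; the broker addresses, topic names, and routing rule are assumed for the example.

```python
# Minimal routing-service sketch with kafka-python (not Keystone itself).
# Broker addresses, topic names, and the routing rule are hypothetical.
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "fronting-events",                       # topic on the fronting Kafka
    bootstrap_servers="fronting-kafka:9092",
    group_id="routing-service",
)
producer = KafkaProducer(bootstrap_servers="secondary-kafka:9092")

def route(event_value: bytes) -> str:
    """Pick a destination topic per event; a real router also writes to
    S3 and Elasticsearch sinks."""
    return "events-clickstream" if b"click" in event_value else "events-default"

for record in consumer:
    producer.send(route(record.value), record.value)
```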
In his talk, Joe Croback walks through how a data pipeline is created and explains the different components of a data pipeline.
A real-time data platform is designed to anticipate change; new technologies underpin the architecture to enable the agile development of dynamically changing data requirements. The next-generation real-time data management platform needs:
An in-memory data platform to deliver data at the speed of thought. Firms need in-memory data to support the new generation of business applications; it’s critical to enabling real-time data access, processing big data quickly, offering new customer experiences, and serving customers in all of their mobile moments. Data stored in-memory can be accessed orders of magnitude faster than data stored on traditional disks. Top vendors that support in-memory technologies for customer data management platforms include GigaSpaces, IBM, Oracle, SAP, SAS, Software AG, and VMware.
Data virtualization to enable real-time integration of disparate sources. Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data. It integrates internal and external data sources such as Hadoop, NoSQL, and enterprise DW platforms; packaged, custom, mainframe, and legacy apps; and social platforms.
Hadoop to store and process large data sets. Hadoop, an open source initiative under version 2.0 of the Apache license, delivers a distributed and scalable data processing platform to support big data. It supports the batch processing of analytics by parallel-processing very large sets of data, which can run into the hundreds of terabytes or even petabytes, using clusters of commodity servers.
The integration of disparate big data sources. Big data integration delivers a comprehensive, unified view of the business and its customers, employees, and products. Apache projects such as Camel, Flume, Forrest, HCatalog, Pig, and Sqoop provide open source building blocks for big data integration. In addition, traditional data integration vendors such as IBM, Informatica, Pentaho, SAP, SAS, and Talend extend their existing data integration platforms to support big data sources.
A semantic layer to maintain the context and business language of data. Semantic technologies shape and orchestrate data, continuously remodeling master data and metadata for relevant customer and business views. Graph and triple-store technologies from MarkLogic and Neo Technology (Neo4j) and open source projects such as Apache Giraph and Titan (hosted on GitHub) maintain data relationships. Data profiling tools from vendors such as Cambridge Semantics help explore and model data semantically.
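As a small, hypothetical illustration of maintaining data relationships in a graph store, the sketch below records a customer-to-product relationship in Neo4j using the official Python driver; the connection details, labels, and properties are assumptions for the example.

```python
# Sketch: maintaining data relationships in a graph store (Neo4j) via the
# official Python driver. URI, credentials, labels, and properties are
# hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_customer_to_product(tx, customer_id, product_id):
    # MERGE keeps the relationship idempotent as master data is remodeled.
    tx.run(
        "MERGE (c:Customer {id: $cid}) "
        "MERGE (p:Product {id: $pid}) "
        "MERGE (c)-[:PURCHASED]->(p)",
        cid=customer_id, pid=product_id,
    )

with driver.session() as session:
    session.execute_write(link_customer_to_product, "cust-42", "prod-7")
driver.close()
```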
The Agile development trend has reached data management, but not without stumbles. For example, a leading manufacturer of agricultural equipment transitioned from waterfall development to Agile development for BI. For too long, tech management’s response to business data needs was “we’ll slot you in.” The CIO’s organization favored large architectural investments over time-sensitive needs. However, the initial transition to Agile had the effect of pushing off much-needed architectural building blocks and data governance requirements, which was a drag on the value delivered. In the end, the tech management team had to adapt Agile development processes to account for both ad hoc and architectural requirements to again lower time-to-value for tactical projects and deliver faster on foundational platform investments.
Typical data management road maps emphasize who does what to which system. When the main goal of the project is the creation of a data platform, resources are overwhelmed and technical debt accrues. The hidden cost of thinking of the new solution as the final road map milestone is that data remains in silos and business insights remain inconsistent.
One global airline is taking a different approach to data investment. The head of customer intelligence and optimization established a framework that assessed how the type of analysis influenced the value realized and evaluated the impact of the analytics capabilities’ level of complexity on the insights received. The goal: to develop technical data capabilities that deliver more valuable insights faster while reducing the complexity needed for the team to gain those insights. This gives the customer data management team a better idea of how to prioritize investment and development and align operationally and strategically. If the strategy and road map don’t meet the customer intelligence and optimization team’s expectations, the project is further evaluated or scrapped before it starts.
Data management frameworks have existed for decades but need to change to enable real-time data, real-time integration, on-demand scale, predictive analytics, and self-service data platforms. To define a data management road map, enterprise architects must:
Understand their business data. Most organizations struggle to know what data they have, where it’s located, where it came from, how it’s managed, what dependencies it has, and how it integrates with other systems. So start by tackling slow-moving data management frameworks: understand their metadata, interfaces, and application requirements, and modernize them by moving to new data management technologies such as in-memory, Hadoop, NoSQL, and data virtualization.
Separate applications from the data management tier. Applications should focus only on the application logic and user interface, not on data management or data integration functions. Decoupling the two helps move to a real-time data management platform faster and with less effort. Applications should focus on making generic data access calls to retrieve data from the real-time data management platform rather than hard-coding data access to diverse data sets. Consider deploying all new applications with such decoupled architecture.
Use distributed in-memory technology for performance and scale. Look at using distributed in-memory technology to achieve extremely high performance and scale for applications that need real-time data or faster access to critical data. Use a memory cache across physical servers to support distributed scale-out. Focus on distributed in-memory technology that supports all kinds of data — structured, semistructured, and unstructured — and offers a unified scale-out cache. Supplement the cache with disk-based protection for persistence to support data recoverability and long-term retention (a brief cache-aside sketch follows these recommendations).
Consider vendor solutions that help achieve faster time-to-value. Data management solutions can help reduce time-to-value by automating and simplifying various data management functions and implementation steps. Look at those that support broader solutions and can support your business data and applications. Ask your vendor how it plans to provide the real-time data management vision; review the various components that the vendor has integrated and ask how it plans to fill any gaps.
Leverage as-a-service offerings to lower platform/tools costs and scale dimensionally. Cloud is increasingly becoming a strategic launch pad for data capabilities, not just a lower-cost storage environment. Cloud is already a backbone for retailers, which use it for commerce and advertising to create best-in-class customer experiences and engagement, but now, companies that traditionally have lagged in direct buyer engagement are using this model as a template to jump-start their data management competencies. The automotive, durable goods, pharmaceutical, and consumer packaged goods verticals can buy data management and platform-as-a-service to scale out data services and customer intelligence across an omnichannel landscape in a fraction of the time and cost to build the capability on-premises.
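Returning to the distributed in-memory recommendation above, the following is a minimal cache-aside sketch that uses Redis as a stand-in for a distributed in-memory tier; the host, key scheme, and loader function are hypothetical, and persistence would come from the store’s own configuration rather than this code.

```python
# Cache-aside sketch for the distributed in-memory recommendation above.
# Host, key naming, and load_profile_from_store() are hypothetical.
import json
import redis

cache = redis.Redis(host="cache-cluster", port=6379)

def load_profile_from_store(customer_id: str) -> dict:
    # Placeholder for the slower system of record (warehouse, Hadoop, etc.).
    return {"id": customer_id, "segment": "unknown"}

def get_customer_profile(customer_id: str) -> dict:
    key = f"profile:{customer_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                  # fast path: in-memory hit
    profile = load_profile_from_store(customer_id)
    cache.set(key, json.dumps(profile), ex=3600)   # expire after an hour
    return profile
```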
A data lake is an object-based repository that stores large amounts of raw data, whether structured, semi-structured, or unstructured, in its native format. Data can flow into the data lake through either batch processing or real-time processing of streaming data. Unlike a data warehouse that stores data in files or folders, a data lake has a flat, horizontal architecture in which each data element is assigned a unique identifier and tagged with extended metadata tags.
Data lakes usually align with an “ELT” strategy: Extract and Load data into the data lake in its original format, then Transform it later if a need presents itself.
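A minimal ELT sketch against an S3-style object store with boto3 shows the pattern: the raw export is loaded untouched into the lake, and transformation happens later, only when a need arises. The bucket, keys, and file names are hypothetical.

```python
# ELT sketch: Extract and Load raw data into the lake as-is, Transform later.
# Bucket name, keys, and file names are hypothetical.
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Extract + Load: land the raw export in its native format, keyed by source and date.
with open("pos_export.csv", "rb") as raw:
    s3.put_object(Bucket="acme-data-lake",
                  Key="raw/pos/2017/06/01/pos_export.csv",
                  Body=raw)

# Transform (later, only if a need presents itself): read the raw object back
# and shape it into a curated view.
obj = s3.get_object(Bucket="acme-data-lake", Key="raw/pos/2017/06/01/pos_export.csv")
df = pd.read_csv(BytesIO(obj["Body"].read()))
daily_sales = df.groupby("store_id", as_index=False)["amount"].sum()
```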
Some of the benefits of a data lake include:
But without the right storage infrastructure, data management, architecture, and personnel, a data lake can become a data swamp.
All transactional data, logs, and events in their raw form constitute the data lake, which supports the concept of real-time archiving.
While a data lake can certainly be built on relational database architectures, it has gained popularity with Hadoop, primarily because Hadoop is an open source platform and, according to Hadoop adopters, provides a less expensive repository for analytics data. A Hadoop data lake architecture can also be used to complement a data warehouse rather than replace it.
Data products are the reason data scientists are lately treated like rockstars. They incorporate data science into the operation of a product or service, using data in smart ways to provide value. It’s more than just analysis: it’s putting insight into production. Every day we use the archetypal data product, Google search, and our every interaction with the service makes it better. Another famous early data product is LinkedIn’s “people you may know” feature, helping you locate people in your social networks.
Data products can use data from their own usage to improve
By observing how users interact with your product, you can learn a lot. By instrumenting the user interface, analyzing logs, or deriving data from users in other ways, you can gain extra signals that help improve your data modeling.
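As a simple, hypothetical illustration of instrumentation, the sketch below emits each user interaction as a structured event that downstream jobs can parse as extra signals; the field names and log sink are assumptions.

```python
# Sketch: emit structured usage events so the data product can learn from its
# own use. Field names and the log sink are hypothetical.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="interaction_events.log", level=logging.INFO,
                    format="%(message)s")

def track_event(user_id: str, action: str, item_id: str) -> None:
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "action": action,           # e.g. "click", "search", "connect"
        "item_id": item_id,
    }
    logging.info(json.dumps(event))  # downstream jobs parse these lines as signals

track_event("user-123", "click", "recommendation-7")
```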
Data products are bootstrapped and then evolve
Good data products are rarely “done”—through usage and continued investigation, you start to understand better the problem that you’re trying to solve. One of the characteristics of working with data is that it’s best to work in an agile way: often you don’t even know the right question to ask until you’ve explored the problem space. Get a product in use early, then learn, adapt, and evolve the product.
Data products are best built with nimble, multifunctional teams
The rapid cycle of product evolution is best served by a multifunctional team of data scientists, engineers, product managers, and architects. To move fast with data, data scientists need to get data from engineers, and insights and discoveries from data science inform product direction. If these people are in disconnected departments, product development moves slowly and can be defeated by poor communication.
Multi-source data, because “GIGO” still applies
Every student learns that “garbage-in, garbage-out” is true of computer systems, and data products aren’t any different. If you don’t have good data going in, you won’t get a good result. However, that doesn’t mean you throw weak data away. Instead, by using as many diverse data sources as possible, you can create models that are robust in the face of any of the individual sources failing or being erroneous.
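One simple way to build that robustness is to combine estimates from several independent sources and take a median, so a failing or erroneous feed cannot dominate the result; the source functions below are hypothetical stand-ins.

```python
# Sketch: blend several independent sources so one bad or missing feed
# doesn't wreck the result. Source functions are hypothetical stand-ins.
from statistics import median

def demand_from_transactions():  # internal transaction data
    return 120.0

def demand_from_clickstream():   # clickstream-derived estimate
    return 135.0

def demand_from_weather():       # open or premium external data
    raise ConnectionError("feed unavailable")

def estimate_demand() -> float:
    estimates = []
    for source in (demand_from_transactions, demand_from_clickstream, demand_from_weather):
        try:
            estimates.append(source())
        except Exception:
            continue                 # tolerate an erroneous or missing source
    return median(estimates)         # robust to any single outlier or failure

print(estimate_demand())             # 127.5 with the weather feed down
```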
Data products can learn things from a system that’s otherwise closed
One of the most exciting aspects of data science is that we can use observed data signals to predict the behavior of a system that we can’t directly access or comprehend. Without understanding the semantic import of every web page, search engines can still figure out which pages are most useful. If you can use data, you can crack pretty much any problem area you want. That’s why Silicon Valley companies are challenging the grocery, taxi, and entertainment industries, to name just a few.
Data products solve a real problem that people have
Technology is important, for sure. It can often make new things possible, and transform whole industries. But for successful products and companies, it’s always the problem that comes first. A great data product focuses relentlessly on solving the problem that the user has, using whatever data and techniques will help. With today’s proliferating options of platforms, tools, and languages, the only practical way to navigate these options is with a laser focus on how they can help solve the human and business problems at hand.
Granularity in authorization refers to the level of detail used in the authorization rules that evaluate a decision to grant or deny access. If the authorization rule for a resource is based on a single check (such as an associated role), it is coarse-grained. If the business requires more detail about the end user or actor, current environment conditions (time, date), and so on to grant access, the authorization is more granular, or fine-grained.
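The contrast can be sketched in a few lines of Python; the roles, attributes, and business-hours window below are hypothetical, purely to show the difference in granularity.

```python
# Contrast of coarse-grained vs. fine-grained authorization checks.
# Roles, attributes, and the business-hours window are hypothetical.
from datetime import datetime

def coarse_grained_allow(user_roles: set) -> bool:
    # Coarse: a single check on an associated role.
    return "analyst" in user_roles

def fine_grained_allow(user: dict, now: datetime) -> bool:
    # Fine-grained: role plus user attributes and environment conditions.
    return (
        "analyst" in user["roles"]
        and user["department"] == "finance"
        and user["clearance"] >= 2
        and 8 <= now.hour < 18          # only during business hours
    )

user = {"roles": {"analyst"}, "department": "finance", "clearance": 3}
print(coarse_grained_allow(user["roles"]))
print(fine_grained_allow(user, datetime.now()))
```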
Attribute-based access control (ABAC) defines an access control paradigm whereby access rights are granted to users through policies that combine attributes. It uses attributes as building blocks in a structured language that defines access control rules and describes access requests. Attributes are sets of labels or properties that describe all the entities that must be considered for authorization purposes. The ABAC project has designed and implemented tools for attribute-based access control, a scalable authorization system based on formal logic. It maps principals to attributes and uses the attributes to make an authorization decision; for example, if user1 has the login attribute, the login program will allow them to log in. Its library, libabac, is a base on which to build those tools.
The Role-Based Access Control (RBAC) model provides access control based on the position an individual fills in an organization. When properly implemented, RBAC enables users to carry out a wide range of authorized tasks by dynamically regulating their actions according to flexible functions, relationships, and constraints. This is in contrast to conventional methods of access control, which grant or revoke user access on a rigid, object-by-object basis. In RBAC, roles can be easily created, changed, or discontinued as the needs of the enterprise evolve, without having to individually update the privileges of every user.
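A minimal RBAC sketch with hypothetical roles and permissions: permissions attach to roles, users are assigned roles, and changing a role updates every user who holds it without touching individual accounts.

```python
# Minimal RBAC sketch: users -> roles -> permissions.
# Role and permission names are hypothetical.
ROLE_PERMISSIONS = {
    "data_steward": {"dataset:read", "dataset:tag"},
    "data_engineer": {"dataset:read", "dataset:write", "pipeline:deploy"},
}

USER_ROLES = {
    "alice": {"data_steward"},
    "bob": {"data_engineer"},
}

def has_permission(user: str, permission: str) -> bool:
    # A user is authorized if any of their roles grants the permission.
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

# Evolving the enterprise: change the role once, every holder is updated.
ROLE_PERMISSIONS["data_steward"].add("dataset:approve")

print(has_permission("alice", "dataset:approve"))  # True
print(has_permission("bob", "dataset:tag"))        # False
```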
XACML (eXtensible Access Control Markup Language) is an open standard robust and flexible enough to express authorization policies independently of the applications they govern. The XACML policy language is as expressive as a natural language. XACML is an implementation of the ABAC model, and the most widely adopted one. OpenAZ, currently in the Apache Incubator, provides tools and libraries for ABAC and is implemented on XACML 3.0.