What data workers and business analysts need is a solution that balances ease of use with the capabilities afforded by traditional analytic tools. Making data available as a self-service to different stakeholders is therefore key to identifying patterns and problems and hence driving toward solutions. An analytics stack helps achieve this data intelligence; however, such a stack is resource hungry, requires special skills to find patterns, and takes time. Self-service data is one organizational capability that, if built, can keep the organization future ready. The platform should connect with various data sources, establishing a data pipeline with real-time data flow, and processing and archiving the relevant and derived information. Developers should be able to connect with these data streams to build insights. The data, together with the intelligence dug out of it, can then be offered as a product to various stakeholders in the system. Because this captures a lot of business-sensitive information and can reveal a lot about customers themselves, the platform should also provide easy-to-use access control.
References
The Definitive Guide to Self-Service Data Analytics by Alteryx - This article stresses the concept of self-service in data, citing speed, flexibility, ease of use and scalability as the driving factors behind it.
Self-Service Analytics And The Illusion Of Self-Sufficiency - The author explains the difference between self-sufficiency and self-service using a soda fountain analogy: the user is able to pick a can from the vending machine, but the organization has to ensure that sufficient resources exist in the machine.
Self-service data can therefore be broken down into the following five key areas, which are explained in the following sections.
A data pipeline is a unified system for capturing events for analysis and building products. The key components of a data pipeline are discussed below.
The lambda architecture is one of the most promising deployment methods for an enterprise's data pipeline nowadays. Another architecture that is often used is the kappa architecture.
Apache Kafka is an open-source stream processing platform. Kafka acts as a message queue where all the data from server logs is queued up and routed to either stream or batch processing. Apache Spark, a fast and general engine for big data processing with built-in modules for streaming, SQL, machine learning and graph processing, is a popular data processing engine that fits the lambda architecture well thanks to its strength in both batch and streaming computation.
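To make the lambda idea concrete, here is a minimal sketch (not a production pipeline) of how Spark can serve both layers: the speed layer consumes events from a Kafka topic with Structured Streaming, while the batch layer recomputes views over the archived history. The broker address, topic name, storage paths and field names are placeholders, and the Kafka source assumes the spark-sql-kafka connector is available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Speed layer: consume the same events in real time from Kafka.
stream_logs = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
               .option("subscribe", "server-logs")                   # placeholder topic
               .load())

# Approximate view: count events per minute and keep it updated in memory.
speed_counts = (stream_logs
                .groupBy(window(col("timestamp"), "1 minute"))
                .count())

query = (speed_counts.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("speed_view")
         .start())

# Batch layer: recompute exact counts over the archived history.
# The path and the "event_type" field are hypothetical.
batch_logs = spark.read.json("hdfs:///archive/server-logs/")
batch_counts = batch_logs.groupBy("event_type").count()
batch_counts.write.mode("overwrite").parquet("hdfs:///views/batch_counts/")
```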
Unlike the lambda architecture, the kappa architecture contains only two layers, a speed (real-time) layer and a serving layer; there is no batch layer. Apache Flink fits best in this architecture.
So when should we use one architecture or the other? As is often the case, it depends on some characteristics of the application that is to be implemented. Let’s go through a few common examples:
A very simple case to consider is when the algorithms applied to the real-time data and to the historical data are identical. Then it is clearly very beneficial to use the same code base to process historical and real-time data, and therefore to implement the use-case using the Kappa architecture.
Now, the algorithms used to process historical data and real-time data are not always identical. In some cases, the batch algorithm can be optimized thanks to the fact that it has access to the complete historical dataset, and so outperforms the implementation of the real-time algorithm. Here, choosing between Lambda and Kappa becomes a trade-off between batch execution performance and code base simplicity.
Finally, there are even more complex use-cases, in which even the outputs of the real-time and batch algorithms are different. For example, consider a machine learning application where generating the batch model requires so much time and so many resources that the best result achievable in real time is computing approximate updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used.
One example of a data pipeline design is Keystone, used by Netflix. Its architecture is similar to the lambda architecture.
It is also a Kafka-fronted data pipeline. It contains two Kafka clusters: a fronting Kafka and a secondary Kafka. The fronting Kafka forwards the events to the routing service. The routing service is responsible for moving data from the fronting Kafka to various sinks: S3, Elasticsearch, and the secondary Kafka, which forwards it for data processing. It can handle up to -
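Below is a toy sketch of the routing idea described above (not Netflix's Keystone code): events are consumed from a fronting Kafka topic and fanned out to a downstream sink, here a secondary Kafka topic. It uses the kafka-python client; the topic names and broker address are hypothetical.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Consume from the fronting Kafka (placeholder topic and broker).
consumer = KafkaConsumer(
    "fronting-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Produce to the secondary Kafka that feeds stream processing.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    # Route to the secondary topic; an S3 or Elasticsearch sink would be
    # additional branches of this loop in a fuller routing service.
    producer.send("secondary-events", value=event)
```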
The speaker, Joe Croback, describes how a data pipeline is created and explains the different components of a data pipeline.
Apache Kafka - It is an open-source stream processing platform.
Apache Spark - It is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Apache Flink - It is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. The reason is that the Hadoop framework is based on a simple programming model (MapReduce) and enables a computing solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is maintaining speed in processing large datasets, in terms of the waiting time between queries and the waiting time to run the program.
As discussed in the section above, the two popularly used architectures are the lambda and kappa architectures. The lambda architecture typically uses Apache Spark for data processing, whereas the kappa architecture uses Apache Flink. Both tools offer high throughput. Apache Flink can process data in real time, whereas Apache Spark processes data in near real time.
Spark was introduced by the Apache Software Foundation to speed up computational processing on Hadoop.
Spark can use Hadoop in two ways: one is for storage and the second is for processing. Since Spark has its own cluster management and computation, it uses Hadoop for storage purposes only.
Spark uses a micro-batching computation model and has medium latency. It processes data in near real time.
An alternative to Apache Spark is Apache Flink. It uses a stream processing model and has lower latency. It processes data in real time because of its true stream processing model.
Flink supports true streaming, both at the API and at the runtime level.
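As an illustration of this record-at-a-time style, here is a small PyFlink sketch in which each element is keyed and aggregated as it arrives, with no micro-batches. A bounded in-memory collection stands in for a real stream such as a Kafka topic, and the event names are made up.

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded source stands in for a real stream (e.g. a Kafka topic).
events = env.from_collection(
    ["click", "view", "click", "purchase"], type_info=Types.STRING())

# Each record is mapped and reduced as soon as it arrives.
pairs = events.map(
    lambda e: (e, 1),
    output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
counts = pairs.key_by(lambda p: p[0]).reduce(lambda a, b: (a[0], a[1] + b[1]))

counts.print()
env.execute("flink-true-streaming-sketch")
```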
A Data Lake is an object-based repository that stores large amounts of raw data, from structured and semi-structured to unstructured, in its native format. Data can flow into the Data Lake through either batch processing or real-time processing of streaming data. As opposed to a data warehouse, which stores data in files or folders, a Data Lake is marked by its flat, horizontal architecture, where each data element is assigned a unique identifier and is tagged with extended meta tags.
Data lakes usually align with an “ELT” strategy, which means we can Extract and Load data into the data lake in its original format, then Transform it later if a need presents itself.
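A small sketch of this ELT pattern, assuming Spark and hypothetical S3 paths and field names: raw events are landed in the lake unchanged, and a curated table is derived later, only when a downstream need appears.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: land the raw JSON in the lake in its original format.
raw = spark.read.json("s3a://landing-zone/clickstream/2024-01-01/")
raw.write.mode("append").json("s3a://data-lake/raw/clickstream/")

# Transform (later, on demand): derive a curated, analytics-friendly table.
# The "event_time", "user_id" and "event_type" fields are placeholders.
curated = (spark.read.json("s3a://data-lake/raw/clickstream/")
           .withColumn("event_date", to_date("event_time"))
           .select("user_id", "event_type", "event_date"))
curated.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://data-lake/curated/clickstream/")
```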
Some of the benefits of a data lake include-
But without the right storage infrastructure, data management, architecture, and personnel, a data lake can become a data swamp.
All the transactional data, logs and events in their raw form constitute the data lake. It supports the concept of real-time archiving.
While a Data Lake can certainly be aligned with other relational database architectures, it has gained popularity with Hadoop primarily because Hadoop is an open-source platform and, according to Hadoop adopters, it provides a less expensive repository for analytics data. A Hadoop Data Lake architecture can also be used to complement a data warehouse rather than replace it.
A Data Lake solution on AWS, at its core, leverages Amazon Simple Storage Service (Amazon S3) for secure, cost-effective, durable, and scalable storage. We can quickly and easily collect data into Amazon S3 from a wide variety of sources by using services like AWS Import/Export Snowball or Amazon Kinesis Firehose delivery streams. Amazon S3 also offers an extensive set of features to help us provide strong security for our Data Lake, including access controls and policies, data transfer over SSL, encryption at rest, logging and monitoring, and more.

For the management of the data, we can leverage services such as Amazon DynamoDB and Amazon Elasticsearch to catalog and index the data in Amazon S3. Using AWS Lambda functions that are triggered directly by Amazon S3 in response to events such as new data being uploaded, we can easily keep our catalog up to date. With Amazon API Gateway, we can create an API that acts as a “front door” for applications to access data quickly and securely, authorizing access via AWS Identity and Access Management (IAM) and Amazon Cognito.

For analyzing and accessing the data stored in Amazon S3, AWS provides fast access to flexible and low-cost services like Amazon Elastic MapReduce (Amazon EMR), Amazon Redshift, and Amazon Machine Learning, so we can rapidly scale any analytical solution. Example solutions include data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, and internet-of-things processing. By leveraging AWS, we can easily provision exactly the resources and scale we need to power any big data application, meet demand, and improve innovation.
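As a small illustration of the cataloging step described above, the following sketch shows an AWS Lambda handler, triggered by S3 object-created events, that records each new object in a DynamoDB table. The table name "data-lake-catalog" and its key schema are assumptions, not part of any AWS reference solution.

```python
import urllib.parse

import boto3

dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("data-lake-catalog")  # hypothetical catalog table


def handler(event, context):
    """Index every S3 object referenced in the incoming event."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        # One catalog entry per object; extended meta tags could be added here.
        catalog.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": size,
            "event_time": record["eventTime"],
        })
    return {"indexed": len(event["Records"])}
```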
A data product is digital information that can be purchased. Data products turn the data assets a company already owns or can collect into a product designed to help a user solve a specific problem.
What many companies have failed to realize is that the raw data they possess could be cleansed, sliced and diced to meet the needs of data buyers. Companies benefit from data products in two ways:
Direct revenue: Charging consumers for access to the data and analysis.
Indirect revenue: Augmenting existing products or services, driving customer loyalty, generating cost savings or creating revenue through alternate channels. For example, if a user listens to music featuring a particular actor, they might also be interested in that actor's movies, so an advertisement for the actor's movie can be shown in the music app.
Granularity in authorization means the level of detail used in the authorization rules that are evaluated to grant or deny access. If the authorization rule for access to a resource is, as per business need, based on just a single check (such as associated roles), then it is coarse-grained. If the business need requires more details about the end user or actor, the current environmental conditions (time, date), and so on to grant access, then the authorization is more granular and fine-grained.
Attribute-based access control (ABAC) defines an access control paradigm whereby access rights are granted to users through the use of policies that combine attributes. It uses attributes as building blocks in a structured language that defines access control rules and describes access requests. Attributes are sets of labels or properties that can be used to describe all the entities that must be considered for authorization purposes. The ABAC project has designed and implemented tools for using Attribute-Based Access Control, a scalable authorization system based on formal logic. It maps principals to attributes and uses the attributes to make an authorization decision; for example, if user1 has the login attribute, the login program will allow them to log in. This library, libabac, is a base on which to build those tools.
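The following toy sketch (it does not use libabac or its API) illustrates the ABAC idea: a decision is computed by evaluating policy rules over attributes of the subject, the resource and the environment. All attribute names and the example policy are made up.

```python
from datetime import time


def abac_decide(subject, resource, environment, policies):
    """Grant access if any policy rule evaluates to True for the given attributes."""
    return any(rule(subject, resource, environment) for rule in policies)


# Example policy: analysts may read sales datasets, but only during business hours.
policies = [
    lambda s, r, e: (s.get("department") == "analytics"
                     and r.get("classification") == "sales"
                     and s.get("clearance", 0) >= r.get("required_clearance", 0)
                     and time(9) <= e["time_of_day"] <= time(18)
                     and e.get("action") == "read"),
]

subject = {"department": "analytics", "clearance": 2}
resource = {"classification": "sales", "required_clearance": 1}
environment = {"time_of_day": time(14, 30), "action": "read"}

print(abac_decide(subject, resource, environment, policies))  # True
```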
Role Based Access Control (RBAC) model provides access control based on the position an individual fills in an organization. When properly implemented, RBAC enables users to carry out a wide range of authorized tasks by dynamically regulating their actions according to flexible functions, relationships, and constraints. This is in contrast to conventional methods of access control, which grant or revoke user access on a rigid, object-by-object basis. In RBAC, roles can be easily created, changed, or discontinued as the needs of the enterprise evolve, without having to individually update the privileges for every user.
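By contrast, a minimal RBAC check never looks at individual objects or environmental attributes; it only maps users to roles and roles to permissions, so changing a role's privileges updates every user holding that role. The role and permission names below are purely illustrative.

```python
# Role-to-permission and user-to-role assignments (illustrative values only).
ROLE_PERMISSIONS = {
    "data_engineer": {"pipeline:deploy", "dataset:read", "dataset:write"},
    "analyst": {"dataset:read", "dashboard:create"},
    "viewer": {"dashboard:view"},
}

USER_ROLES = {
    "alice": {"data_engineer"},
    "bob": {"analyst", "viewer"},
}


def has_permission(user, permission):
    """True if any of the user's roles grants the requested permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))


print(has_permission("bob", "dataset:read"))     # True
print(has_permission("bob", "pipeline:deploy"))  # False
```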
XACML (eXtensible Access Control Markup Language) is an open standard that is robust and flexible enough to express authorization policies independently of the application. The XACML policy language is as expressive as a natural language. XACML is an implementation of the ABAC model, and it is the most widespread implementation of ABAC. OpenAZ, currently in the Apache Incubator, provides tools and libraries for ABAC and is implemented on XACML 3.0.