- [Raf] One of the core capabilities of a data lake architecture is the ability to quickly and easily ingest multiple types of data, both in terms of structure and in terms of data flow, such as real-time streaming or bulk data assets from external platforms. You may already know the difference between batch and streaming data. In a nutshell, we usually think about streaming when we have a high volume of small bits of data that need to be processed or stored as the data arrives. But since ingesting a high volume of data at a lower latency may come with a cost, you can benefit from ingesting your data in batches if a longer time to insight can be afforded. AWS offers a whole suite of data ingestion services, and I strongly suggest you visit the course notes to get more information about them. But one thing is for sure, it is impossible to talk about data streaming on AWS without talking about Amazon Kinesis, because Kinesis is the most popular data streaming service in the AWS cloud. Amazon Kinesis is actually a family of data streaming services divided into Amazon Kinesis Data Streams, Kinesis Analytics, Kinesis Firehose, and Kinesis Video Streams. Let's first talk about Kinesis in its pure form, which is Kinesis Data Streams, and then I will talk about Firehose and Analytics.

This diagram shows a Kinesis Data Stream with the data producers on the left side and the data consumers on the right side. In this example, I am illustrating IoT sensors, web server logs, and clickstream data on the producer side, and EC2 instances and Lambda functions as consumers. The EC2 instances are doing some processing, and the Lambda functions are just storing data into S3. There are three important things I want to talk about in this diagram. First, Kinesis is data agnostic. Second, the data retention period and data consumption replay. And third, the pull-based mechanism. Understanding those concepts will make you powerful enough to explain what Kinesis is to all your family, friends, and co-workers. Isn't that nice?

The first one is the fact that Kinesis is data agnostic. You can see that by the different data producers sending data to the same Kinesis Data Stream. You can send JSON or XML, structured or unstructured data. If you send it properly using the Kinesis SDK, Kinesis will store it. You may want to create one Kinesis Data Stream per data type or concentrate data in the same Kinesis Data Stream, like I am doing in this diagram. The way you design your architecture is up to you.

The second aspect of the diagram is that data remains in the Kinesis Data Stream regardless of whether it has been consumed or not. Data stays there until it expires according to what we call a data retention period. That data retention period is 24 hours by default, but you can extend it by making service API calls, which would also increase the hourly cost of the Kinesis Data Stream. Keeping the data in Kinesis is valuable because multiple consumers can consume the same data at the same time. In addition to that, it allows replaying the data consumption in case of data consumer failures. Let me explain. Imagine you have an application consuming the data. If that application crashes, you can run it again and set a start position in the past. If that start position is within the data retention window configured in the Kinesis stream, your application can consume data added to the stream while it was offline.
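To make those first two points concrete, here is a minimal sketch in Python using the AWS SDK, boto3. It assumes a stream named "my-data-stream" already exists; the stream name and the payload fields are illustrative, not taken from the diagram.

```python
import datetime
import json
import time

import boto3

kinesis = boto3.client("kinesis")

# Producer side: Kinesis is data agnostic, so the payload can be JSON,
# XML, or any binary blob. The partition key decides the target shard.
kinesis.put_record(
    StreamName="my-data-stream",
    Data=json.dumps({"sensor_id": "iot-42", "temperature": 21.5}).encode("utf-8"),
    PartitionKey="iot-42",
)

# Consumer side: replay from a start position in the past. As long as
# the timestamp falls within the stream's retention period, records
# added while the consumer was offline can still be read.
shard_id = kinesis.list_shards(StreamName="my-data-stream")["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-data-stream",
    ShardId=shard_id,
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime.datetime.now() - datetime.timedelta(hours=1),
)["ShardIterator"]

while iterator:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["Data"])  # each replayed record, as raw bytes
    if response["MillisBehindLatest"] == 0:
        break  # caught up with the tip of the stream
    iterator = response.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard read limits
```

Note that this sketch reads a single shard for simplicity; real consumers typically use the Kinesis Client Library, which handles shard discovery and checkpointing for you, and which brings us to the next point.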
The third thing I want you to notice in the diagram is the direction of the arrows on the data consumer side, on the right side of the diagram. In some diagrams, you may see arrows pointing from left to right because that's the overall direction of the data flow. But I think it is more technically accurate to point the arrows from the consumers to Kinesis, not from Kinesis to the consumers, and here is why. Kinesis does not work as a push-based delivery mechanism. If a consumer wants to consume Kinesis data, it must initiate a connection using the Kinesis Client Library, available for most popular programming languages. That code can run on-premises, from within an EC2 instance, from your laptop, or from within a Lambda function.

People usually write Kinesis consumers for two main reasons: one, getting the data and placing it somewhere, such as a storage service; and two, performing some real-time analysis of the data that is passing through the stream. Speaking about the first one, what would be the most popular data storage service in AWS? Let me give you a couple of seconds. I would agree with you if you said Amazon S3. AWS realized that most customers were writing Kinesis consumers just to get data and move it to S3 with minimal modification, such as compression or encryption. To make your life easier, we created Amazon Kinesis Firehose. Firehose is part of the Amazon Kinesis family and helps you make data available to multiple destinations. With a few mouse clicks in the AWS Management Console, you can have Kinesis Firehose configured to get data from a Kinesis Data Stream and put it into a destination like Amazon S3, Redshift, Amazon Elasticsearch, HTTP endpoints, or third-party service providers such as Datadog, Splunk, and others.

Now let's talk about real-time analysis of the data that is passing through the stream. If you want to control the code that is analyzing your data, the easiest and most popular way is via AWS Lambda functions. Since both are AWS managed services, there is some polling done behind the curtains that invokes the Lambda function when there is new data in the stream. So the Lambda function receives the data without having to concern itself with the polling part, looking a lot like a push-based architecture. Interacting with Kinesis via Lambda is easier than hosting code on EC2, but it still requires you to write the code that will run in the Lambda function. Although there is some sample code in the AWS documentation, you may want to choose an even more convenient way of analyzing that streaming data.

If you want to go fully serverless without writing that code, you can rely on Kinesis Analytics, a powerful real-time processing service with no servers to manage, and as usual, paying only for what you use. Kinesis Analytics allows you to write SQL queries to process data in real time, providing an easy way to query streaming data without having to learn new frameworks or languages. You can also write Apache Flink code for more sophisticated analysis. Flink is an open source framework and part of the Apache Software Foundation. Kinesis Analytics has two main concepts that are easy to understand: in-application streams and data pumps. Those concepts provide you with the necessary abstraction to handle data that is passing through a data stream like water passing through a pipe. The additional readings of this week contain more information about those in-application streams and data pumps. The Kinesis Analytics documentation gives you detailed examples with sample queries ready to copy and paste to create both the in-application streams and data pumps. I strongly suggest you give it a try.
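To give you a taste before you head to the documentation, here is a minimal sketch of what that application code can look like. I am holding the SQL in a Python string just for illustration; you would paste the query itself into the Kinesis Analytics SQL editor. The destination stream, pump, and column names are illustrative assumptions; SOURCE_SQL_STREAM_001 is the default name Kinesis Analytics gives the in-application stream it creates from your input.

```python
# A sketch of Kinesis Analytics application code, not production SQL.
ANALYTICS_SQL = """
-- An in-application stream is like a table that exists only while
-- data flows through the application, like water through a pipe.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    sensor_id VARCHAR(16),
    avg_temp  DOUBLE
);

-- A pump continuously moves rows between in-application streams;
-- here it computes a one-minute tumbling-window average over the
-- source stream created from the application input.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "sensor_id",
           AVG("temperature") AS avg_temp
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY "sensor_id",
             STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
"""
```

The pump is what keeps data moving: it continuously inserts query results from the source in-application stream into the destination one, which you can then wire to an output such as a Firehose delivery stream.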
As a recap, this diagram shows where you could use Kinesis Analytics and Kinesis Firehose, to help you better understand the scope of each service. Last but not least, you can also use Kinesis to ingest video and build video analytics applications. Kinesis Video Streams makes it easy to stream video from connected devices to AWS. You can use Kinesis Video Streams to make an application that lets you interact with a camera-enabled doorbell from your mobile phone, for example; I will leave a small sketch of that below. And that's it. I hope you enjoyed learning a little bit more about the AWS Kinesis Family in this video.
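Here is that closing sketch of the doorbell scenario, again in Python with boto3. It shows how a backend might create a video stream and discover the endpoint a camera device sends media to; the stream name is an illustrative assumption.

```python
import boto3

kvs = boto3.client("kinesisvideo")

# Create a video stream that retains footage for 24 hours.
kvs.create_stream(StreamName="doorbell-camera", DataRetentionInHours=24)

# Devices discover where to send media by asking for a data endpoint
# scoped to the API they will call, here PUT_MEDIA for uploading video.
endpoint = kvs.get_data_endpoint(
    StreamName="doorbell-camera",
    APIName="PUT_MEDIA",
)["DataEndpoint"]
print(endpoint)  # the producer SDK on the device streams video here
```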