Data Lake with AWS S3 — Part 1/3

What is a data lake, and why should you consider building one for your business?

Essentially, a data lake is a way for you to monetise your data and extract business value from it.

Gaming Industry Leveraging Data Lakes for Customer Engagement

For instance, Fortnite from Epic Games has been a wildly successful game that scaled incredibly fast (to 125+ million players). Epic did it with an engagement model built around its customers: interactions with the game are monitored in near real-time, analytics run on that data, and the game is constantly customised to offer a better player experience. It is essentially a real-time feedback loop that makes the game very responsive to how customers actually play.

Example AWS Data Lake Architecture for the Gaming Industry

AWS can be used as a data lake, both to host the game platform and to run the analytics that keep that platform running and the players engaged.

Sample Illustration: AWS Data Lake Architecture for Gaming Industry

AWS S3 can store all the telemetry data coming in from a large number of gamers. That data can be analysed as it streams in, using a near-real-time pipeline architecture (comprising Spark and DynamoDB), and it can also be fed into batch pipelines (comprising S3, EMR, etc.) for deeper analytics and machine learning models later on. This gives the game both a real-time engagement engine and much deeper insights over time, to really optimise the game and add new features that keep it responsive.
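The article doesn't spell out the wiring of these two paths, so here is a minimal, hypothetical sketch from the game backend's point of view: the raw event lands in S3 for the batch pipeline, and a small per-player aggregate goes to DynamoDB for the real-time side. The bucket, table, and field names are invented for illustration; in a real design the aggregate would typically be produced by a stream processor such as Spark rather than inline.

```python
import json
import time

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

RAW_BUCKET = "game-telemetry-raw"                     # hypothetical bucket
STATS_TABLE = dynamodb.Table("player-session-stats")  # hypothetical table


def record_telemetry(event: dict) -> None:
    """Fan one telemetry event out to the batch and real-time paths."""
    # Batch path: land the raw event in S3, date-partitioned for later EMR jobs.
    key = f"events/dt={time.strftime('%Y-%m-%d')}/{event['event_id']}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(event).encode())

    # Real-time path: bump a per-player counter the engagement engine can read.
    STATS_TABLE.update_item(
        Key={"player_id": event["player_id"]},
        UpdateExpression="ADD event_count :one",
        ExpressionAttributeValues={":one": 1},
    )
```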

Data Lake as a Journey…

When you start to think about a data lake and what it means for your business, it's going to be a journey. Whether you're just starting out and have never done any meaningful analytics, are used to producing basic business insights, or are already a very innovative practitioner of analytics, there is always room for improvement and evolution in what you're doing.

So when we think about building data lake architectures, we want them to be architectures you can build upon over time, so that you can transform and innovate at the pace that makes the most sense for your business.

One of the fundamental concepts of data lake architecture is that you have to be able to evolve around your data seamlessly as your skills grow and, even more importantly, to innovate and experiment with your data non-disruptively: determine whether a new algorithm or a new processing tool adds value, and then quickly scale it into production. That essentially requires evolving your tools and methods around the data.

Why AWS for Big Data Analytics and Data Lakes?

A big part of this is agility. You want to innovate as fast as you are able to, and not have that blocked by the infrastructure, tools, or platform you use to drive the innovation. So AWS is focused on giving you a platform that you can:

  • Evolve in a very agile fashion, adding new functions, services, and capabilities as you need them.
  • Use to try things out: either fail fast and move on to the next idea at very low cost, or, if the experiment succeeds, quickly scale it up, because scale is the other key element of a data lake.

Agility & Scalability:

The platforms and tools AWS provides for building data lakes are inherently scalable, up to hundreds of petabytes and even exabytes of data.

Capabilities:

You need a broad and deep array of capabilities to bring to the data in order to get value out of it, because your use cases will be unique to you. As you move along the data lake journey, your skills will evolve and you will want to do new things and get new insights, so you need a portfolio broad enough not to inhibit you; whether it's an AWS-native service or a partner offering, you should always be able to find the right tool for the right job.

Cost:

Another key thing is a relentless focus on cost. If you're successful, your data volumes will grow beyond your imagination, but when you're starting out your budget won't grow anywhere near that fast. So costs have to stay optimised: the cost of the infrastructure must not scale in step with your usage.

Migration:

This probably isn't going to be an all-AWS solution. You will have legacy equipment, on-premises systems, or third-party data sources, so migrating that data and integrating it with the data lake has to be easy.

Faster Insights:

Faster insights are a competitive advantage: quicker time to market, and a quicker ability to offer new services. So getting to insights faster is one of the fundamental goals to focus on when architecting a data lake.

So how do we define the data lake at AWS?

There have been a lot of definitions out there, going back to the early Hadoop days when a data lake was really all about HDFS.

Defining the Data Lake

But we want to take a more expansive view. As we define it here, a data lake has to incorporate both relational and non-relational data. Traditional data lakes have been all about structured and semi-structured data from a variety of sources, but as newer use cases emerge we see more and more unstructured data types: video, radar data, LIDAR data, and so on. It's not just about Hadoop or about data warehousing; it's about a wide variety of tools that can dig in and do exactly what you want to do with the data.

Data Lake on AWS

So, breaking that down to the next level, what does a data lake on AWS actually look like? At the foundation, the central component is S3.

AWS Data Lake

1. Data Ingestion

So the first thing you have to do is get your data into S3.

Data Ingestion into AWS S3

AWS has a whole host of data ingestion tools to help you do that. Amazon Kinesis is a family of streaming-data services for ingesting data such as logs and streaming video. On top of this, Kinesis Analytics lets you analyse the data as it streams in and make decisions on it before it even lands in the data lake.
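As a minimal sketch of the producer side, here is what pushing a telemetry event into a Kinesis data stream might look like with boto3; the stream name and event shape are hypothetical.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical event; in a game this would come from the client or game server.
event = {"player_id": "p-123", "action": "match_start", "ts": 1690000000}

# Records sharing a partition key land on the same shard, which preserves
# per-player ordering for downstream consumers.
kinesis.put_record(
    StreamName="game-telemetry-stream",  # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["player_id"],
)
```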

You've also got a lot of traditional databases, either on-premises or in the cloud, holding relational data that you'll want to integrate into the data lake; for that, AWS offers the Database Migration Service (DMS).
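DMS is mostly driven by configuration (endpoints, a replication instance, and a replication task) rather than application code, but as a rough sketch, automating a task with boto3 might look like the following. All ARNs and identifiers are hypothetical placeholders, and the endpoints and replication instance are assumed to exist already.

```python
import json

import boto3

dms = boto3.client("dms")

# Replicate every table in the "sales" schema: full load plus ongoing changes.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-datalake",
    SourceEndpointArn="arn:aws:dms:eu-west-1:111122223333:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:eu-west-1:111122223333:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:111122223333:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)

# In practice you would wait for the task to reach the "ready" state first.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```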

For on-premises lab equipment that doesn't speak object storage or an analytics interface, but is used to talking to a file device, you can use AWS Storage Gateway to integrate it and shift the data to the cloud. Or, if you have an existing Hadoop cluster or data warehouse on-premises, you can set up AWS Direct Connect to establish a dedicated network connection between your on-premises environments and AWS services.

Finally, you may have a lot of data collected in on-premises storage devices that you want to get into the data lake, but it's difficult to keep those two worlds in sync. AWS introduced DataSync to help you do this. It's a very high-performance agent that you install and point at your existing on-premises storage, and it automatically transfers that data and keeps it synchronised with AWS. It's easy to use and fast, and it keeps the on-premises environments where you stage data synchronised with your data lake in AWS without any manual intervention.
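Creating the DataSync agent, locations, and task happens in the console or through separate API calls; as a hedged sketch, kicking off and monitoring one sync run with boto3 might look like this (the task ARN is a hypothetical placeholder):

```python
import time

import boto3

datasync = boto3.client("datasync")

# Hypothetical task ARN; the task (source file share, target S3 location)
# is assumed to have been created beforehand.
TASK_ARN = "arn:aws:datasync:eu-west-1:111122223333:task/task-0123456789abcdef0"

execution_arn = datasync.start_task_execution(TaskArn=TASK_ARN)["TaskExecutionArn"]

# Poll until the transfer reaches a terminal state.
while True:
    status = datasync.describe_task_execution(
        TaskExecutionArn=execution_arn
    )["Status"]
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(30)

print(f"DataSync run finished with status: {status}")
```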

Data ingestion is key to making your data actionable, and you've got to pick the right tool for each type of data.

2. Catalogue

The second part is really fundamental to building a data lake: without a data catalogue you don't have a data lake, you just have a storage platform. If you're actually going to take your data and get insights from it, you've got to know what you have, what type of data it is, what metadata is associated with it, and ultimately how different data sets relate to each other. That's where AWS Glue comes in: a very robust, very flexible data catalogue with which you can quickly crawl data, classify it, catalogue it, and then build insights on top of it.
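As a small sketch of that crawl-and-classify step, the following boto3 snippet creates and starts a Glue crawler over an S3 prefix; the crawler, role, database, and bucket names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# The IAM role must allow Glue to read the target bucket (hypothetical names).
glue.create_crawler(
    Name="telemetry-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="game_telemetry",
    Targets={"S3Targets": [{"Path": "s3://game-telemetry-raw/events/"}]},
)

# The crawler infers schemas and writes table definitions into the Glue Data
# Catalog, where Athena, EMR, and Redshift Spectrum can discover them.
glue.start_crawler(Name="telemetry-crawler")
```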

3. User Interface

The next element: once you've analysed the data and derived insights from it, you've got to be able to present it to a whole variety of users. You may do that directly, through analytics tools that speak SQL natively, or you may decide to put something like an API gateway in front of it and set up an almost shopping-cart-style data consumption model. AWS has a wide variety of tools, such as API Gateway, Cognito, and AppSync, to help you build those user interfaces on top of your data lake.
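One simple building block for such a consumption model, sketched here under assumptions not in the article, is handing a consumer a time-limited download link for a curated data set via an S3 presigned URL; the bucket and key are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical curated, ready-to-consume data set.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "curated-insights", "Key": "reports/daily-summary.parquet"},
    ExpiresIn=3600,  # the link stays valid for one hour
)

# Return `url` to the consumer, e.g. from an API Gateway-backed endpoint;
# they need no AWS credentials while the link is valid.
print(url)
```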

4. Security

Managing security and governance is also a foundational aspect. A data lake wouldn't be usable if it weren't secure, because a data lake is ultimately about taking a bunch of individual data silos, integrating them, and getting greater insights. And compared with securing a lot of scattered silos, security is a much easier job when you bring all the data and all the users onto a common platform. AWS has a whole array of security and management tools, which we'll go into more deeply later, to help you do that in a very secure, very robust, very granular fashion.
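Two baseline controls for the S3 buckets at the heart of the lake, shown here as a hedged sketch with a hypothetical bucket name, are blocking all public access and enabling default encryption at rest.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "game-telemetry-raw"  # hypothetical bucket

# Block every form of public access to the data lake bucket.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Encrypt every new object at rest by default with an AWS-managed KMS key.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}
        }]
    },
)
```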

5. Analytics

Ultimately a data lake is about getting value out of your data, and that comes down to which analytics tools you use. AWS has a whole host of native tools that let you query your data in place, such as Athena, Redshift Spectrum, and SageMaker, as well as a whole host of third-party tools that are more performant and scalable for workloads like Spark or data warehousing.
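To make "query in place" concrete, here is a hedged sketch of running an Athena query over a table such as the one the Glue crawler above might have catalogued; the database, table, and results bucket are hypothetical.

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical database/table, e.g. as catalogued by a Glue crawler.
query_id = athena.start_query_execution(
    QueryString="SELECT player_id, COUNT(*) AS events "
                "FROM telemetry_events GROUP BY player_id LIMIT 10",
    QueryExecutionContext={"Database": "game_telemetry"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```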

AWS S3: The Best Place for a Data Lake

  • AWS S3 was designed for 11 9s (99.999999999%) of durability and high levels of availability. It's Amazon's second-oldest service (about 13 years old) and operates at massive scale, spanning exabytes of data and trillions of objects.
  • With security as one of its foundational aspects, S3 has a very wide range of native security, compliance, and audit capabilities. As your data lake grows you may want control down to the individual object level, whether to apply very fine-grained security and access characteristics, or to set up intelligent data-management policies that help you optimise costs (see the lifecycle sketch after this list).
  • You're also going to want business insights into your data, which is different from analytic insights: this means looking at how your data is being consumed by different customers so you can charge them appropriately.
  • And finally, data ingestion capabilities. You've got to get the data in before you can do anything with it, and there are more ways to bring data into AWS S3 than into pretty much any other platform out there.
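As promised above, here is a hedged sketch of one such data-management policy: a lifecycle configuration, applied to a hypothetical bucket, that tiers raw telemetry into cheaper storage classes as it ages.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; tier raw telemetry down as it ages to cut storage cost.
s3.put_bucket_lifecycle_configuration(
    Bucket="game-telemetry-raw",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-telemetry",
            "Filter": {"Prefix": "events/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},      # archive
            ],
        }]
    },
)
```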

One of the ways you can improve time to insight is to work on the data where it lives, without having to move it to another platform. Processing the data in place is enabled by another powerful S3 feature, called S3 Select.
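As a hedged sketch of S3 Select in action, the following pushes a SQL filter down to S3 so that only matching rows of a CSV object cross the network; the bucket, key, and columns are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# S3 Select filters the object server-side; only matching rows are returned.
response = s3.select_object_content(
    Bucket="game-telemetry-raw",  # hypothetical bucket
    Key="events/2019-06-01.csv",  # hypothetical CSV object
    ExpressionType="SQL",
    Expression="SELECT s.player_id, s.action FROM s3object s "
               "WHERE s.action = 'match_start'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)

# Results arrive as an event stream; Records events carry the payload.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```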

And finally, it's got to be designed for low-cost storage and analytics, where you pay only for what you consume.

Continue Reading…

[1] Amazon S3 https://aws.amazon.com/s3/
[2] Amazon Athena https://aws.amazon.com/athena
[3] Amazon DynamoDB https://aws.amazon.com/dynamodb
[4] Amazon Elasticsearch Service https://aws.amazon.com/elasticsearch-service
[5] Amazon Kinesis https://aws.amazon.com/kinesis
[6] AWS Database Migration Service https://aws.amazon.com/dms
[7] AWS Storage Gateway https://aws.amazon.com/storagegateway
[8] Amazon QuickSight https://aws.amazon.com/quicksight
[9] Amazon Cognito https://aws.amazon.com/cognito
