Data Lake with AWS S3 — Part 2/3

Processing Data in Place, S3 Select & FSx for Lustre

Piyush M Agarwal
5 min read · Nov 22, 2020

In our previous blog we discussed ways to get data into our systems: everything from IoT sensor data, traditional databases, on-premises equipment, traditional Hadoop clusters, and offline data. It's not going to be one type of data or another that you ingest; you're going to be ingesting all these different types of data into the data-lake concurrently, and it's all going to come in on different schedules.

Legacy Support from AWS

For a lot of enterprises, SFTP has been the way to transfer data. For decades, typical on-premises legacy environments have used it to move their data around, so AWS introduced a new transfer capability that is essentially a managed SFTP service: AWS Transfer Family. With this service you can connect on-premises legacy mainframes natively to your data-lake, so you don't have to change the processes you've already developed for data-migration and ingestion activities. It's fully managed, secure, compliant, cost-effective, and easy to use, and it integrates that existing environment into an AWS data-lake seamlessly.
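To make this concrete, here's a minimal sketch using boto3. It is not from the original article; the user name, IAM role, and bucket path are hypothetical placeholders.

```python
# Sketch: stand up a managed SFTP endpoint with AWS Transfer Family
# and map a user to a home directory in the data-lake bucket.
import boto3

transfer = boto3.client("transfer")

# Create a fully managed, SFTP-enabled server backed by S3
server = transfer.create_server(
    IdentityProviderType="SERVICE_MANAGED",  # users are managed by the service
    Protocols=["SFTP"],
)

# Map an SFTP user to a prefix in the data-lake bucket
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="legacy-mainframe",  # hypothetical user
    Role="arn:aws:iam::123456789012:role/transfer-s3-access",  # placeholder role
    HomeDirectory="/my-datalake-bucket/ingest",  # placeholder bucket/prefix
)
```

Existing SFTP clients and scripts can then point at the new endpoint unchanged, which is what makes the legacy integration seamless.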

Another challenge is keeping the on-premises and data-lake worlds in sync, so AWS introduced AWS DataSync. It's an agent that you install and point at your existing on-premises storage, and it will automatically transfer and then synchronise that data with AWS. It's easy to use and high performance, and it keeps your on-premises data synchronised with your AWS data-lake without any manual intervention.
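As a rough sketch of how that looks with boto3 (the hostname, ARNs, and bucket names below are placeholders, and the DataSync agent itself is assumed to be installed already):

```python
# Sketch: define a DataSync task that syncs an on-premises NFS share to S3.
import boto3

datasync = boto3.client("datasync")

# On-premises source, reached through the installed DataSync agent
source = datasync.create_location_nfs(
    ServerHostname="nas.example.internal",  # hypothetical NFS server
    Subdirectory="/exports/datalake",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:123456789012:agent/agent-0123"]},
)

# S3 destination in the data-lake
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-datalake-bucket",  # placeholder bucket
    Subdirectory="/ingest",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::123456789012:role/datasync-s3-access"},
)

task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="nfs-to-datalake",
)

# Run the sync (this can also be put on a schedule)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```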

But ultimately, once you get that data in and start to think about how to process it, a key concept for driving a quicker time to result is going to be processing that data in place.

Processing Data in Place

So once you get the data in S3, you have AWS serverless services that can work directly against that data in S3. These are:
- Amazon Athena for ad-hoc queries
- Amazon Redshift Spectrum for a high-performance data warehouse
- Amazon SageMaker for training and deploying machine-learning models, and
- AWS Glue to catalog and transform that data and get it ready for analytic use.

Let’s deep dive into these.

Models to Process Data and Get Insights

The first is around user-defined functions, where you may have custom functions that you've developed to analyse your data that just run on compute. The usual path AWS recommends for that is its serverless computing capability, AWS Lambda. You're bringing your own functions and code, but you can execute them against data in your data-lake without having to worry about provisioning, managing, and operating physical servers.
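For illustration, here's a minimal sketch of such a function (not from the article): a Lambda handler wired to S3 event notifications, so your custom code runs on every new object. The bucket trigger is configured separately, and the analysis itself is assumed.

```python
# Sketch: a Lambda handler that runs custom analysis on new data-lake objects.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # S3 event notifications deliver one or more records per invocation
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... your custom analysis code goes here ...
        print(f"processed {key}: {len(body)} bytes")
```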

Furthermore, if you want more standardised functions, like running SQL queries on your data, operating a data warehouse, or cataloging and transforming your data, AWS has a wide range of fully managed serverless services.

You just worry about interfacing with and running the functionality that you want; you don't worry about the underlying physical infrastructure.

If you take something like AWS Glue, you can define crawlers that will crawl and catalog your data in S3, and then transformation jobs that will transform it, without you worrying about physical infrastructure. The same goes for querying your data: if you just want to run ad-hoc SQL queries, rather than spinning up a third-party tool like Presto running on Hadoop, you can go to the Athena console and type and execute your SQL queries while AWS handles all the underlying physical infrastructure.
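The same flow works programmatically. A minimal sketch with boto3 (the crawler name, IAM role, database, table, and S3 paths are placeholders):

```python
# Sketch: catalog data with a Glue crawler, then query it in place with Athena.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Crawl the raw zone of the lake and populate the Glue Data Catalog
glue.create_crawler(
    Name="datalake-raw-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",  # placeholder role
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/"}]},
)
glue.start_crawler(Name="datalake-raw-crawler")

# Once the crawler has created the table, run an ad-hoc SQL query on it
athena.start_query_execution(
    QueryString="SELECT device_id, COUNT(*) FROM events GROUP BY device_id",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake-bucket/athena-results/"},
)
```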

Well, it doesn't end here…

Even with a serverless tool like Athena or Redshift Spectrum, if you wanted to access an object, you accessed the whole object: the analytic tool scanned the entire object, pulled out the data it needed, and threw the rest on the floor. That was just a fundamental fact of life.

Is there a way to accelerate this, by querying the data intelligently in the storage layer itself? Welcome S3 Select 👍

AWS S3 Select

AWS introduced S3 Select: rather than fetching a whole object and retrieving all of it before you could do something with it, you can push a SQL statement down to the storage layer, and the storage layer will return only the part or parts of that object that match the SQL expression. So now you've got the storage layer itself doing a lot of the heavy lifting of scanning and filtering data before it ever gets returned to the analytics tool.

Essentially this means you're pulling and accessing less data out of the storage layer, which translates into higher performance, because the storage layer does the scanning for you and your data doesn't have to go over the network to another tool. It also means much lower cost, because a lot of analytic tools charge by the amount of data processed and scanned; if you do that scanning in the storage layer and only return the qualified data, that's up to 80% lower cost.
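Here's what that push-down looks like in practice, as a minimal boto3 sketch (bucket, key, and column names are placeholders):

```python
# Sketch: push a SQL filter down to the storage layer with S3 Select,
# so only matching rows come back over the network.
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="my-datalake-bucket",
    Key="raw/events.csv",
    ExpressionType="SQL",
    Expression=(
        "SELECT s.device_id, s.temperature FROM s3object s "
        "WHERE CAST(s.temperature AS FLOAT) > 30"
    ),
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; collect only the Records events
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```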

Amazon FSx for Lustre, the final frontier

In domains like autonomous vehicles, you've got both traditional analytical data and a lot of datasets like video, images, and radar data. There you would use HPC models: spin up big compute instances, build a specialised high-performance storage layer, and then do your high-performance compute.

Another domain is healthcare and life sciences, where you're looking at things like genomics data or scientific research. You really want to combine HPC methods with traditional data-lake methods and integrate all those types of data on a single platform.

So to accommodate these workloads, AWS introduced Amazon FSx for Lustre.

Amazon S3 with FSx for HPC compute

Lustre has been the gold standard for large HPC compute architectures. Rather than having to build two separate environments to do this new type of processing, AWS allows you to put all your data in S3 and then spin up a very high-performance file system in front of S3 that can drive HPC compute clusters.

You can load the data from S3 into FSx for Lustre and process it with a large-scale compute cluster. This gives you a way to do very high-performance compute at very low cost and very large scale, where you only pay while those high-performance compute jobs are running.
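As a final sketch (again with boto3; the subnet, bucket paths, and sizing are placeholder values), linking a Lustre file system to an S3 bucket so HPC nodes see the lake as a POSIX file system:

```python
# Sketch: create an FSx for Lustre file system backed by the data-lake bucket.
import boto3

fsx = boto3.client("fsx")

fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,  # in GiB; placeholder sizing
    SubnetIds=["subnet-0123456789abcdef0"],  # placeholder subnet
    LustreConfiguration={
        "ImportPath": "s3://my-datalake-bucket/raw/",      # load objects lazily from S3
        "ExportPath": "s3://my-datalake-bucket/results/",  # write results back to S3
    },
)
```

Compute nodes then mount the file system with the standard Lustre client and run their jobs against it at file-system speeds.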

Continue Reading…

[1] Amazon S3 https://aws.amazon.com/s3/
[2] Amazon Athena https://aws.amazon.com/athena
[3] S3 SELECT Command https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html
[4] Amazon FSx for Lustre https://aws.amazon.com/fsx/lustre/
[5] AWS Lambda https://aws.amazon.com/lambda/
[6] AWS Transfer Family https://aws.amazon.com/aws-transfer-family/
[7] AWS DataSync https://aws.amazon.com/datasync/

Piyush M Agarwal

Architect @fosfordata, Founded and sold @canishub @onswap, worked @paypal @oracle @ThaleaDigiSec @mastekltd