OVERVIEW:
The following case study describes a data lake solution that we developed and continue to maintain for
one of our digital marketing clients.
This is our third iteration of implementing data lakes in AWS. This time, we took on the challenge of
committing to a serverless approach, so that the lake's cost at rest is minimal.
Every chain is only as strong as its weakest link, so we also took the opportunity to formally review
our solution against the AWS Well-Architected Framework and to optimize it to meet the framework's 5
pillars evenly: operational excellence, security, reliability, performance efficiency, and cost optimization.
Goals:
1) Host a full US consumer header of 300 million datasets with 400 data points each, updated monthly.
2) Manage 1 billion business leads and associated data records in structured, semi-structured, and unstructured formats.
3) Architect a solution optimized for business agility, monetization, compliance, and user experience.
4) Validate the implementation against the 5 pillars of the AWS Well-Architected Framework: operational excellence, security, reliability, performance efficiency, and cost optimization.
5) Completely separate data storage from data processing and shift towards a low-maintenance serverless implementation where possible.
6) Prepare the client for advanced data analytics and applications of true machine learning.
Technology:
We selected Amazon S3 for storage, secured by IAM and monitored via CloudWatch. The automated ingestion process uses
the AWS Glue crawler and Data Catalog. We maintain the metadata in DynamoDB. Automated ETL is executed by
Lambda functions invoked through API Gateway, which run Athena CTAS and ETL queries. Athena is now also the main
analytical query engine. Our admin dashboard is implemented in ReactJS, with Highcharts for data visualization, and
is connected to our API Gateway. Our operational Consumer Header services are still partially implemented in Node.js and
Python, hosted on EC2 with RDS / Postgres for quick relational transactions.
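
To illustrate the ETL step described above, the following is a minimal sketch of how a Lambda function might build and submit an Athena CTAS query. It is not our production code. The function names, the event fields (`target_table`, `select_sql`, `output_location`, `partitioned_by`, `database`, `query_results`), and the choice of Parquet as the output format are assumptions made for the example. The only AWS call used is the boto3 Athena client's `start_query_execution`.

```python
import json
from typing import Sequence


def build_ctas_query(
    target_table: str,
    select_sql: str,
    output_location: str,
    partitioned_by: Sequence[str] = (),
) -> str:
    """Build an Athena CTAS statement that writes the SELECT result as Parquet.

    Athena expects partition columns to come last in the SELECT list;
    this helper does not check that, so the caller must order them.
    """
    properties = [
        "format = 'PARQUET'",  # assumed output format for this sketch
        f"external_location = '{output_location}'",
    ]
    if partitioned_by:
        columns = ", ".join(f"'{name}'" for name in partitioned_by)
        properties.append(f"partitioned_by = ARRAY[{columns}]")
    with_clause = ",\n  ".join(properties)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n  {with_clause}\n)\n"
        f"AS {select_sql}"
    )


def lambda_handler(event, context):
    """Hypothetical Lambda entry point: submit the CTAS query and return its id."""
    import boto3  # imported here so build_ctas_query can be used without AWS access

    query = build_ctas_query(
        target_table=event["target_table"],
        select_sql=event["select_sql"],
        output_location=event["output_location"],
        partitioned_by=event.get("partitioned_by", ()),
    )
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": event["database"]},
        ResultConfiguration={"OutputLocation": event["query_results"]},
    )
    # Athena runs the query asynchronously; the caller polls with this id.
    return {
        "statusCode": 202,
        "body": json.dumps({"queryExecutionId": response["QueryExecutionId"]}),
    }
```

Because the query runs asynchronously, the function returns immediately with the execution id instead of waiting for the result. In this sketch the inputs are interpolated directly into the SQL string, so they should come only from trusted configuration and never from end-user input.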