AWS Data Platform Architecture Primer

Surbhi Nijhara
9 min read · Jan 18, 2022

The foundational concepts that underpin the cloud-based architecture of a modern data platform

Contents

Abstract

Introduction

Architecture Lens

Architecture Blueprint

Reference Cloud Implementation

Measure the Designed Pillars

Conclusion

References and Further Reading

Abstract

This article helps you understand and address the building blocks of your data platform journey: how should the platform be approached from an architecture standpoint? Understanding the design considerations can inform your organisation's business case and decision-making process when evaluating the value-realisation outcomes of a modern data platform. This blog provides a primer on the architecture, enabling you to apply proven methodologies, at-scale benchmarking, cost modelling, and operational efficiency, and thus realise value from the modern technology, tools, and workflows used for building and operating the data platform.

Introduction

Building and operating a modern data platform (MDP) is a much-needed capability for most organisations on their digital enablement journeys. The ability to glean actionable insights from ever-growing data arriving from varied sources is itself a measure of an organisation's maturity in the space of evolving technologies and trends.

Even though MDPs do not always have to be cloud-based, cloud capabilities often play an essential role in building them, thanks to efficient cost models, elastic scalability, and flexible managed services. Typical traditional enterprise data platforms (EDPs) live in on-premises or hybrid customer data centers and are made up of traditional data sources such as OLTP databases and data warehouses. The traditional tools and processes for data acquisition, preparation, and analytical reporting used in EDPs face limitations around the velocity and variety, and hence the veracity, of data.

MDPs built with cloud computing services and cloud-managed data stores provide unlimited object storage, managed relational and NoSQL databases, MPP data warehouses, Spark clusters, analytics notebooks, message queues, and middleware. Further, the managed and resilient toolchain and orchestration services from cloud platforms make it straightforward to chain them all together.

Today, cloud and database vendors offer promising solutions that let customers process and store huge volumes of data in varied formats as platform services built on a well-architected framework. Customers have to worry less about high availability, scalability, backup, and database operations.

Even so, the architecture of the platform needs to align with specific customer needs, and a suitable contextual blueprint should be drawn.
This paper aims to take you through the building blocks of a modern data platform architecture and a reference implementation guide on one of the cloud platforms.

Architecture Lens

Architecture considerations require a clear business vision in which all organisation stakeholders are aligned to the common goal of giving their customers a performant and flexible experience whenever they work with their data.

What should be the focus lens of the architecture team?

Underline the business outcome > First understand and gather the business expectations of the required data platform. The organisation may already have a traditional enterprise data analytics platform in use by its customers. Understand the current and potential challenges, the growth rate of data, user-experience feedback, the competitive index, and, last but not least, the cost model of operating the existing platform.

Prototype Early > As you gain a comprehensive understanding, start prototyping for early feedback. Avoid wasting effort on building technology stacks from scratch: building solutions from the ground up is expensive, time-consuming, and very rarely provides any direct value to your organisation.

Ask, Re-Ask and Delegate to Cloud > There are many cloud-managed services to choose from when building and orchestrating the MDP on the cloud. To make the right choice, some of the business and technical questions below should be answered, at times more than once: the more often you ask, the more answers and detail you receive, and the more decisively you can choose which cloud service to delegate to.

How many types of data source connectors are required?
This will help decide whether a single cloud-native ingestion service will suffice or other ingestion tools need to be considered.

What is the tenancy pattern of the relational data sources and the required data sink? Tenancy patterns influence the design to a great extent. Single tenancy for analytical purposes is often asked for by the organisation and its customers. This can mean a lot of storage and many parallel connections between source and sink for the initial load as well as for change data capture management.

How many customers and tables can be updated concurrently, and how should the design capture those incremental changes?

This will help choose appropriate cloud-native workflow services combined with event-based serverless functions versus a single service to orchestrate.

What is the acceptable near-real-time and batch window?
This will guide the right service configurations and whether to schedule or trigger the pipelines.

How complex is the transformation logic?
This will help decide whether cloud-provided Spark-based APIs can be used or whether native SQL-based transformation scripts will be more performant and less expensive.

Which services are required to exploit the data to realise and deliver value throughout the business?

The answer will be useful in deciding the set of analytical tools for exploring your data and unearthing its value.

Measure Continuously > Learn from your prototype and map it back to the business expectations. Without measuring business value continuously, you may drift from the possibly changing product requirements and will not be able to justify any expenditure or derive any value from your early testing in the project. A number of metrics can be measured to validate the success of the experiments.

Architecture Blueprint

Once the architectural considerations and measurable metrics are captured, an architectural blueprint should be outlined. This should be followed by prototypes to assess technology and methodologies and, most importantly, to explore the data and indicate whether the blueprint can be incubated into an end-to-end implementation that delivers the business goals. Below is a reference architecture that fulfills most data platform requirements.

Data Platform Logical Architecture Blueprint

Because the aim is to get started on building your data platform, let's break the blueprint down into the S-quadruple typical of any data platform project:

  • Sources for Data Ingestion
  • Stages for Data Processing and Transformation
  • Sinks for Data Analytics, Visualization, and Machine Learning
  • Services for Data Access

Defining the 4S — Source, Stage, Sink, and Service for the above phases enables us to achieve the functional business layers.

Further, the layers below should span the 4S to fulfill the well-architected framework for any modern data platform.

● Orchestration
● Security
● Data Ethics and Governance

Let us briefly look at why each of the above matters.

Source

What could your different data sources be?
Are they event-based, streams, or third-party feeds?
Or are they from transactional systems?
This will help you decide whether a real-time streaming or a batch-based ingestion mechanism has to be built.
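
To make the streaming-versus-batch choice concrete, here is a minimal, hedged sketch using boto3: one path pushes an event onto an Amazon Kinesis stream, the other lands an extracted batch file in S3. The stream, bucket, key, and event fields are illustrative placeholders, not names from the reference architecture.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

def ingest_event(event: dict) -> None:
    """Real-time path: push a single event onto a Kinesis stream."""
    kinesis.put_record(
        StreamName="orders-stream",                 # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["order_id"]),
    )

def ingest_batch_file(local_path: str) -> None:
    """Batch path: land an extracted file in the raw zone of the data lake."""
    s3.upload_file(local_path, "my-data-lake", "raw/orders/orders_2022-01-18.csv")

# Example usage
# ingest_event({"order_id": 42, "amount": 99.5})
# ingest_batch_file("/tmp/orders.csv")
```

The real-time path is typically invoked per event at the edge of the platform, while the batch path runs on a schedule against the landing zone.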

Stage

What could the various stages of the data be as it moves from its as-is state to its to-be state?

As data moves from the data source(s) to the data sink(s), it goes through various stages such as raw ingestion, cleansing and deduplication, and transformation. The process is commonly known as ETL or ELT, depending on the context. Each stage may require its own staging area for the processing it performs.

For example, the Batch landing zone shown in the diagram is an intermediate storage area used for data extraction from transactional data sources.
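
As an illustration of a cleansing and deduplication stage, the sketch below assumes a Spark runtime (for example AWS Glue or EMR) and placeholder S3 paths; it reads raw files from the landing zone, removes incomplete and duplicate rows, and writes columnar output to a curated zone.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse-orders").getOrCreate()

# Read raw CSV landed by the ingestion step (paths are placeholders).
raw = spark.read.option("header", True).csv("s3://my-data-lake/raw/orders/")

# Cleanse: drop rows missing the business key, normalise the date, deduplicate.
cleansed = (
    raw.dropna(subset=["order_id"])
       .withColumn("order_date", F.to_date("order_date"))
       .dropDuplicates(["order_id"])
)

# Write to the curated zone in a columnar format for downstream transformation.
cleansed.write.mode("overwrite").parquet("s3://my-data-lake/curated/orders/")
```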

Sink

How do you want to analyze and visualize your data?
What kind of intelligent insights would you like to derive from your data?
Sinks are often data warehouses, data marts, or other data repositories.
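
For a Snowflake sink, a hedged sketch of a bulk load might look like the following, using the snowflake-connector-python package. The account, warehouse, external stage, and table names are hypothetical; in a real pipeline, credentials would come from a secrets manager rather than being hard-coded.

```python
import snowflake.connector

# Connection parameters are placeholders; read them from a secrets manager in practice.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="SALES",
)

try:
    cur = conn.cursor()
    # Load curated Parquet files from an external S3 stage into the warehouse table.
    cur.execute("""
        COPY INTO SALES.ORDERS
        FROM @CURATED_STAGE/orders/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
finally:
    conn.close()
```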

Service

Who wants to access and explore the data and in which form?

Online data services, especially those that need raw and processed data from the real-time data sources, should be considered throughout the platform.
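
A data service can be as small as a serverless function that returns pre-processed results to an online consumer. The sketch below is a hypothetical AWS Lambda handler reading a pre-aggregated record from a DynamoDB table; the table name and key schema are assumptions for illustration.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("order_metrics")  # placeholder table of pre-aggregated results

def lambda_handler(event, context):
    """Serve a processed metric to an online consumer, e.g. behind API Gateway."""
    customer_id = event["pathParameters"]["customer_id"]
    item = table.get_item(Key={"customer_id": customer_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    # default=str handles DynamoDB Decimal values when serialising.
    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```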

Orchestration

How will an ETL pipeline run in an automated way, with graceful error handling and logging of important checkpoints?
How will monitoring, alerting, and remediation occur?
Answering these questions addresses the necessary error recovery and saves overall time in the ETL processing across the different stages.
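
A minimal sketch of the graceful-error-handling and checkpoint-logging idea, assuming the stages are implemented as AWS Glue jobs started from an orchestration layer; the job names and arguments are placeholders.

```python
import logging
import boto3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl-orchestration")

glue = boto3.client("glue")

def run_step(name: str, job_name: str, arguments: dict) -> str:
    """Run one ETL stage as a Glue job, logging checkpoints and failing loudly."""
    logger.info("checkpoint: starting stage %s", name)
    try:
        run = glue.start_job_run(JobName=job_name, Arguments=arguments)
        logger.info("checkpoint: stage %s started, run id %s", name, run["JobRunId"])
        return run["JobRunId"]
    except Exception:
        logger.exception("stage %s failed; earlier stages are left intact for retry", name)
        raise

# Example usage (job name and arguments are placeholders)
# run_step("cleanse", "cleanse-orders-job", {"--source_prefix": "raw/orders/"})
```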

Like orchestration, security, including authentication, authorization, encryption, and data masking, should span the sources, stages, sinks, and data services.

Security

Who should have access to which environment and what should be those accesses?
Which regulatory compliances are required to be adhered to?
Which data fields have to be masked for which access?
Data security needs to be considered and applied from various angles. Secure data is the central pillar of a data platform, and no compromise on this front will ever be acceptable to any of the organisation's customers.
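
Data masking can be applied at the service layer as well as in the pipeline. The sketch below is a simple, hypothetical field-masking helper that hashes PII fields for non-privileged roles; the field list and role names are assumptions for illustration.

```python
import hashlib

PII_FIELDS = {"email", "phone_number"}  # fields to mask for non-privileged roles

def mask_record(record: dict, user_role: str) -> dict:
    """Return a copy of the record with PII fields masked unless the role is privileged."""
    if user_role == "data_steward":
        return dict(record)
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS and value is not None:
            # Deterministic one-way hash keeps joins possible without exposing raw values.
            masked[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
        else:
            masked[key] = value
    return masked

# Example usage
# mask_record({"customer_id": 7, "email": "a@b.com"}, user_role="analyst")
```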

Data Ethics and Governance

Is your data usable, accessible, and protected?
With time and data growth, is data quality improving? Is the data management cost decreasing, and is access to data for all stakeholders increasing?
Data Governance provides a holistic view of data trust across five key pillars of observability, including freshness, schema, and lineage.
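
Freshness, one of the observability pillars mentioned above, can be checked with a small probe. The sketch below assumes the curated zone lives under an S3 prefix and flags the data as stale when no object has been written within a configurable window; the bucket and prefix names are placeholders.

```python
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

def check_freshness(bucket: str, prefix: str, max_age_hours: int = 24) -> bool:
    """Return True if at least one object under the prefix was written recently."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = response.get("Contents", [])
    if not objects:
        return False
    newest = max(obj["LastModified"] for obj in objects)
    return datetime.now(timezone.utc) - newest < timedelta(hours=max_age_hours)

# Example usage (bucket and prefix are placeholders)
# if not check_freshness("my-data-lake", "curated/orders/"):
#     raise RuntimeError("curated orders data is stale")
```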

Reference Cloud Implementation

The data platform architecture can be realised on different cloud platforms, either entirely natively or using hybrid clouds. In this paper, we look at a reference implementation built from data platform components of two vendors: AWS and Snowflake.

Cloud-based Architecture

Reference AWS Cloud Data Platform Architecture

An Amazon S3-based data lake and Snowflake as the data warehouse are used in the reference architecture. For powerful visualizations, Tableau is hosted on an auto-scaling EC2 cluster. The horizontal spectrum consists of out-of-the-box AWS offerings for orchestration, monitoring, security, and data governance.

Cloud-based Orchestration

As seen earlier in this document, modern data platforms depend on extract, transform, and load (ETL) operations to convert raw information into usable data in bulk. Implementing an ETL orchestration process that is as loosely coupled as possible is therefore an important design consideration. Orchestration, again, will depend on your specific sources, stages, and target sinks. A reference implementation using AWS native services is shown below.

Reference ETL Orchestration using AWS serverless
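
As a hedged sketch of such a serverless orchestration, the snippet below registers a minimal AWS Step Functions state machine with boto3 that runs a Glue job and then invokes a Lambda function, with a catch-all failure state. The job name, Lambda ARN, role ARN, and account details are placeholders and do not describe the exact pipeline in the diagram.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Minimal Amazon States Language definition: run a Glue job, then a crawler via Lambda.
definition = {
    "StartAt": "RunCleanseJob",
    "States": {
        "RunCleanseJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "cleanse-orders-job"},
            "Next": "StartCrawler",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        },
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:start-crawler",
            "End": True,
        },
        "NotifyFailure": {"Type": "Fail", "Cause": "ETL stage failed"},
    },
}

sfn.create_state_machine(
    name="orders-etl-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/etl-orchestration-role",
)
```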

Cloud Infrastructure-as-Code

Use serverless computing and Infrastructure-as-Code (IaC) to implement and administer a data platform on the cloud. The following is a reference implementation of a Continuous Integration/Continuous Deployment (CI/CD) process covering both code and infrastructure deployment, using services such as Azure DevOps and Terraform, a cloud-agnostic IaC tool.

Reference CI/CD Orchestration for ‘Infrastructure’ deployment
Reference CI/CD Orchestration for ‘Code’ deployment
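
Azure DevOps pipelines are normally defined in YAML and Terraform modules in HCL, so the snippet below is only an illustrative Python wrapper showing the command sequence a deployment stage typically runs; the workspace name is a placeholder.

```python
import subprocess

def run(cmd: list) -> None:
    """Run one CLI step and fail the pipeline stage if it exits non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def deploy_infrastructure(workspace: str) -> None:
    """Mirror the pipeline stages: init, select workspace, plan, then apply."""
    run(["terraform", "init", "-input=false"])
    run(["terraform", "workspace", "select", workspace])
    run(["terraform", "plan", "-input=false", "-out=tfplan"])
    run(["terraform", "apply", "-input=false", "tfplan"])

if __name__ == "__main__":
    deploy_infrastructure("dev")   # environment name is a placeholder
```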

Measure the Designed Pillars

The pillars for building any architecture, including a modern data platform, stay the same. What differs is the context of the architecture and the knowledge of the frameworks and tools that can help stand up those pillars efficiently and quickly. Cloud technologies not only help you create these pillars but also serve your architecture implementation well by providing ways to measure and improve them continuously.

Performance and Reliability

➔ Always benchmark the data pipelines, even during experiments, to measure the scalability and reliability of analytics pipelines and predict their behavior under increased workloads.

➔ Optimise the runtime of data pipelines through parallel execution of ETL for incremental data, as in the sketch below.
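
A hedged sketch of the parallel-execution idea, assuming the incremental loads are parameterised AWS Glue job runs fanned out per table; the job name, argument key, and table list are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import boto3

glue = boto3.client("glue")

# Tables to refresh incrementally in one batch window (names are placeholders).
TABLES = ["orders", "customers", "payments", "shipments"]

def run_incremental_load(table: str) -> str:
    """Start a parameterised Glue job for one table and return its run id."""
    run = glue.start_job_run(
        JobName="incremental-load-job",
        Arguments={"--table_name": table},
    )
    return f"{table}: {run['JobRunId']}"

# Fan the per-table loads out in parallel instead of looping sequentially.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_incremental_load, t) for t in TABLES]
    for future in as_completed(futures):
        print(future.result())
```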

Operations and Security

➔ Evaluate the type of storage needed based on access patterns. Create archival and deletion life cycles for data at every stage, as in the sketch after this list.

➔ Monitor ETL and analytics pipeline health. A low-to-no error and exception index indicates that platform operations are in good shape.
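
The archival and deletion life cycles mentioned above can be codified as S3 lifecycle rules. The sketch below uses boto3 to archive raw-zone objects to Glacier after 90 days and expire them after two years; the bucket name, prefix, and retention periods are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Archive raw-zone objects after 90 days and expire them after two years
# (bucket name, prefix, and retention periods are placeholders).
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```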

Cost Modelling

➔ Identify the changing workload patterns, velocity, variety, and infrastructure usage along with access patterns, and choose cloud-based services accordingly. If not adopted initially, consider upgrading the architecture to serverless cloud alternatives to reap the benefits of the pay-per-use model.

Conclusion

Customers struggle to start their platform projects because it is difficult to weigh design aspects without knowledge, experience, or foresight of their unique requirements as an organisation. Without prescriptive guidance, projects fail to get budget approvals and organisations miss the enormous value that data-driven insights can offer. This article offers a way forward. We have shown how you can approach the challenge and the unknown. You can use the reference templates to build a picture of what your modern data platform will look like. They can help your organisation start a journey towards making data-based decisions during architecture considerations and drive business value, benefiting both your organisation and its customers.

References and Further Reading

AWS Documentation: https://aws.amazon.com/big-data/datalakes-and-analytics/
