InfoQ

The Software Architects' Newsletter
February 2023

Welcome to the InfoQ Software Architects’ Newsletter! Each month, we bring you essential news and experience from industry peers on emerging patterns and technologies.

This month, we focus on “Modern Data Processing: Data Pipelines, Streaming, and Data Mesh.” These core topics span the entire “diffusion of innovation” graph in our 2022 AI, ML, and Data Engineering InfoQ Trends Report. We see increasing adoption of stream processing, distributed computation, and "data lake-as-a-service."

Key challenges remain in this space, including being conscious about how data pipeline architecture decisions are made at scale (and with the required speed) and bringing the social and ethical elements into the sociotechnical systems in which we all now work.

News

Amazon Athena Now Supports Apache Spark Engine

Amazon Athena now supports the open-source distributed processing system Apache Spark to run fast analytics workloads. Data analysts and engineers can use Jupyter Notebook in Athena to perform data processing and programmatically interact with Spark applications.

DoorDash Introduces ML to Understand the Marketplace's Status

DoorDash recently introduced an ML model that predicts the operational status of a store to improve the user experience and prevent thousands of order cancellations. Understanding a merchant’s operational status and its ability to receive and fulfill orders is crucial for the DoorDash platform.

Azure Now Supports Database as a Service Couchbase Capella

Couchbase recently announced that Capella, its fully managed Database as a Service (DBaaS) offering, is available on Azure.

Couchbase Capella is a fully-managed JSON document and key-value database with SQL access and built-in full-text search, eventing, and analytics. It can be globally distributed via AWS and GCP through their data centers and availability zones. In addition, the company now makes the database available on Azure, allowing its customers the flexibility to use Capella across all three major cloud providers.

Streaming-First Infrastructure for Real-Time Machine Learning

Many companies have begun using machine learning (ML) models to improve customer experience. This InfoQ article explores the benefits of streaming-first infrastructure for real-time ML.

The article covers two scenarios of real-time ML: online prediction, in which a model receives a request and makes predictions, and continual learning, in which machine learning models continually adapt to changes in data distributions.

Resilient Real-Time Data Streaming across the Edge and Hybrid Cloud

In this QCon Plus talk recording, Kai Waehner explores different architectures and their trade-offs for transactional and analytical workloads. Real-world examples include financial services, retail, and the automotive industry.

 

Case Study

Building and Operating High-Fidelity Data Streams

At a previous QCon Plus, Sid Anand, Chief Architect at Datazoom and PMC Member at Apache Airflow, presented on building high-fidelity nearline data streams as a service within a lean team. In this summary of the talk, Anand provides a master class on building high-fidelity data streams from the ground up.

In streaming architectures, any gaps in non-functional requirements can be very unforgiving. Data engineers who don’t build these systems with the "ilities" front of mind spend much of their time fighting fires and keeping systems up. They must consider non-functional requirements such as scalability, reliability, and operability at every step in the pipeline.

To build a reliable data pipeline, conceptualize the data streams as a series of links in a chain where each link is transactional. Links are connected via Kafka topics, each of which provides transactional guarantees. Once the links are combined, the overall pipeline can become transactional.
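One way to picture a transactional link is the minimal Python sketch below. It assumes an in-memory Topic class as a stand-in for a Kafka topic and hypothetical Link and drain helpers; none of these are the Kafka API or Anand's implementation. The link advances its read offset only after the transformed message has been written downstream, so a failure leaves the message in place to be retried.

```python
class Topic:
    """In-memory stand-in for a Kafka topic (hypothetical, not the Kafka API)."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)


class Link:
    """One link in the chain: read from source, transform, write to sink."""

    def __init__(self, source, sink, transform):
        self.source = source
        self.sink = sink
        self.transform = transform
        self.committed = 0  # offset of the next unprocessed source message

    def run_once(self):
        """Process one message; return False when the source is drained."""
        if self.committed >= len(self.source.messages):
            return False
        msg = self.source.messages[self.committed]
        result = self.transform(msg)  # if this raises, the offset is unchanged
        self.sink.append(result)
        self.committed += 1           # acknowledge only after the write succeeds
        return True


def drain(link, max_retries=3):
    """Run the link until the source is empty, retrying failed messages."""
    retries = 0
    while True:
        try:
            if not link.run_once():
                return
            retries = 0
        except RuntimeError:
            retries += 1
            if retries > max_retries:
                raise


def make_flaky_upper():
    """Transform that fails once, on its second call, to simulate a fault."""
    state = {"calls": 0}

    def transform(msg):
        state["calls"] += 1
        if state["calls"] == 2:
            raise RuntimeError("transient failure")
        return msg.upper()

    return transform


source, sink = Topic(), Topic()
for m in ["a", "b", "c"]:
    source.append(m)

link = Link(source, sink, make_flaky_upper())
drain(link)
print(sink.messages)
```

In this sketch, a crash between the downstream write and the offset update would replay the message, so the design is at-least-once and downstream consumers would need to tolerate duplicates.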

The two most important top-level metrics for any streaming data pipeline are lag and loss. Lag expresses the amount of message delay in a system. Loss measures how many messages are lost as they transit the system.
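As a rough illustration of the two metrics, the Python sketch below computes per-message lag as the difference between receive and send timestamps, and loss as the count and fraction of sent messages that never arrived. The Delivery record, the function names, and the sample timestamps are assumptions made for this example, not definitions from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delivery:
    """A message observed at the end of the pipeline (hypothetical record)."""
    msg_id: str
    sent_ms: int      # when the message entered the pipeline
    received_ms: int  # when the message left the pipeline


def lags_ms(deliveries):
    """End-to-end lag of each delivered message, in milliseconds."""
    return [d.received_ms - d.sent_ms for d in deliveries]


def loss(sent_ids, received_ids):
    """Return (number of lost messages, fraction of sent messages lost)."""
    sent = set(sent_ids)
    lost = sent - set(received_ids)
    rate = len(lost) / len(sent) if sent else 0.0
    return len(lost), rate


# Made-up sample data: four messages sent, three delivered.
sent_ids = ["m1", "m2", "m3", "m4"]
deliveries = [
    Delivery("m1", sent_ms=0, received_ms=120),
    Delivery("m2", sent_ms=10, received_ms=90),
    Delivery("m4", sent_ms=30, received_ms=400),
]

lags = lags_ms(deliveries)
lost_count, loss_rate = loss(sent_ids, [d.msg_id for d in deliveries])

print("max lag (ms):", max(lags))
print("lost:", lost_count, "loss rate:", loss_rate)
```

Tracking the maximum (or a high percentile) of lag alongside the loss rate gives a pipeline-level view; the same calculation could be applied per link to locate where delay or loss is introduced.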

Most streaming data use cases require low latency (i.e., low end-to-end message lag) as well as low or zero loss. Building a lossless pipeline carries a performance penalty: engineers must give up some speed to achieve zero loss. However, some strategies minimize lag over an aggregate of messages (i.e., increase throughput via parallel processing). A per-message lag floor may remain, but throughput can still be maximized.
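The trade-off can be shown with simple arithmetic. The Python sketch below assumes an idealized model in which every message takes the same fixed processing time, messages are independent, and parallel workers add no coordination overhead; the function names and figures are illustrative only.

```python
import math


def drain_time_ms(n_messages, per_message_ms, workers):
    """Time to process all messages when workers run in parallel (idealized)."""
    batches = math.ceil(n_messages / workers)
    return batches * per_message_ms


def throughput_per_sec(n_messages, per_message_ms, workers):
    """Messages processed per second over the whole run."""
    return n_messages * 1000 / drain_time_ms(n_messages, per_message_ms, workers)


PER_MESSAGE_MS = 50  # assumed fixed cost per message: the lag floor
N = 1000

for workers in (1, 10):
    print(
        workers, "worker(s):",
        drain_time_ms(N, PER_MESSAGE_MS, workers), "ms total,",
        throughput_per_sec(N, PER_MESSAGE_MS, workers), "msg/s,",
        "lag floor", PER_MESSAGE_MS, "ms",
    )
```

Under these assumptions, going from one worker to ten raises throughput tenfold, while no individual message can complete in less than the 50 ms per-message cost.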

This content is an excerpt from a recent InfoQ article by Sid Anand, "Building & Operating High-Fidelity Data Streams".

To get notifications when InfoQ publishes content on these topics, follow "AI, ML, and Data Engineering", "Streaming", and "Data mesh" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

Sponsored

Cockroach Labs

Want to keep pace with the latest thinking and innovation in the world of application architecture? Find Big Ideas in App Architecture on your favorite streaming platform to hear real stories from data innovators at Greenhouse, DoorDash, Twilio, and more. Listen now!

Sponsored by Cockroach Labs

Upcoming events

QCon: For practitioners, by practitioners


QCon Software Conferences 2023: Early bird prices end March 6

Learn the use cases and best practices needed to stay ahead in software development. Join your peers in person, or get on-demand access and select live sessions with our new online ticket. Book before March 6th and save with limited early bird tickets.

QCon London - March 27-29, 2023.

QCon New York - June 13-15, 2023.

QCon San Francisco - October 2-6, 2023: Book before March 29th to secure launch pricing.

Senior software developers rely on the InfoQ community to keep ahead of the adoption curve. One of the main reasons software architects and engineers tell us they keep coming back to InfoQ is that they trust the information provided and selected by their peers.

We’ve been helping software development teams adopt new technologies and practices for over 15 years through InfoQ articles, news items, podcasts, tech talks, trends reports, and QCon software development conferences.

We hope you find this newsletter useful. If not, you can unsubscribe using the link below.

Unsubscribe

Forwarded email? Subscribe and get your own copy.
