InfoQ

The Software Architects' Newsletter
September 2021

Welcome to the InfoQ Software Architects' Newsletter! Each month, we bring you essential news and experience from industry peers on emerging patterns and technologies.

This month, we focus on the topic of "Architecting for Resilience". Designing for resilience was identified as an "early adopter" trend in the recent Architecture and Design InfoQ Trends Report and, together with the emergence of supporting technologies, it is becoming increasingly popular, particularly in cloud and microservice-based systems. However, key challenges remain for architects designing such systems, including both human and technical issues.

News

Building Reliable Systems and Teaching SRE Apprentices

In a recently published QCon Plus talk recording, Ana Margarita Medina, senior chaos engineer at Gremlin, shared how she has been using chaos engineering to build reliable systems. She also explored how chaos engineering can be used to decouple a system's weak points, learn from incidents, and improve monitoring and observability.

In a related InfoQ podcast, Thomas Betts spoke with Tammy Bryant Butow, principal SRE at Gremlin, about training new site reliability engineers. The discussion covered the formal SRE Apprenticeship program that Bryant Butow led at Dropbox and explored ideas about the best way to teach people new technical skills. When people put in the effort to create a formal training program, the trainees, the mentors, and the company all benefit.

Cloud Providers Publish Ransomware Mitigation Strategies

In the last few weeks, AWS, Azure, and Google Cloud have posted articles and documentation suggesting ransomware mitigation techniques for the cloud, highlighting the main protections and recovery preparation actions. Renato Losio recently provided a summary of these ransomware mitigations on InfoQ.

The preemptive actions suggested by AWS in the guide Ransomware mitigation: Top 5 protections and recovery preparation actions include encrypting data, setting up the ability to recover apps and data, applying critical patches to servers, following a defined security standard, and putting monitoring and automated responses in place. Azure focused on what to do before and during an attack to protect sensitive data and ensure a rapid recovery of business operations, and Google Cloud published the article Best practices to protect your organization against ransomware threats, where a similar list of five pillars is defined.

Designing a Microservices Architecture for Failure

This RisingStack article explores how a microservices architecture makes it possible to isolate failures through well-defined service boundaries, and introduces the most common techniques and architecture patterns for building and operating a highly available microservices system.

As with every distributed system, there is a higher chance of network, hardware, or application-level issues. Because of service dependencies, any component can be temporarily unavailable to its consumers. To minimize the impact of partial outages, we need to build fault-tolerant services that can gracefully respond to certain types of outages.
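
As a rough illustration of responding gracefully to a dependency outage, here is a minimal circuit breaker sketch in Python. It is not taken from the article: the `CircuitBreaker` class, its parameters, and the `fetch_recommendations` function are hypothetical names chosen for this example. After a configurable number of consecutive failures, the breaker stops calling the failing dependency for a cool-down period and serves a fallback instead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative, not a library API)."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the dependency entirely and degrade gracefully.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        return result


def fetch_recommendations() -> list:
    # Stand-in for a remote call that is currently failing (assumption).
    raise ConnectionError("recommendation service unavailable")


if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, reset_after=5.0)
    for _ in range(3):
        # Each call returns the empty fallback list instead of raising.
        print(breaker.call(fetch_recommendations, lambda: []))
```

The design choice here is to return a fallback value rather than propagate the error, so consumers see degraded but valid responses during a partial outage. Real systems would typically also add timeouts and metrics around the call.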

Related to this topic, in their upcoming book, Software Architecture: The Hard Parts, Neal Ford and Mark Richards explain that every architectural decision involves trade-offs, and they provide guidance on how to evaluate those trade-offs. In a recent episode of the InfoQ Podcast, co-host Thomas Betts spoke with Neal and Mark about the role of a software architect and the skills necessary to be successful. One of the hardest parts is recognizing that there are no right or wrong answers, or easy decisions, and this can be especially challenging for those who come from a programming background.

 

Case Study

Building Reliable Software Systems with Chaos Engineering

In a recent InfoQ Q&A, editor Ben Linders sat down with Casey Rosenthal, CEO and co-founder of Verica, and discussed how to build reliable software systems with chaos engineering.

Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that improve flexibility and feature velocity. An urgent question follows on the heels of these benefits: if we can move quickly, can we do so without breaking things?

The recently published O’Reilly book, Chaos Engineering, by Casey Rosenthal and Nora Jones, explores how Chaos Engineering practices can be used to navigate complexity and build more reliable systems. It explores frameworks for thinking about complexity, proposes key practices for embracing complexity via Chaos Engineering, and presents case studies from companies that have applied Chaos Engineering to business-critical systems.

Key takeaways from the Q&A included:

  • Complexity is inherent in all IT systems today. Many organizations try to fight or reduce it; however, it is more effective to embrace the complexity.
  • With Chaos Engineering, you can better understand the sociotechnical boundary between humans and machines: you learn about the technical issues and complexities of your systems, and you also expose knowledge gaps. In combination, these help people understand the properties of their systems and respond better when future challenges arise.
  • Organizations like Slack, Google, Microsoft, LinkedIn, and Capital One are navigating complexity with applied Chaos Engineering, providing better resilience and availability of their products and services.
  • Chaos Engineering isn’t about breaking things—rather, it is about learning things. This has been encapsulated in the Principles of Chaos, which is a set of practices centered on developing well-scoped, testable hypotheses that help identify weaknesses before they manifest in system-wide, aberrant behaviors.
  • ROI and business justifications for Chaos Engineering range from simply increasing team learning and understanding, to building a full-fledged objective measure of the impact on business outcomes (such as KPIs or system performance metrics).
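
To make the idea of a well-scoped, testable hypothesis more concrete, here is a toy sketch in Python. It is an assumption-laden illustration rather than anything from the Q&A or the book: `ToyService`, `success_rate`, and `run_experiment` are hypothetical names, and the "system" is an in-memory stand-in for a two-replica service. The hypothesis is that the request success rate stays within a tolerance of its baseline while a fault is injected.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExperimentResult:
    baseline: float        # steady-state success rate before the fault
    under_fault: float     # success rate while the fault is active
    hypothesis_held: bool  # True if the drop stayed within tolerance


class ToyService:
    """Hypothetical two-replica service used only for this illustration."""

    def __init__(self) -> None:
        self.replicas_up: Dict[str, bool] = {"a": True, "b": True}

    def handle(self, request_id: int) -> bool:
        names = sorted(self.replicas_up)
        primary = names[request_id % len(names)]
        if self.replicas_up[primary]:
            return True
        # Primary is down: retry on any other replica that is still up.
        return any(up for name, up in self.replicas_up.items() if name != primary)


def success_rate(service: ToyService, requests: int = 100) -> float:
    ok = sum(1 for i in range(requests) if service.handle(i))
    return ok / requests


def run_experiment(
    service: ToyService, victims: List[str], tolerance: float = 0.01
) -> ExperimentResult:
    # 1. Measure the steady state.
    baseline = success_rate(service)
    # 2. Inject the fault: take the chosen replicas down.
    for name in victims:
        service.replicas_up[name] = False
    try:
        # 3. Measure again while the fault is active.
        under_fault = success_rate(service)
    finally:
        # 4. Always roll the fault back, even if measurement fails.
        for name in victims:
            service.replicas_up[name] = True
    held = baseline - under_fault <= tolerance
    return ExperimentResult(baseline, under_fault, held)


if __name__ == "__main__":
    print(run_experiment(ToyService(), victims=["a"]))
    print(run_experiment(ToyService(), victims=["a", "b"]))
```

In this toy setup, losing one replica leaves the hypothesis intact because requests are retried on the other replica, while losing both refutes it. The point is the shape of the experiment (steady state, fault, comparison, rollback), not the simulated service itself.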

This content is an excerpt from a recent InfoQ Q&A with Casey Rosenthal written by Ben Linders: "Building Reliable Software Systems with Chaos Engineering".

To get notifications when InfoQ publishes content on these topics, follow "resilience", "chaos engineering", and "continuous improvement" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

Sponsored

VMware

"Is Kubernetes a platform? Infrastructure? An application? There is no shortage of thought leaders who can provide you their precise definition of what Kubernetes is. Instead of adding to this pile of opinions, let’s put our energy into clarifying the problems Kubernetes solves. Once defined, we will explore how to build atop this feature set in a way that moves us toward production outcomes. The ideal state of "Production Kubernetes" implies that we have reached a state where workloads are successfully serving production traffic."

Learn more about this topic with our free eBook "Production Kubernetes (By O’Reilly)".

Upcoming events

For practitioners by practitioners

QCon Plus Online Software Conference (Nov 1-12): Catch up on the trends that matter in software development.

How would you design and implement your APIs if you were starting today?

The "API Architecture" QCon Plus Nov 2021 track will highlight the latest tools and techniques around API design and usages including contract-first API development, how APIs are changing for use cases like big data and streaming, and operational aspects of dealing with APIs at scale. Find out more about the track hosted by Thomas Betts, Lead Editor for Architecture and Design at InfoQ.

Save $100 if you book your spot before Oct 9th.

InfoQ Live One-Day Online Event (Oct 19th): How to apply Microservices and DevSecOps to improve application maintainability, security, and deployment speed.

Get valuable insights from world-class domain experts on how increasing modularity and automatic security can help you scale and support distributed development, meeting customer demands. Join Christian Posta, Global Field CTO at Solo.io, and explore common challenges when adopting service mesh and how best to overcome them.

Book your spot at InfoQ Live on October 19th.

 

Senior software developers rely on the InfoQ community to keep ahead of the adoption curve. One of the main reasons software architects and engineers tell us they keep coming back to InfoQ is that they trust the information provided and selected by their peers.

We’ve been helping software development teams adopt new technologies and practices for over 15 years through InfoQ articles, news items, podcasts, tech talks, trends reports, and QCon software development conferences.

We hope you find this newsletter useful.