Welcome to the InfoQ Software Architects' Newsletter! Each month, we bring you essential news and experience from industry peers on emerging patterns and technologies.

This month, we focus on the topic of "Architecting for Resilience". Identified as an "early adopter" trend in the recent Architecture and Design InfoQ Trends Report, designing for resilience is becoming increasingly popular, as are the technologies that support it, particularly in cloud and microservice-based systems. However, key challenges remain for architects designing such systems, including both human and technical issues.

News

Making Sense Out of Incident Metrics

Vanessa Huerta Granda, Solutions Engineer at Jeli, recently wrote a very informative post on the Learning from Incidents website that focused on making sense out of incident metrics. She outlined her evolving thinking on:

Capturing "data" through incident analysis that focuses on learning (e.g., interviews, review meetings, and larger discussions with leaders and experts).
Periodically examining metrics that leadership pays close attention to (e.g., MTTR / incident count) and investigating the context that the data gathered from incidents provides.
Presenting the above in a digestible format to get buy-in for recommendations for future focus.
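
As a rough illustration of the two leadership metrics named above, here is a minimal sketch of how incident count and MTTR might be computed. It assumes MTTR means "mean time to resolve" and that each incident is recorded as a start and resolution timestamp; the sample records and the function names `incident_count` and `mttr` are hypothetical and do not come from the post.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (started, resolved) timestamps.
incidents = [
    (datetime(2021, 6, 1, 9, 0), datetime(2021, 6, 1, 9, 45)),
    (datetime(2021, 6, 10, 14, 0), datetime(2021, 6, 10, 16, 0)),
    (datetime(2021, 6, 20, 22, 30), datetime(2021, 6, 20, 23, 15)),
]


def incident_count(records):
    """Number of incidents in the reporting period."""
    return len(records)


def mttr(records):
    """Mean time to resolve: the average of (resolved - started)."""
    if not records:
        raise ValueError("no incidents recorded for this period")
    total = sum(
        (resolved - started for started, resolved in records),
        timedelta(),  # start the sum from a zero duration
    )
    return total / len(records)


print(incident_count(incidents))  # 3
print(mttr(incidents))  # 1:10:00
```

Numbers like these are only a starting point. The post's emphasis is on the context that interviews and review meetings add to them.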

Cloud Native and DevOps for 2021: Chaos Engineering

In a recent keynote for The DEVOPS Conference, Cheryl Hung, VP ecosystem for the Cloud Native Computing Foundation (CNCF), shared her top 10 predictions for cloud native in the upcoming year. These included improvements in cross-cloud support, growth in GitOps and chaos engineering practices, and an increase in the adoption of FinOps.

Two recent CNCF projects, Litmus and Chaos Mesh, focus on enabling chaos engineering practices within Kubernetes environments. Hung states, "I think [chaos engineering] is actually a very sensible way to handle infrastructure problems and I’m a little bit surprised that it’s not already more widespread".

Why the Most Resilient Companies Want More Incidents

Companies want more incidents "because companies want more learning". According to John Egan, co-founder and CEO at Kintaba, the incident management process is meant to be a cycle: responding to the incident, accounting for its root cause, and updating internal processes and practices across the industry.

Egan, former co-founder and product lead of Workplace by Facebook, spoke at QCon Plus May 2021 about how tech organizations do incident management. He recommended lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortem data.

Gremlin Adds Automated Service Discovery to Target Chaos Experiments

Gremlin, a chaos engineering platform organization, recently announced automated service discovery. The new service discovery feature simplifies locating services to target chaos experiments. The web-based interface for the platform also shows a history of experiments performed on each service, including month-over-month activity. This view can help identify gaps in experimental coverage and which services may need more testing. It is also possible to re-run past experiments from this view.

Case Study

Designing and Managing for Resilience

In a recent InfoQ article, Dr. Laura Maguire argued that given the scale, complexity, and speed at which modern IT systems operate, surprises are an inevitable part of managing digital infrastructure. Ongoing innovation, changes in company priorities, and the introduction of new technology into the stack mean that engineers who work on continuously available services are in a constant state of learning and adapting. Because of this, well-calibrated leaders make ongoing, continuous investments in supporting their teams to safely adapt under conditions of uncertainty and time pressure.

Several studies (Allspaw, 2015¹; Grayson, 2018²; Maguire, 2020³) have closely examined how software engineers respond to surprising service outages. While the authors may have stopped short of explicitly calling them resilient practice techniques, they represent classes of strategies used by engineers to sustain resilience. Less studied, however, are the strategies used by engineering leaders to help create the conditions for sustained resilience. This article begins to do that.

"For this article, I had far-ranging conversations with five engineering leaders who work across four organizations of varying sizes and stages—from a securities exchange that launched just last year to one of the world’s most recognizable blue-chip companies. Each leader possesses deep technical expertise accumulated from years spent as an individual contributor.
The interviews were centered around a core series of questions aimed at eliciting stories, examples, and strategies of their approaches toward two aspects of their role: 1) designing an organizational structure to support resilient performance (such as how teams should be structured or how to support coordination with non-engineering business functions), and 2) managing for resilience (the leader’s role in helping engineering teams prepare for, and cope with, surprise events).

The discussions converged into three key propositions for engineering leadership to support resilient performance in their organizations.

Proposition #1: For resilient organizations, think in terms of networks, not just teams.

Proposition #2: Resilient networks depend on active and ongoing grounding across different levels of the organization.

Proposition #3: Resilience depends on learning.

Of course, context matters greatly in the approaches to supporting resilience. What follows in the full article on InfoQ is not intended to be a prescription for leaders, but rather thought-provoking propositions intended to consider how an organization’s current practices and structures may be impeding or enhancing the adaptive capacity of its teams".

This content is an excerpt from a recent InfoQ news item written by Laura Maguire: "Designing & Managing for Resilience".

References

¹ Allspaw, J. (2015). Trade-Offs under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages.
² Grayson, M. R. (2018). Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems (Master's thesis, Integrated Systems Engineering, The Ohio State University).
³ Maguire, L. M. D. (2020). Controlling the Costs of Coordination in Large-scale Distributed Software Systems (Doctoral dissertation, The Ohio State University).

To get notifications when InfoQ publishes content on these topics, follow "resilience", "chaos engineering", and "continuous improvement" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

This edition of The Software Architects' Newsletter is brought to you by: 

 

Leadership’s Cloud-Native Cookbook

DevOps’ great contribution to IT is treating culture as programmable. How your people work is as agile and programmable as the software. Executives, management, and enterprise architects (collectively, leadership) are product managers, programmers, and designers. The organization is leadership’s product, and they should also apply the small-batch process to its creation and growth. They pay attention to their customers (the product teams and the platform engineers) and do everything possible to get the best outcomes, to make the product (the organization) as productive and well designed as possible.

Learn more about this topic with our free O’Reilly eBook 'Monolithic Transformation'.

Upcoming Events

Discover events for senior software engineers by senior software engineers

Discover new technical insights from software leaders pushing the boundaries. Attend QCon Plus software conference this November.

QCon Plus features 12 tracks curated by domain experts to focus on the topics that matter right now in software including Microservices, Cloud Computing, JVM, Security, Software Architecture, Developer Experience and more. Register before July 31st and take advantage of the lowest price for the event ($499).

InfoQ Live on August 17th: Learn how to deploy service mesh into production.

Join Christian Posta, Global Field CTO at solo.io, at InfoQ Live and get practical guidance on how to adopt a service mesh for your organization, including separating the control plane from the data plane, plugging in observability tools, using gateways appropriately, and preparing for troubleshooting and debugging. Book your spot for $19.95.

InfoQ Live on September 21st: Find out how to use containerized applications to improve application speed, reliability, and deployment.

Software development teams working on bringing new ideas to market faster rely on the use of containers and container orchestration. Get valuable insights on how automating the deployment, management, and security of containerized applications can help you scale and respond sooner to customer demand. Book your spot for $19.95.

Senior software developers rely on the InfoQ community to keep ahead of the adoption curve. One of the main reasons software architects and engineers tell us they keep coming back to InfoQ is that they trust the information provided and selected by their peers.

We’ve been helping software development teams adopt new technologies and practices for over 15 years through InfoQ articles, news items, podcasts, tech talks, trends reports, and QCon software development conferences.


The Software Architects' Newsletter, July 2021