DDD East Anglia 2019 In Review
On Saturday, 21st September 2019, the sixth DDD East Anglia event took place in Cambridge at Hills Road Sixth Form College.
This is the same location in which the event has been held for the past few years. As in previous years, there were 4 tracks of sessions, but this year the session lengths were varied, allowing for some shorter 30-minute sessions as well as one-hour sessions. This meant 6 sessions in each track throughout the day rather than 5, which was an excellent idea: it allowed for much more variety in the sessions and made this year's line-up better than ever.
After having travelled down to East Anglia the evening before, I only had a short 20-minute car journey to get to Hills Road in readiness for the start of the conference. Although I set off in ample time, some unexpected road works caused a slight delay to my journey, but I made it to the reception desk of the conference just in time.
Unlike in previous years, this year I was actually speaking at the conference! I introduced myself at the reception area and was shown to the speaker's room where I could base myself for the day if I chose. I quickly grabbed a coffee from the adjoining music hall and made my way to the speaker room to collect my speaker's t-shirt and prep myself for the day ahead. Once suitably adorned in the fetching speaker's shirt, I made my way to the first session of the day, as despite being a speaker at one of the sessions, I intended to catch as many other sessions as I could, just as I would do if attending the conference as a non-speaker. I'd decided that the first session for me would be Tom Wright's I was a frog being boiled by CosmosDB.
Tom starts by introducing himself and setting the agenda for the talk. He was previously employed by a company called HeadUp Labs, based in Australia, building a mobile application to help analyse data from various health wearables. Tom's talk is about the issues that he and his team encountered whilst trying to scale Azure's CosmosDB database. He says that the agenda will cover, firstly, dissection of the problem, then practical approaches to solving it, and finally reflection on the final solution.
Tom first introduces CosmosDB and states that it's a multi-paradigm database. It doesn't directly support T-SQL or ANSI SQL, but it does have a SQL-like syntax for querying and can also be accessed programmatically via a MongoDB-compatible API. CosmosDB stores documents, and Tom tells us that a CosmosDB database is effectively a "soup of JSON documents". It has provisioned throughput, meaning you pay for the capacity you reserve. Throughput is provisioned in RUs, which are Request Units per second, and the RU cost of an operation is affected by the size of the JSON documents stored.
Tom shows us his initial architecture for the application and highlights the fact that he has a fairly standard caching layer between the repository and the CosmosDB instance, but that the Azure Functions bypass the cache and talk directly to CosmosDB.
Tom then explains the usage scenarios for the application: new users connecting to the application, the normal flow of traffic during the working day, and a "daily processing" check-in which occurs at 5pm each day. As can be imagined, the 5pm daily check-in for all users caused a huge spike in traffic and was causing CosmosDB to generate 429 (RequestRateTooLarge) errors due to exceeding the throughput provisioned within CosmosDB.
Tom explains that it's normal to have to provision for the top of the largest peak of load, but that this requirement can be reduced by spreading load to flatten the peaks. To spread the load, Tom used several techniques: time zones - ensuring load was offset based on each user's time zone; windows of time - ensuring the 5pm check-ins can actually happen anywhere within a window of time around the 5pm point; cohorts - splitting users into "buckets" and processing each bucket separately; and jittering - applying small random offsets when data is transferred.
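Tom's actual implementation was .NET-based and wasn't shown in detail, but the window-plus-jitter idea is simple to sketch. The following is a minimal, purely illustrative Go version; scheduleCheckIn and the 30-minute window are invented for the sketch:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// scheduleCheckIn spreads a nominal 5pm check-in by picking a random
// point within a processing window and adding a small random jitter,
// so the database sees a smear of requests rather than a single spike.
func scheduleCheckIn(window time.Duration) time.Duration {
    windowOffset := time.Duration(rand.Int63n(int64(window)))
    jitter := time.Duration(rand.Int63n(int64(time.Minute)))
    return windowOffset + jitter
}

func main() {
    for user := 1; user <= 5; user++ {
        fmt.Printf("user %d checks in %v after 5pm\n", user, scheduleCheckIn(30*time.Minute))
    }
}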
We explore some further ways to optimise queries against CosmosDB. Tom says that it may seem obvious, but always use numeric fields when ordering queries: the amount of compute required is far less than for non-numeric properties. Implementing User Defined Functions (UDFs) is another good optimisation technique; these can be embedded inside collections within CosmosDB and referenced by queries, improving load and reducing the RUs used. Tom states that he also introduced his own custom index documents rather than relying on the indexes automatically produced by CosmosDB. This allowed splitting some queries into two smaller, simpler queries rather than having one large, complex query. Although this caused two round-trips to CosmosDB, it actually reduced the overall RU usage, as each query was simpler and easier to fulfil by utilising the custom indexes. The custom indexes were kept up to date by updating the index document after every write operation, ensuring all reads could make use of up-to-date indexes.
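Tom didn't show the schema of these index documents, so purely as a hypothetical illustration, a custom index document might look something like the following Go structure, serialised to JSON into the collection (UserReadingIndex and its fields are invented for this sketch):

package main

import (
    "encoding/json"
    "fmt"
)

// UserReadingIndex is a hypothetical custom index document: a small,
// cheap-to-query document mapping a user to the IDs of their reading
// documents, rewritten after every write so reads stay consistent.
type UserReadingIndex struct {
    ID         string   `json:"id"`
    UserID     string   `json:"userId"`
    ReadingIDs []string `json:"readingIds"`
}

func main() {
    idx := UserReadingIndex{ID: "idx-42", UserID: "user-42", ReadingIDs: []string{"r-1", "r-2"}}
    out, _ := json.Marshal(idx) // error ignored for brevity in this sketch
    fmt.Println(string(out))
}

A query can then fetch this tiny document first and follow up with cheap point-reads for the listed IDs, rather than running one large cross-partition query.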
We next look at Azure Function bindings. Tom states that the reason his Azure Functions had to talk directly to CosmosDB, rather than to the caching layer, was that he couldn't bind to the cache layer from the Azure Functions: at the time Tom was developing the application, there was no ability to do dependency injection in Azure Functions. It was eventually achieved with a third-party library, which provided some success but didn't totally solve the fundamental problems with CosmosDB.
Tom's next step was to try tweaking his retry policies for write requests, essentially continuing to retry again and again in the face of failures. This didn't work either, so the next step was to try to profile CosmosDB itself. Tom found that nothing really existed inside Azure that would provide the level of metrics he was looking for, so he and his colleagues built their own tool for this. After much examination of the resulting data, however, they were no closer to solving the CosmosDB problems.
At this point, Tom and his team decided to try tweaking the default options provided by the .NET SDK to see if this would improve performance. Unfortunately, this didn't help. The next step was to split updates out of the documents that were being saved in CosmosDB. Tom explained that if a JSON document had one specific field (or property) that was considered very volatile (i.e. its value could change more frequently than the other fields on the same document), it would be split out into a separate document. This approach was taken because CosmosDB's "updates" are really a delete and a re-creation of the document, since only complete documents can be written. This provided some improvements, but didn't fully address the underlying issues.
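The exact documents weren't shown, but the split can be sketched with a pair of hypothetical Go types (invented here, not from the talk): the rarely-changing profile and the frequently-changing counter become two separate documents, so each update rewrites only the small, volatile one.

package main

import "fmt"

// UserProfile is the stable document: its fields rarely change, so it
// is rarely deleted and re-created.
type UserProfile struct {
    ID    string `json:"id"`
    Name  string `json:"name"`
    Email string `json:"email"`
}

// UserActivity holds the volatile field split out into its own small
// document; frequent updates now rewrite only this document.
type UserActivity struct {
    ID     string `json:"id"`
    UserID string `json:"userId"`
    Steps  int    `json:"steps"`
}

func main() {
    fmt.Println(UserProfile{"u1", "Ada", "ada@example.com"})
    fmt.Println(UserActivity{"a1", "u1", 9000})
}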
Finally, after scouring through Microsoft's own CosmosDB documentation and re-examining the application code line-by-line, something jumped out. Tom and his colleagues found that they were invoking DocumentClient.OpenAsync() in every query. Reading between the lines of the documentation, it was discovered that this method retrieves the complete address routing table (this is essentially how the SDK knows where to write specific data into specific partitions within the database) from the CosmosDB database every time it's called. Moving the call to the OpenAsync() method out of each query call, and ensuring it's called only once, fixed all of the performance problems. Tom explained that this was never discovered when testing against the test environment, as there was significantly less data in the test database, and thus significantly fewer partitions and a much smaller routing table. It was only when the application was deployed to production, where the data was many times larger, that this performance issue reared its head.
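Tom's fix was, of course, in C# against the .NET SDK, but the underlying principle is language-agnostic: perform the expensive client warm-up exactly once, not per query. Here's a minimal Go sketch of that principle, with newClient and warmUp standing in for the real SDK calls:

package main

import (
    "fmt"
    "sync"
)

// Client stands in for the SDK's DocumentClient; warmUp stands in for
// OpenAsync, which fetches the (potentially large) address routing table.
type Client struct{ ready bool }

func newClient() *Client { return &Client{} }

func (c *Client) warmUp() { c.ready = true }

var (
    client *Client
    once   sync.Once
)

// getClient initialises and warms the shared client on first use only,
// rather than paying the routing-table fetch on every query.
func getClient() *Client {
    once.Do(func() {
        client = newClient()
        client.warmUp()
    })
    return client
}

func main() {
    for i := 0; i < 3; i++ {
        fmt.Println("query using warmed client:", getClient().ready)
    }
}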
Tom concludes by reminding us how we can avoid "being a frog". Trust your gut, don't trust your code. Make sure that if you're testing, you're testing against equivalent code and data loads between environments, and be especially aware of the cause and effect of changes being temporally separated - that is, when the effect of a change is not noticed immediately, but only after the passage of some amount of time. This makes it very hard to track down the exact cause of changed behaviours. Tom has a blog post that fully describes his entire approach to CosmosDB tuning.
After Tom's session, it was time for the first break. I quickly headed to the music room where refreshments were being served and, after grabbing a quick cup of coffee, I headed back into the main building to the speaker's room. After only a short time there, during which I didn't even finish my coffee, it was time for the next session. This timeslot was the first of the day to contain 30-minute sessions. For this one, I opted for Josh Michielsen's Rapid API Development In Go.
Josh starts his session by introducing himself: he's a Senior Cloud Platform Engineer at Condé Nast. He proceeds quickly to introduce the Go language. It's a clear and concise language with a robust standard library, although the language itself is fairly lightweight, with not too much built directly into the language. It's a relatively new language, originally designed at Google. It's garbage-collected and compiles to a single executable file, but most importantly it's designed to directly address the high-performance networking and multi-processing required by today's software applications.
Josh tells us that Go has been adopted by a number of significant software products since its release, including Kubernetes, Prometheus, many HashiCorp products, and Docker.
We start by looking at a really simple Go program:
package main

import "fmt"

func main() {
    friend := "DDD East Anglia"
    fmt.Printf("Hello, %s", friend)
}
Josh explains that Go has type inference, so the friend variable in the code above is inferred to be a string due to its initialised value. Josh also tells us that Go functions can return multiple values:
package main

import "fmt"

func vals() (int, int) {
    return 3, 7
}

func main() {
    a, b := vals()
    fmt.Println(a)
    fmt.Println(b)
}
The above code shows how we can return two integers from a function call and then use multiple assignment to capture each of the two returned values into two different variables.
We then look at one of the powerful features of the Go language, which revolves around concurrency. Go uses things called "goroutines", which are effectively lightweight threads that provide a simple concurrency primitive. There's also the notion of "channels", which are types that provide safe communication between goroutines. We can easily launch a new goroutine using the go keyword:
package main

import (
    "fmt"
    "time"
)

func say(s string) {
    for i := 0; i < 5; i++ {
        time.Sleep(100 * time.Millisecond)
        fmt.Println(s)
    }
}

func main() {
    go say("world")
    say("hello")
}
The above program spawns a goroutine running the function say with the value "world", whilst the same function is called with the value "hello" outside of a goroutine. Running the program, we see multiple copies of both values interspersed with each other, because the goroutine runs concurrently with the direct invocation of the say function from the main() function.
We use channels to communicate data between goroutines:
package main

import "fmt"

func main() {
    messages := make(chan string)
    go func() { messages <- "ping" }()
    msg := <-messages
    fmt.Println(msg)
}
Here, we're creating a channel of type string with a call to the built-in make function and assigning it to the variable messages. The anonymous goroutine sends the value "ping" into the channel, and the main goroutine receives that value from it, in a safe and synchronised manner.

Josh tells us that Go does not contain generics, although this is a feature being considered for a future version of the language. We can, however, approximate generics in some areas, since we can create (for example) a channel of a specific type. In the example above, our channel is created from the built-in chan type and only accepts data of type string. This is effectively the same as if we were able to declare our channel as a generic type, which might look something like chan<string> (in C# syntax).
Josh shows us some simple demos in Go and shows how Go code is compiled down to a single executable binary file. Go binaries can be statically linked and built for different architectures, with cross-platform compilation built in! Josh talks about the fact that Go enables easy web development, as its standard library contains the net/http package, which can be easily imported into any Go application. This provides developers with the ability to implement handlers, which accept the request and response as function parameters, allowing for custom behaviour inside the function and complete control over the response generation.
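As a quick illustration of that handler pattern (my own minimal example, not one of Josh's actual demos), a complete Go web server can be as small as this:

package main

import (
    "fmt"
    "net/http"
)

// hello is a handler: it receives the response writer and the request,
// giving us complete control over the response we generate.
func hello(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello, %s", r.URL.Query().Get("name"))
}

func main() {
    http.HandleFunc("/hello", hello)
    http.ListenAndServe(":8080", nil)
}

Run it and a request to http://localhost:8080/hello?name=DDD returns "Hello, DDD".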
Josh tells us about some useful third-party libraries that can help with the development of web-based applications in Go. There's the Gorilla Toolkit, which provides a number of libraries that simplify web and API development, and the HttpRouter project, which provides a high-performance router that improves upon the default request multiplexer in the built-in net/http package. There are also a few microservices frameworks, such as Go kit and Micro.
Finally, Josh tells us about some great resources for learning more about Go. There's the excellent Go by Example tutorial, which walks through language features one by one, and the official Go website's documentation contains two more excellent resources in How to Write Go Code and Effective Go.
After Josh's talk it was time for another refreshment break. I quickly grabbed myself another cup of coffee and headed back to the speaker's room, as my session was in the very next time slot. My talk was to be An Introduction To Domain-Driven Design and was another one-hour session.
After consuming my coffee and preparing myself and my laptop with the slides I'd need for my talk, I headed the short distance down the hall to the room where I'd be presenting. I set up my laptop, connected it to the projector and prepped myself for the attendees (if any!) that would soon be arriving for my talk.
Well, I can't write up notes here from my own talk, of course, but for those interested, here are my slides from the talk.
After my talk was over, it was time for lunch.
Lunch was served in the adjoining music hall and was the usual brown-bag style. I selected a very nice falafel, houmous and roasted red pepper baguette along with some crisps, a chocolate bar, a snack bar and a soft drink. There was the usual array of lightning talks that often take place over lunchtime; however, I was still on quite a high after delivering my own talk and felt I needed some quiet time to come back down to earth and relax whilst eating, so I headed back to the speaker's room with my lunch in tow.
After lunch was finished, it was time for the first of the afternoon's sessions. There had been a number of submissions to DDD East Anglia on various subjects, and specifically there were a number of sessions on Domain-Driven Design. Luckily, my session was one of those highly voted for, and another session on DDD was also highly voted for by the attendees. This was to be my first session of the afternoon: Ian Russell's Strategic Domain-Driven Design.
Ian starts by stating that his talk is specifically about the strategic Domain-Driven Design patterns and won't cover any of the tactical - which are the more technical - patterns of DDD. He shares the agenda for the talk: we'll first have a quick overview of strategic domain-driven design, followed by a deeper dive into its various elements, and finally a look at some of the advances and developments from the last 15 years of DDD.
We start to look at the strategic patterns and examine domain decomposition. This is how we split an entire business domain into individual parts. We can start by examining company departments to do this, but there's frequently a lot of cross-over of business functionality from one department to another. For this reason, it's important to move towards defining bounded contexts. We look at concepts that can exist in multiple bounded contexts: terms like Customer and Product are very often used in many areas of a business, but they're really different concepts that mean different things to each different area.
Ian talks about the different types of mappings that we can have between bounded contexts. There are a number of different ways in which contexts can relate to each other, from a partnership mapping, where two bounded contexts have a close relationship of equal importance, through to a conformist mapping, where one bounded context is effectively subservient to the other.
Ian summarises strategic DDD by stating that it is all based around having a shared model of understanding of the domain, discovering which are the most important parts of the problem space, and finally being able to effectively transform the conceptual model of the problem space into useful solution-space software.
We move on to look at the advances and developments in the realm of Domain-Driven Design that have happened in the 15 years since the original Domain-Driven Design book by Eric Evans was published. Ian states that many of these developments fall into three main categories: Domain Events, Temporal Modelling and Sociotechnical Architecture.
We look at the CQRS and Event Sourcing patterns. These are often used in a DDD context and bring significant benefits. CQRS is essentially the separation of read models from write models within a system, allowing easier scalability of each different concern; it was originally defined by Greg Young, who had originally termed the pattern DDDD - Distributed Domain-Driven Design. The Event Sourcing pattern captures domain events and intent, allowing the recreation of system state as at any point in time from the stored domain events.

Systems using these patterns will have the concept of Commands. These are requests to perform some specific behaviour, and Commands can fail as well as succeed. If a Command is successful, it raises a domain event, which is the permanent record of the behaviour along with any state mutation that happens as a result. These domain events are persisted in an event store, which is a database specifically designed for the purpose of persisting event streams (one such database is the Event Store product). In order to rebuild the current state (or, indeed, any past state) of the system, we simply replay the events and build what is known as a projection from them, giving us back our domain models. Event stores can also handle things like snapshotting, where we store interim projections for a given model, allowing us to replay only the events since the time of the snapshot in order to recreate state, and the versioning of events, which is a natural occurrence over time.
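Ian didn't show code for this, but the replay idea is easy to sketch. Assuming a trivial, hypothetical bank-account domain in Go, rebuilding state from the event stream is just a fold over the events:

package main

import "fmt"

// Event is a stored domain event; real events would carry timestamps,
// versions and richer payloads.
type Event struct {
    Kind   string // "Deposited" or "Withdrawn"
    Amount int
}

// project replays the stored events to rebuild the current balance; the
// same function can rebuild any past state by replaying only a prefix
// of the stream.
func project(events []Event) int {
    balance := 0
    for _, e := range events {
        switch e.Kind {
        case "Deposited":
            balance += e.Amount
        case "Withdrawn":
            balance -= e.Amount
        }
    }
    return balance
}

func main() {
    events := []Event{{"Deposited", 100}, {"Withdrawn", 30}, {"Deposited", 10}}
    fmt.Println(project(events)) // prints 80
}

A snapshot, in these terms, is just a saved balance plus the position in the stream it was computed from, so future rebuilds only replay events after that position.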
Ian then shows us some new techniques for whiteboard-style modelling with our domain experts. We look at the fairly new technique of Event Modelling, created by Adam Dymitruk. Ian states how Event Modelling improves upon many of the previous techniques for creating strategic models, which were both very difficult to do well and also very abstract. Event Modelling puts the model on a timeline and shows the relevant commands and events and how they relate to actors within the system - i.e. a user requests X of the system, command Y is generated, which results in event Z. We can get started with Event Modelling by first brainstorming all the likely domain events. We then create a timeline for the various user stories and processes, which includes identifying the inputs and outputs of the system as well as mock input screens and wireframes. Finally, we arrange the domain events into logical swim-lanes based upon bounded contexts.
We move on to look at Event Storming, another modelling technique, first described by Alberto Brandolini. This is all about group modelling of domain events along a timeline. Event Storming is a key technique for spreading knowledge between those who have knowledge to give and those who wish to learn. Event Storming sessions contain both domain and technical experts, along with a facilitator who runs the session. Ian says that the facilitator role is vital to keep the session on point and focused. There are a number of different types of Event Storming, but perhaps the most popular is "Big Picture" event storming, which takes a very holistic view of the entire domain and attempts to model the entire end-to-end business via the domain events that occur along the timeline of a typical customer interaction. Event Storming is often the best first approach for getting domain experts involved with domain modelling, as it's very quick and easy to learn and to get started with, requiring only a large wall adorned with paper, some marker pens, a lot of sticky notes, and willing participants. Event Storming models allow a large amount of valuable domain knowledge to be surfaced in a short space of time.
Ian tells us about another fairly new modelling technique called Domain Storytelling. This technique also uses a timeline, but the timeline is defined via the interactions of the various elements within the domain, such as actors (i.e. users), commands, events and behaviours as well as the physical objects that might be found within the domain such as documents, computers, external systems etc. The numbered interactions between the domain elements are annotated to provide the "story" of the domain.
Finally, we round off the session by taking a quick look at sociotechnical architecture. This is a fairly new observation by a number of people, including Nick Tune, who states that:
Organisational politics will be mirrored as architectural complexity.
This is also discussed in the book Accelerate by Nicole Forsgren, Jez Humble and Gene Kim. This book is focused on the techniques that should be used to build and scale high-performing technology organisations, and within the book the authors state:
A loosely coupled software architecture and organisation structure to match is a key predictor of Continuous Delivery Performance & Ability to scale the organisation and increase performance linearly.
After Ian's session was over, it was time for another refreshment break. I felt that I'd had my fill of coffee for the day and so returned to the nearby speaker's room for a short break before heading to the next session of the day. This was another 30 minute slot, and I decided that for this session, I would attend Nokuthula Ndlovu's Emotional Intelligence.
Nokuthula starts by talking about how workplaces were, historically, somewhere where your personal problems were left at the door when you went into work each day. Today, however, the world of work has changed and this is no longer the case. It's said that today's employees bring their "whole self" to work and into the office. As a result, there's much more of a focus on the mental health and wellbeing of staff. Businesses care more than ever about how their employees feel, and it's been shown by a number of studies that our general sense of wellbeing has a direct impact on our productivity at work.
Emotional Intelligence (also known as EI or EQ) is being able to identify and manage one's own emotions as well as the emotions of others, and to use this awareness to manage and alter your interactions with others. Nokuthula asks us to think about how we would feel if she says "Hello and welcome everyone" (which she then proceeds to do in an enthusiastic and bombastic way!) versus how we might feel if she spoke the same words in a much more quiet and subdued way. Our perceptions of her and her demeanour would be greatly affected by the approach she takes. Emotional Intelligence is separate and distinct from our Intelligence Quotient (IQ), which is defined and fixed from birth; our EQ, however, can be developed and improved with effort over time.
Nokuthula tells us that studies have shown that increased emotional intelligence is directly linked to success, and that the World Economic Forum's "Future of Work 2022" report places Emotional Intelligence within the top 10 skills that will be required for, and will define, work in the future.
There's real neuroscience behind EQ. Our brains have a Limbic region, which defines the emotional reactions to external stimuli as well as a Frontal Lobe, which is the intelligence region of the brain. These two regions communicate signals which are generated in response to stimuli. It's the communication between these two areas of the brain that determines an individual's EQ. The Limbic region controls our "fight-or-flight" response. It receives signals 5 times faster than the Frontal Lobe region of the brain, which is the region that allows us a more considered and intelligence-based response to the stimulus. Improving EQ is all about learning and training ourselves to not respond with our first reaction, which is likely driven by the Limbic region, and to allow full processing by the Frontal Lobe region before responding.
Emotional Intelligence has four foundational skills. These are Self-Awareness, Self-Management, Social Awareness and Relationship Management. Nokuthula takes us through each of these in turn. Self-Awareness is our ability to have a clear perception of our own personality. Our strengths and weaknesses. It's also about understanding other people and how they perceive you. We look at some exercises for improving self-awareness. One simple exercise is breathing. Simply close your eyes, slowly breathe in and out. Allow the brain to wander, but bring your thoughts slowly back to thinking "in the moment".
Self-Management is dependent upon self-awareness. This means that the more self-awareness you have, the better you can manage yourself, your thoughts and actions. Self-Management is about our ability to decide whether or not to act in response to a stimulus. It also includes resilience, which is how long it takes us to recover from a "peak" of upset back to normal behaviour. This also extends to our adaptability and how we handle the crises of general day-to-day life. Good self-management means that we're sensitive, can handle stress well, and listen carefully before responding. Verbal outbursts or the radiation of stress to other people are indicative of a need to improve self-management. A good exercise for improving self-management is the "traffic light" exercise, which involves identifying situations and equating them to a traffic light. When we're upset, that's a red light, and so we stop and pause for thought. We then move to an amber light, which requires that we think about how we're feeling, our available options for proceeding and the consequences of each option. Only after this can we proceed to a green light, which allows us to act.
Social Awareness is our ability to understand how we might react in different social situations. Good social awareness means we can not only identify this, but can modify our interactions with other people in order to achieve the best results. There are two paths to social awareness: Empathy - our ability to understand and share the feelings of others, and Organisational awareness - our ability to read the emotional currents and power relationships within a group of people. A good empathy exercise is the "just like me" exercise: we simply continually remind ourselves that others are "just like me" - human beings with feelings, thoughts, fears and desires. An exercise for organisational awareness is the "15 minute tour": we walk around our workplace (or another occasion with a group of people) and simply examine and observe the interactions of each person with those around them.
Finally, Nokuthula tells us about the relationship between stress and performance. A small amount of stress can be beneficial and help to improve performance, however, too much stress can lead to anxiety, panic and ultimately burnout and breakdown. The Yerkes-Dodson curve identifies the varying levels of stress and how we respond to them. Having a high level of Emotional Intelligence can help us to identify when we're subject to too much (or too little) stress and thus enable us to mitigate the negative consequences.
After Nokuthula's session, there was a final refreshment break before the last session of the day. This last session was another hour-long session and for this one, I chose Martin Beeby's Patterns For Resilient Architecture.
Martin starts his talk by telling us that building distributed systems is hard. He says that the first rule of all systems is to remember that failure is normal. We can't entirely prevent failure, but we can use various techniques to help prevent or limit the potential damage caused by failure from impacting our systems too heavily. He shares a quote from Werner Vogels, the CTO of Amazon.com, who says:
Failures are a given and everything will eventually fail over time
Martin shows a picture of some of the core services used by Amazon.com to highlight just how complex distributed systems can be. Amazon.com is considered just one of many customers of AWS.
AWS's core business requirements are availability and reliability. They have to architect their systems such that, even if their own systems fail, their customers don't perceive the failure and the customers' systems continue to work normally. Martin says that AWS's internal systems have to be resilient - that is, able to recover from failures. AWS's internal systems are constantly failing, but because of their resiliency, their customers don't feel it.
Martin talks about how such resiliency can be achieved, and we look at the concept of "availability in parallel". This essentially means that if we can run one thing (a server, a machine, etc.) with a 99% uptime guarantee, then running two or three copies of that same thing in parallel enables us to offer users of the service a 99.99% or even a 99.9999% uptime guarantee, even though each constituent element still only offers 99% uptime: with two independent copies, both must fail at the same time for the service to be down, which happens with probability 0.01 × 0.01 = 0.0001. A 99% uptime guarantee sounds very high, but can actually equate to just over 3 days' worth of downtime in a year. A 99.99% uptime guarantee, provided by running two services in parallel, equates to a maximum of around 52 minutes of downtime per year, and a 99.9999% uptime guarantee equates to only around 31 seconds of downtime per year. Martin talks about performance as well as resilience, and how vital it is for AWS's biggest customers. He shares a quick statistic regarding Amazon.com: if Amazon.com's home page takes even just 100ms more to load, Amazon loses 1% of daily sales.
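As a quick sanity check of those numbers (my own arithmetic, not from Martin's slides, and assuming independent failures), the availability of n parallel copies is 1 - 0.01^n:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Each copy is independently available 99% of the time; the whole
    // service is down only when every copy fails simultaneously.
    const minutesPerYear = 365 * 24 * 60.0
    for n := 1; n <= 3; n++ {
        availability := 1 - math.Pow(0.01, float64(n))
        downtime := (1 - availability) * minutesPerYear
        fmt.Printf("%d copies: %.4f%% up, %.2f minutes down per year\n",
            n, availability*100, downtime)
    }
}

Running this prints roughly 5256 minutes (3.65 days), 52.56 minutes, and 0.53 minutes (about 31 seconds) of annual downtime for one, two and three copies respectively.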
We look at a slide that attempts to put a monetary value on failures and downtime, and we can see that in today's online world, the cost of downtime for certain organisations can be huge.
Martin talks about AWS and its data centres. Spreading customer load over multiple different data centres is a way to both ensure availability in parallel and to mitigate risk by reducing the "blast radius" of failures - a failure's blast radius being defined as how many customers a given failure can affect. The fundamental concern of AWS data centre engineers is how to get sufficient internet, power and water into the building. AWS's data centres are contained inside something called an Availability Zone (AZ), and Availability Zones are contained inside Regions, which are geographic regions on the planet. There are always at least two Availability Zones in a single Region, and always at least two data centres inside a single Availability Zone. The distance between two data centres has to be carefully planned. They're located physically apart, usually by a couple of miles: far enough to provide some resiliency against certain natural disasters, but not so far that it would introduce latency over the fibre connection that links all the data centres in an Availability Zone. Martin states that AWS Regions are not just physically separate from other Regions, but separate in a technology sense too; having no dependencies between Regions helps reliability.
We move on to look at isolation and containment patterns, as these improve resiliency by reducing a failure's blast radius. We look back at how such containment was accomplished in other domains, such as on a ship. Large ships use the concept of bulkheads, which compartmentalise a ship's hull, ensuring that a major leak in one area of the ship can be contained to only a small section of the hull, maintaining overall buoyancy and ensuring the ship does not sink. Martin says it's important for us to design our applications using the same metaphor. Martin also talks about "black swans". These are events that are very rare and hard to predict. We sometimes know them as "poison pills": inputs to our systems that, whilst not malicious, just by sheer chance manage to trigger a specific bug or error in our system. Since it's so hard to predict or architect for such an eventuality, it's all the more important to contain any failures that occur as a result of such inputs.
Martin talks about a technique that's heavily employed at AWS relating to how customers are balanced across services. This is crucial to ensure that load is spread as evenly as possible across all service nodes, but also to ensure resilience in the face of failure. The technique used is "shuffle sharding". One naive way to distribute customers across service nodes is to simply randomly assign each customer to one of the nodes. This works surprisingly well for the most part, but is susceptible to leaving some service nodes underused whilst others get overloaded. It also means that failures in one node can affect a large number of customers: if we had 4 nodes and 100 customers, evenly spread between the nodes, a failure in one node would still affect 25 customers. Shuffle sharding extends the idea of random assignment by dividing the nodes into "shards". When we assign customers, we do so by assigning them to all of the nodes inside a single shard. For the 4-node example, we might have 2 shards of 2 nodes each. Failure of a single node allows requests to continue to be served by the other node in the shard, and complete failure of a shard ensures that other shards remain unaffected.
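Martin didn't show an implementation, but the core of the idea can be sketched in a few lines of Go: hash the customer to a stable pseudo-random subset of the available nodes, that subset being the customer's shard. (This per-customer variant is a slight generalisation of the fixed disjoint shards described above, and is simplified from whatever AWS actually runs.)

package main

import (
    "fmt"
    "hash/fnv"
    "math/rand"
)

// shardFor deterministically picks shardSize distinct nodes for a
// customer out of nodeCount, seeded by a hash of the customer ID so the
// same customer always lands on the same small set of nodes.
func shardFor(customer string, nodeCount, shardSize int) []int {
    h := fnv.New64a()
    h.Write([]byte(customer))
    r := rand.New(rand.NewSource(int64(h.Sum64())))
    return r.Perm(nodeCount)[:shardSize]
}

func main() {
    for _, c := range []string{"alice", "bob", "carol"} {
        fmt.Printf("%s -> nodes %v\n", c, shardFor(c, 8, 2))
    }
}

A failed node now only affects the customers whose shard happens to contain it, and each affected customer still has the other node in their shard to fall back on.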
Next we look at the idea of "immutable infrastructure". This is the notion that once a service is deployed into production, it never gets modified. Should modifications need to occur - for example, to install operating system or application updates - we create a whole new copy of the service from a base instance or template, and this new copy completely replaces the old. The benefits of immutable infrastructure include more consistency and reliability in your infrastructure and a simpler, more predictable deployment process, and it helps to mitigate or prevent common infrastructure issues such as configuration drift. AWS offers the AWS Cloud Development Kit, an open-source software development framework allowing the modelling, configuration and provisioning of infrastructure using an "infrastructure-as-code" approach, leveraging some popular programming languages.
Finally, Martin takes a look at Chaos Engineering. This is the concept of explicitly and deliberately injecting failure into your systems to ensure that the overall resiliency of the system is not compromised. This was very famously done at Netflix with their Simian Army set of tools, some of the developers of which eventually left to launch their own company, Gremlin, offering "failure as a service". Such an approach to ensuring resiliency may appear quite extreme, but when you're operating at the scale of AWS and similar providers, whose entire business is based upon an almost uninterrupted guarantee of service, chaos engineering can be a very efficient way to discover resiliency issues.
After Martin's talk was over, it was time for the final wrap-up and prize-giving. All the attendees gathered in the main room whilst the sponsors were able to each give a brief talk before awarding prizes that they'd been raffling throughout the day. The organisers were able to thank the sponsors, the venue and the attendees for making this year's DDD East Anglia a roaring success and also awarded some prizes of their own.
There was a post-conference meetup in the pub and a meal at a local restaurant; unfortunately, however, I had a long drive back to the North-West to contend with, and so had to set off for home immediately after the conference wrapped up. It really was another superb DDD East Anglia and, although I always say it, I really am looking forward immensely to next year's event.