Kafka Event Streaming to the Rescue! Reflections from building a global FX platform
Our engineering culture is richer for having embraced an event driven architecture in several unexpected ways.
The use of event streaming improved our engineering agility whilst helping to stabilise production by providing a more observable and supportable platform. Much to our surprise Apache Kafka has helped empower innovation and enhance developer experience.
The Airwallex foreign exchange (FX) platform recently celebrated its second birthday. Over these years, we have successfully established a distributed FX pricing and settlement capability composed of more than 60 microservices (Kotlin + Kubernetes + Kafka) which are distributed across three regions and two cloud providers.
We run four self managed, cloud deployed Kafka clusters, each of which contains three brokers. Our busiest cluster comfortably handles a peak daily load which is typically in excess of 1100 events per second.
On reflection, the choice to pursue event streaming as a foundation to drive our system evolution was a good one. The obvious advantages of reduced application coupling and observability — that justified the original decision — have also been complemented by some unexpected, somewhat accidental, positive discoveries.
Improvements have been realised across the developer experience, platform stability as well as our groups’ engineering agility. I believe a reduction in engineering friction and support for the parallelisation of non-critical workflows has helped to foster a culture of innovation. But more on this later.
Adoption of Apache Kafka
The moment of truth had arrived… But what was the truth? Unfortunately, we hadn’t designed our system to tell us. Event Streaming to the Rescue!
The ‘moment of truth’ is a reference to the maiden deployment of our FX pricing application. At this point, we were yet to adopt an event streaming approach across all corners of our stack.
Initially, our application was dependent on a cocktail of disparate tooling; rabbit MQ for inbound price consumption, postgres powered state management, hazelcast coordination across instances… and redis.
Notably, these tools were used on the critical path for the distribution of prices across our business.
Our initial services were burdened with multiple infrastructural dependencies
As a result, there was a high cost to developing against, configuring, maintaining and supporting many infrastructural dependencies.
Development environments and test automation pipelines were overly burdened, and our engineers were faced with a significant technical surface area to master. In pursuit of a lighter approach, we turned to Kafka .
Event streaming seemed like a natural fit within the FX domain for the movement of money across borders; a client transaction can be modelled as an ‘event’ and conveyed through a series of transformations and aggregations resulting in some physical settlement.
A persistent event store seemed like a good fit for this use-case. But adopting Kafka proved to be liberating in ways we hadn’t anticipated.
1. Developer Experience
A persistent event store is very compelling. Kafka presents a powerful collection of features that support a diverse ecosystem of use-cases. With the adoption of Kafka we now had;
a natural replacement for our rabbit event queue
a simple alternative to our messy postgres state rehydration (via offset rewinding)
convenient state sharing across components (multiple consumer groups and state stores)
a partitioning scheme to support coordination and convenient horizontal scaling, without the need for hazelcast orchestration — and without a distributed lock in sight. 🎉
Kafka became a universal dependency, helping to simplify our application footprint by eliminating extraneous points of failure from our critical business workflow. As such, managing a single dependency reduced burden from both unit and integration testing. It has opened the door to reusable patterns and libraries (we’d love to share more on these another day!) and also proved to be effective in helping us scale across the globe.
Kafka is a jack of all trades but is also a platform that can support the mastery of a variety of different use-cases; from state sharing to horizontal scaling, event queue to replayable cache.
Disclaimer; that’s not to say Kafka is always the right tool for the job! We have made more than a few mistakes over the years and I say this as the author of an expensive ‘exactly once’ event-driven RPC alternative. This proved to be less than efficient and somewhat fragile when used to service a blocking client web request.
We have learnt the hard way; a database and traditional RPC / Rest API still have their place. Where it’s suitable, on the outer edge of our platform far away from user critical API services, we still use postgres to persist (sink) our Kafka events for long term persistence.
Event streaming yields loose coupling. By modelling the treatment of business events as a series of independent transformations, our platform evolved into an asynchronous functional pipeline.
Our platform has evolved into a functional pipeline of event transformations
Each stage of the pipeline is typically handled by a Kafka consumer/producer packaged as an independent microservice. State is carried through the pipeline in the form of events that also act as notifications to trigger subsequent stages of processing.
This ‘state on the outside ’ approach also leads to improved ‘pluggability’. Modelling business processes as lightweight event transformations made our FX platform inherently amenable to change.
The performance and stability of our APIs has greatly benefited from our use of Kafka. API latency for our query capabilities is 1ms on average. In addition, write operations to our API have a latency equivalent to that of local Kafka ~10ms on average for a single broker acknowledgement.
The reuse of events and independent offset management across multiple consumer groups, opened the door to experimentation. We have been able to trial new techniques and business workflows off the critical path in a non-destructive way.
Reducing the friction associated with building new or evolving existing pipelines has ultimately helped innovation.
Ideas can be trialled safely and beta versions incrementally improved and incubated until they have proved successful. Bad ideas can be unplugged and associated events left to be garbage collected by a Kafka broker.
Dynamic configuration and multiplexing data replication frameworks are examples of critical application infrastructure that were first trialled, and subsequently evolved off the critical path, as lightweight event streaming accessories.
Our observability and control planes have also benefited heavily from a low cost intrinsically parallel method of plugging into a common event stream.
Tools like kPow , which interface directly with the Kafka broker via vanilla consumer and admin APIs, have taken our observability tooling to the next level. We can now monitor, search for, inspect, replay, and export our ‘moment of truth’ in real-time.
The externalisation of application state into an event stream has opened the door to a convenient method of distributing our system across geographies.
Effectively, we teleport application state between remote clusters to support parallel instances of our platform.
This pattern has worked well to globally distribute our gateway API layer.
State is replicated to remote clusters which power our global API distribution
Kafka topics can be replicated into multiple remote clusters concurrently, through the use of tools such as MirrorMaker.
We follow a spoke-hub (or satellite / HQ) approach, where our read APIs (e.g. live pricing) are powered from state — in the form of an event stream — that fans out from our central HQ brain. In addition, our write APIs (e.g. trade capture) produce events that fan into HQ from the satellite clusters.
Notably, the cost of this approach in terms of WAN latency and network distribution is effectively internalised by the mechanics of event replication. This means we can deliver fast API response times across all remote regions with insulation from cross-regional network outages.
Note: the management of event replication for this use case is non-trivial and a topic that justifies a separate post.
Pluggability breeds composability
The capability offered by each remote region can be customised to fit the use-case for a particular geography. This is done by selecting from our menu of event transformations and API microservices which are all built against our universal event platform, Kafka.
Locally, our system behaviour is composed of a symphony of micro-event operations. Globally, our capability is made up of a collection of independent platform deployments that collaborate through the medium of event replication.
As Airwallex continues to scale into new markets, products and regions, we have benefitted from the agility inherent in composable event-driven architecture.
From opaque to observable, monolith to multiverse, coupled to composable; event streaming has dramatically impacted the way we approach engineering challenges.
Developer experience has benefitted from reduced friction across test, integration and deployment phases. Loose coupling and state externalisation have supported convenient horizontal and multi-region scaling, while event reuse has opened up a low-cost channel to empower innovation.
Ultimately, the use of event streaming techniques has improved our engineering agility, whilst helping to stabilise our production environment by providing a more observable and supportable platform.
Our engineering culture is now richer for having embraced an event-driven architecture.
Written by Alex Hilton Engineering Lead at Airwallex