Data Replication
Context
Microservices are in use or are planned to be adopted. The data model is decentralized, and the storage per service is isolated to a certain degree.
Data owned by other microservices needs to be present in a microservice so that it can serve its requests.
Problem
- Further service calls would add latency to request processing, leading to a bad user experience.
- Further service calls introduce the need to handle network and timing failures, as well as to deal with the temporary unavailability of communication partners.
Solution
Use a data replication mechanism to integrate the data in an out-of-band manner instead of fetching it on demand.
In most cases, a changelog is the basis of the replication mechanism. The concrete implementation should support the following features:
- The change events can be used to transform the replicated data into the form the microservice needs, so that it can foster its own view on the data (see decentralized data model).
- The delivery guarantees should be clear and well understood, since the handling of network failures is moved to the data integration mechanism.
- All storage technologies that might emerge due to the storage isolation per microservice should be supported.
On a technical level, this can be achieved by using one of the following (a changelog consumer is sketched after the list):
- message brokers (e.g. RabbitMQ),
- distributed logs (e.g. Kafka),
- event feeds (e.g. ATOM feed),
- handover database tables with an offset counter,
- replication mechanisms of databases, or
- dedicated infrastructure frameworks for data replication (e.g. Synapse, see L58; originally Promiscuous).
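To make this concrete, here is a minimal sketch of a changelog consumer that maintains a local, read-only view inside the consuming microservice. It assumes Kafka as the distributed log (accessed via the kafka-python client) and a SQLite table as the service-specific read model; topic name, event schema, and table layout are illustrative assumptions only.

```python
# Minimal sketch: consume change events from a changelog topic and maintain a
# local read model in the consuming microservice (all names are illustrative).
import json
import sqlite3

from kafka import KafkaConsumer  # assumption: kafka-python client is available

local = sqlite3.connect("replica.db")
local.execute(
    "CREATE TABLE IF NOT EXISTS customer_view (id TEXT PRIMARY KEY, name TEXT, city TEXT)"
)

consumer = KafkaConsumer(
    "customer-changelog",                 # changelog topic filled by the leading service
    bootstrap_servers="localhost:9092",
    group_id="order-service-replicator",  # each consuming service tracks its own offset
    enable_auto_commit=False,             # commit only after the local write (at-least-once)
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["type"] == "deleted":
        local.execute("DELETE FROM customer_view WHERE id = ?", (event["id"],))
    else:
        # Transform the event into the form this service needs for its own view.
        local.execute(
            "INSERT OR REPLACE INTO customer_view (id, name, city) VALUES (?, ?, ?)",
            (event["id"], event["name"], event["address"]["city"]),
        )
    local.commit()
    consumer.commit()  # offset commit after the write => replays re-apply idempotent upserts
```

Committing the offset only after the local write yields at-least-once delivery; since the write is an idempotent upsert, re-processing after a crash is harmless. This is one way to keep the delivery guarantees explicit, as demanded above.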
The latency that calling other microservices on demand would add to request processing is pulled forward in time into an out-of-band data fetch and caching step. Depending on the system context, this may or may not increase the total network traffic. However, we did not find indicators that the amount of data that needs to be replicated is technically an issue. Economically, especially in hybrid cloud deployments, the ingress and egress traffic into and out of the cloud should be considered, since it can introduce high costs.
Data replication leads to dealing with stale data and eventual consistency between microservices. However, we recommend questioning how relevant this is, based on the understanding of the domain.
If data replication is applied throughout the whole system, it is important to determine the responsibilities for each part of the data. It has to be clear which microservice is leading on which part of the data. Only the leading microservice is allowed to update the respective data; the other microservices are only allowed to maintain read-only replicas.
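To make the ownership rule concrete, the following sketch shows the write path of a leading microservice: it updates its own store and then publishes a change event that the read-only replicas consume. It assumes RabbitMQ via the pika client and SQLite as the local store; exchange name and event schema are illustrative assumptions, and the simple commit-then-publish sequence shown here still leaves the delivery guarantees discussed above to be addressed.

```python
# Minimal sketch of the leading service's write path (all names are illustrative).
import json
import sqlite3

import pika  # assumption: RabbitMQ is used as the message broker

db = sqlite3.connect("customer_service.db")
db.execute("CREATE TABLE IF NOT EXISTS customer (id TEXT PRIMARY KEY, name TEXT, city TEXT)")

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
channel.exchange_declare(exchange="customer-changes", exchange_type="fanout")

def update_customer(customer_id: str, name: str, city: str) -> None:
    # Only this (leading) service is allowed to write the customer data ...
    db.execute(
        "INSERT OR REPLACE INTO customer (id, name, city) VALUES (?, ?, ?)",
        (customer_id, name, city),
    )
    db.commit()
    # ... and it announces the change so other services can update their read-only replicas.
    event = {"type": "updated", "id": customer_id, "name": name, "city": city}
    channel.basic_publish(exchange="customer-changes", routing_key="", body=json.dumps(event))
```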
Maturity
Proposed, requires evaluation.
Sources of Evidence
L5:
- bounded contexts => data frequently used by a microservice is owned by another
- need to create data sharing and synchronization primitives
- to avoid overhead by data copying happening during service invocations
L12:
- Only one service could update or create the metadata
- other services can have copies of data they don't own
- be careful about synchronization with the master data, as the copy could be stale
L14:
- Context: otto.de
- verticals have redundant data using pull-based data replication
- a vertical can deliver content without accessing other verticals during the request
- Kafka as high-throughput distributed messaging system, in combination with Atom feeds.
- There are features using push notifications => where guaranteed ordering and delivery are not required
L34:
- price for flexibility: redefine data definitions / business rules across services
- data replication, with each service having its own perspective on its data.
- Communication among microservices does not require REST or messaging
- user interface integration may talk to other services and involve data replication instead
- Open challenges: 9. How to deal with replicated info in service-specific databases?
L40:
- need to keep data consistent with eventual replicas/centralized data store
- process vulnerable to exposure, DoS, and alteration attacks
L58:
- same data needed by different services
- data must be replicated across their respective DBs
- doing this in real time with good semantics across DB engines is a challenge; there is not yet a good approach to do so
- Synapse: scalable cross-DB replication system
- published on GitHub: https://github.com/promiscuous-io/promiscuous
- DBs may differ in schema, indexes, layouts, and engines
- transparent synchronization of data
- between heterogeneous data stores (SQL, NoSQL)
- declarative specification of what data to be shared
- uses pub/sub model for sync as abstraction
- with guaranteed delivery semantics
- makes use of ORMs
- hook methods to manipulate data: before/after_create, before/after_update, before/after_destroy (see the sketch after these notes)
- [many more features, but they seem not really relevant to the core of this best practice]
- Implementation details
- at its heart uses a message broker and some semantic sugar to make it appealing
- RabbitMQ
- Bootstrapping required
- Make sure queues don't grow endlessly on failures => kill queue, rebootstrap
- version store => generation number; compensates but slows down subscribers
- Evaluation
- Application overhead not that significant
- Heavy load, one dependency case: adds latency, but within 4-6 ms => not noticeable to end user
- more dependencies => overhead <10 ms with 20 dependencies
- 173ms for 1000 dependencies
- in real applications the number of dependencies remains low enough to avoid causing bottlenecks
- Cross-DB throughput
- ephemerals: scales linearly with the number of workers, reaching over 60k msg/s => no bottleneck
- DBs used to back pubs/subs: linear growth with number of workers until DB saturates
- saturation when DB of slowest pub/sub reaches max throughput
- Publishers make attributes of their data available to subscribers
- subscribers: own local read-only copies of attributes
- can add attributes and publish these ("decorate")
- Supports agile development
- nice to experiment with new features with real data
- subscribe to real production data
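Promiscuous itself is a Ruby library; as a rough analogue of the ORM-hook approach described in these notes, the following Python sketch uses SQLAlchemy mapper events (SQLAlchemy's names for the create/update/destroy hooks), with the broker publish left as a hypothetical stub. None of the names are part of Synapse/Promiscuous.

```python
# Rough analogue of the before/after_* hooks: publish a change event whenever an
# ORM-mapped entity is inserted, updated, or deleted (publish() is a stub).
from sqlalchemy import Column, String, event
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    id = Column(String, primary_key=True)
    name = Column(String)

def publish(change: dict) -> None:
    ...  # hypothetical: hand the change event to the message broker

@event.listens_for(Customer, "after_insert")
def after_insert(mapper, connection, target):
    publish({"type": "created", "id": target.id, "name": target.name})

@event.listens_for(Customer, "after_update")
def after_update(mapper, connection, target):
    publish({"type": "updated", "id": target.id, "name": target.name})

@event.listens_for(Customer, "after_delete")
def after_delete(mapper, connection, target):
    publish({"type": "destroyed", "id": target.id})
```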
L59:
- Need for architecture to include expandable number of data handling nodes
- data should be allowed to be replicated in an eventually consistent way to all nodes
LM45:
- Context: interviews and insights from multiple cases on technologies and software quality in MSA
- C8-S12
- 45 services in 5 domains with subdomains/verticals
- data exchange by replication => guarantees independence
- use Kafka within verticals, but not across domains
Interview B:
- Alternative to data replication:
- always ask other service about the data
- a lot of traffic
- introduces latency
- => dependency at runtime
- use-case over 3 services => all three need to be available to fulfill the use-case
- Service cut: flow within one microservice (or at least the default path)
- e-commerce example: checkout in basket
- default payment somewhere stored
- delivery address
- expectation: all required data is already replicated within the service
- flow can be handled within service
- if the flow deviates from the default path, e.g. a different delivery address
- how does the data get there?
- replication?
- fetching the data on demand?
- => integration patterns / data integration are used
- Replication
- one writing microservice
- multiple reading microservices on the data
- latency for updates => temporary eventual consistency
- not transactional anymore
- How to deal with it?
- is it relevant in my use-case?
- usually a few milliseconds
- look at click paths => not relevant from domain perspective
- but need to make sure that this really successfully happens
- timeouts
- dead letter queue
- => compensation / signal that data is outdated
- gap can often be larger than we expect from domain perspective
- .... forgot it
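As a small illustration of the timeout / dead letter queue point above, the following sketch declares a RabbitMQ replication queue whose expired or rejected messages are routed to a dead letter queue; monitoring that queue is the signal that replicated data is outdated and compensation is needed. It uses the pika client; queue/exchange names and the 30 s TTL are illustrative assumptions.

```python
# Minimal sketch: dead-letter wiring for a replication queue (names illustrative).
import pika

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()

# Exchange and queue that collect messages which could not be processed in time.
channel.exchange_declare(exchange="replication.dlx", exchange_type="fanout")
channel.queue_declare(queue="replication.dead-letter")
channel.queue_bind(queue="replication.dead-letter", exchange="replication.dlx")

# Work queue: rejected or expired messages end up in the dead letter exchange.
channel.queue_declare(
    queue="customer.replication",
    arguments={
        "x-dead-letter-exchange": "replication.dlx",
        "x-message-ttl": 30000,  # treat messages not consumed within 30 s as timed out
    },
)
```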
Interview D:
- Data integration
- can have 0 hops between services during outside-request
- requires supporting service cut
- called self-contained system / independent service architecture
- replicate data => out of band sync
- temporally decoupled data exchange, possibly in batches
- sync on their pace
- Data replication instead of synchronous comm
- avoids problems like omission failure and latency / timing failures in general
- avoids dealing with temporary non-availability
- => makes implementation easier
- recommends for enterprise companies!
- Log for data replication
- local log
- ATOM feed
- Kafka
- DB handover table, other side remembers last processed key
- => order criteria
- Implementation doesn't really matter!
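A minimal sketch of the handover table variant mentioned above: the producing side appends change rows under a monotonically increasing key (the order criterion), and the consuming side remembers the last processed key as its offset. SQLite and all names are illustrative assumptions.

```python
# Minimal sketch of a DB handover table with an offset counter (names illustrative).
import sqlite3

db = sqlite3.connect("handover.db")

# Producing side: the leading service appends change rows; seq is the order criterion.
db.execute(
    "CREATE TABLE IF NOT EXISTS handover "
    "(seq INTEGER PRIMARY KEY AUTOINCREMENT, entity_id TEXT, payload TEXT)"
)

def apply_to_local_view(entity_id: str, payload: str) -> None:
    ...  # hypothetical helper: upsert into this service's own read model

# Consuming side: remember the last processed key and fetch only newer rows.
def poll(last_seen: int) -> int:
    rows = db.execute(
        "SELECT seq, entity_id, payload FROM handover WHERE seq > ? ORDER BY seq",
        (last_seen,),
    ).fetchall()
    for seq, entity_id, payload in rows:
        apply_to_local_view(entity_id, payload)
        last_seen = seq
    return last_seen  # persist this offset so processing resumes correctly after a restart
```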
- Data replication: overhead?
- no problem if approached smartly
- esp. in hybrid cloud: ingress and egress traffic = high costs
- need to pay attention to where data is located, otherwise it can get very expensive
- didn't experience this as a "killer", but as an area of risk
- the bigger problem of the approach is the established way of thinking from the past
- data used to be expensive => "so much data, we don't want this", dismissed before even trying
- technically quite easy to solve
- Trivago: data in Kafka Streams
- everyone can access => early version of a data mesh
- nowadays no data lake, but data mesh
- Classical replication also a possibility
- leading DB (maybe even transactional); and a bunch of read-replicas
- replicas only have a few read-only tables replicated
- replication => no strong consistency, but eventual consistency
- need: which services are leading for which data?
- formerly: one system was leading for all the data it holds
- now: different reconciliation mechanisms for how data is replicated
- the question of which system is leading is broken down to attribute/row level
- e.g. one service leads for orders of the private customer segment, another service for the remaining orders, the logistics system for delivery addresses, and the self-service portal for name and address.
- replication as part of fault tolerance
- factors: which replication mechanism, which storage technology
- improved availability by replicas?
- is the DB eventually or only potentially consistent?
- many DBs in their base configuration are not eventually consistent!
- at least quorum-based reads and writes are needed (see the sketch below)
- otherwise concurrent writes can lead to inconsistencies
- need to have deep understanding there!
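A minimal sketch of quorum-based reads and writes, assuming Cassandra with the DataStax cassandra-driver purely as an example of a replicated store; keyspace, table, and data are illustrative assumptions.

```python
# Minimal sketch: quorum-based read and write so that read and write replica sets
# overlap and concurrent writes do not go unnoticed (names illustrative).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")

write = SimpleStatement(
    "INSERT INTO customer_view (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,  # a majority of replicas must acknowledge
)
session.execute(write, ("42", "Alice"))

read = SimpleStatement(
    "SELECT name FROM customer_view WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,  # quorum read + quorum write => overlap
)
row = session.execute(read, ("42",)).one()
```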
Interview E:
- Context: fault tolerance in frontend
- When something comes from the backend => not that many possibilities to react
- only in very few cases: complete offline functionality: complex, conflicts have to be resolved
- mostly in smaller use-cases, not a complete financial accounting system
- backends for frontends help: own data management
- reply without all other dependent microservices being available
- => cached alternative or as default
- message-based cache updates
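A minimal sketch of such message-based cache updates in a backend for frontends: the BFF keeps its own copy of the data it needs and refreshes it from change messages, so it can reply (possibly with slightly stale values) even when the owning microservice is unavailable. The broker client is stubbed out; all names are illustrative assumptions.

```python
# Minimal sketch: BFF-owned cache that is updated by messages (names illustrative).
import json

price_cache: dict[str, float] = {}  # the BFF's own copy of the data it needs

def on_price_changed(raw_message: bytes) -> None:
    # Registered as a callback with the (stubbed-out) broker client; called per message.
    event = json.loads(raw_message)
    price_cache[event["product_id"]] = event["price"]

def get_price(product_id: str) -> float | None:
    # Serve from the local copy; the pricing service does not need to be available
    # at request time. None => let the frontend fall back to a default.
    return price_cache.get(product_id)
```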
Interview F:
- Context: on-demand data fetching or out-of-band data replication?
- context dependent:
- how old is the replicated data? Is it too old? What happens if it is?
- => if I know the answer to these questions: fetch on-demand or replicate
- how long does it take to get the data? If it takes long, e.g. due to computation, replicate ahead of time
- every service can have their own view on the data
- Context: what if another service is not available?
- data replication: async events to build own data basis
- potentially work with data that is out of date