Shutting the gate on entity updates
Extending a “locking” primitive to allow updates on given primitives in ${N} directions
Recently I had the chance to work with the team at Global Fashion Group in Berlin to help them work on shared services across ventures such as the Iconic, Zalora, Dafiti and Lamoda. This business operates at several orders of magnitude more scale and has grown aggressively in its fairly short time to become among the largest fashion companies in the world.
This has been an interesting experience and one that, on the whole, I am enjoying immensely. With the new job comes new technical challenges, one of which is the challenge of “scale”.
The platform pushes through an obscene number of orders, and has largely scaled out by applying a “sharding” pattern: doing the authoritative reads and writes in the same region the sales are executed. This has scaled quite well, but traffic demands per region are now high and show no sign of slowing down. The software is being re-architected to support this much higher demand.
Scaling Further
One of the primary ways in which we can ensure a responsive service is by identifying types of work that do not need to be accomplished immediately (i.e. within ~1s) but can instead be accomplished “in the background”. These sorts of work may be reconciliation, complex validation, coordination with other services and so forth.
There are a few patterns for this:
Queues (RabbitMQ)
Broadly, a queue receives some notification of change or a specific command to execute, and the work is done when that item reaches the front of the queue. It is typically extremely cheap to write to the queue and the queue can be configured to process work at any given rate.
However, while work sits in the queue it is in an “unknown” state. It is “in progress” and remains “in progress” for some period of time, until it resolves.
If the updates that are in the queue do transformations on any given entity and the ordering across multiple queues in the system is not kept intact, the system can reach a state that the user consuming the system did not anticipate.
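To make the ordering hazard concrete, here is a minimal sketch in Go. The `Update` type and `apply` helper are made up for illustration and have nothing to do with RabbitMQ's API; the point is only that the same two updates produce different results depending on delivery order.

```go
package main

import "fmt"

// Update is a hypothetical transformation queued against an entity's email.
type Update func(email string) string

// apply runs the updates in the order they are delivered.
func apply(email string, updates ...Update) string {
	for _, u := range updates {
		email = u(email)
	}
	return email
}

func main() {
	setWork := func(string) string { return "work@example.com" }
	setHome := func(string) string { return "home@example.com" }

	// The user changed their email to "work" and then to "home".
	fmt.Println(apply("old@example.com", setWork, setHome))
	// If the queues deliver the updates the other way around, the stale value wins.
	fmt.Println(apply("old@example.com", setHome, setWork))
}
```

Whichever update is delivered last wins, regardless of what the user did last.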
Distributed Commit Log (Kafka)
A commit log is conceptually similar to a pub/sub queue but items are never really “consumed” from the queue. Rather, the commit log simply tracks the command or state change and leaves the vast majority of the work to the consumer.
In this sense it is much like the MySQL binary logs — they might be read or they might not, but MySQL doesn’t care.
This shares many of the same problems as the queue, though it allows keeping application state much more easily as it enforces order (under specific conditions).
Regardless, the update is in “no man’s land” until it is processed by the consumer.
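As a rough illustration of the “never really consumed” idea, the sketch below models an append-only log in memory. The `Log` type and its methods are hypothetical and are not Kafka's API; each consumer simply remembers its own offset and reads from there.

```go
package main

import "fmt"

// Log is a hypothetical in-memory, append-only log.
type Log struct {
	entries []string
}

// Append records an entry and returns its offset in the log.
func (l *Log) Append(entry string) int {
	l.entries = append(l.entries, entry)
	return len(l.entries) - 1
}

// ReadFrom returns a copy of every entry at or after offset.
// Reading never removes anything from the log.
func (l *Log) ReadFrom(offset int) []string {
	if offset < 0 || offset >= len(l.entries) {
		return nil
	}
	return append([]string(nil), l.entries[offset:]...)
}

func main() {
	log := &Log{}
	log.Append("person.created")
	log.Append("person.email_changed")

	// Two independent consumers, each tracking its own offset.
	fmt.Println(log.ReadFrom(0))
	fmt.Println(log.ReadFrom(1))
}
```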
Reconciliation (Control Loop)
Comfortably my favourite way of shifting work to the background is via a reconciliation loop.
Both the commit log and the queue presume that the ${WORK} to introduce the state change will be expensive. Indeed, that is correct by default — “dropping a queue” in front of a service is a simple solution to make long running work reliable, but it is not the only solution.
Instead, we can introduce a small state change to the object itself and let a control process read that state change and apply the larger, more expensive state change to the entity.
This is commonly implemented in cache structures and is known as “marking an entity dirty”; the reconciliation loop later updates the cache. Alternatively it is also present in Kubernetes in the “Node Condition”, in which the node will mark itself as “Ready=False” and the control loop will shift work off that node such that it can recover.
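A minimal sketch of the “mark dirty, reconcile later” idea, written in Go. The `Entry` type, its fields and the `Reconcile` function are all invented for this example; a real control loop would run `Reconcile` on a timer or in response to change notifications.

```go
package main

import "fmt"

// Entry is a hypothetical cached entity with a cheap "dirty" marker.
type Entry struct {
	Source string // authoritative value
	Cached string // derived value that is expensive to compute
	Dirty  bool   // set by the cheap write, cleared by the control loop
}

// Write performs the cheap state change: store the value and mark the entry dirty.
func (e *Entry) Write(v string) {
	e.Source = v
	e.Dirty = true
}

// Reconcile is one pass of the control loop. It does the expensive work only
// for entries that are marked dirty and returns how many entries it fixed.
func Reconcile(entries []*Entry, expensive func(string) string) int {
	fixed := 0
	for _, e := range entries {
		if !e.Dirty {
			continue
		}
		e.Cached = expensive(e.Source)
		e.Dirty = false
		fixed++
	}
	return fixed
}

func main() {
	e := &Entry{}
	e.Write("hello") // fast path, returns immediately
	n := Reconcile([]*Entry{e}, func(s string) string { return "rendered:" + s })
	fmt.Println(n, e.Cached, e.Dirty)
}
```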
Opening closed gates
Note: Code samples are written in “protobuf”; see the protobuf docs for details. The examples are deliberately made up and do not reflect real world business conditions.
Consider a “person” entity:
// A person entity
message Person {
string name = 1;
string id = 2; // Unique ID for this person (a UUID).
string email = 3;
}
We can create a CRUD API for this person fairly easily and there should not be any scaling problems; it's nice and cheap to add this person to a table in MySQL.
However, consider the business requirement “only users associated with ${VENTURE}”. That will require making a call out to the person service of that ${VENTURE}, which might be around the world! That might take seconds. Enough of these requests and we stack up to minutes and nasty scaling concerns.
However, the person doesn’t need to be added straight away. They also need to do email validation and so forth. So, we do not have to reply straight away with our new user; we can simply write the user in and mark it as “some work required”. Practically the definition might look like:
// A gate entity
message Gate {
// The type of change that needs to be completed before the person is "good to go"
string name = 1;
// The transaction in which the gate was created
string txn_id = 2;
}
// A person entity
message Person {
string name = 1;
string id = 2;
string email = 3;
// What blocks the user
repeated Gate gates = 4;
}
Which can be created by:
{
"name": "Andrew Howden",
"id": "c78a033c-a7ae-11e9-a71e-e7787f36f4f1",
"email": "my-email@gfgtech.com",
"gates": [
{
"name": "IsVentureMember",
"txn_id": "f1fe8db8-a7ae-11e9-9620-7beb7294ae9b"
},
{
"name": "IsEmailApproved",
"txn_id": "f1fe8db8-a7ae-11e9-9620-7beb7294ae9b"
}
]
}
These gates can be quickly and easily checked by business facing code before it takes action.
For example, when that user tries to “log in” to the service the business logic might say:
// Refuse the login while the email approval gate is still closed.
if user.hasGate("IsEmailApproved") {
	return fmt.Errorf("user '%s' cannot log in, is gated by '%s'", user.Name, "IsEmailApproved")
}
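For completeness, here is a self-contained sketch of that check. The Go structs loosely mirror the protobuf messages above (with `ID` as a string, to match the UUID in the JSON example), and `HasGate` and `CheckLogin` are names I have made up for illustration.

```go
package main

import "fmt"

// Gate loosely mirrors the protobuf Gate message.
type Gate struct {
	Name  string
	TxnID string
}

// Person loosely mirrors the protobuf Person message.
type Person struct {
	Name  string
	ID    string
	Email string
	Gates []Gate
}

// HasGate reports whether a gate with the given name is still closed on the person.
func (p Person) HasGate(name string) bool {
	for _, g := range p.Gates {
		if g.Name == name {
			return true
		}
	}
	return false
}

// CheckLogin is hypothetical business logic: it refuses the login while the
// email approval gate is present and allows it otherwise.
func CheckLogin(p Person) error {
	const gate = "IsEmailApproved"
	if p.HasGate(gate) {
		return fmt.Errorf("user '%s' cannot log in, is gated by '%s'", p.Name, gate)
	}
	return nil
}

func main() {
	p := Person{Name: "Andrew Howden", Gates: []Gate{{Name: "IsEmailApproved"}}}
	fmt.Println(CheckLogin(p)) // gated, so an error is returned
	p.Gates = nil
	fmt.Println(CheckLogin(p)) // no gates left, so the login is allowed
}
```

The check is a scan over a short list held on the entity itself, so it needs no call to another service.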
In the background we can have a control loop whose only responsibility is to watch for changes to the person entity and, when it sees a condition that it can help with, reconcile that condition.
The control loop will either conclude that the user is fine, and thus remove (open) the gate:
// Path is in `jq` syntax.
[
{ "op": "remove", "path": ".gates[] | select(.name == \"IsEmailApproved\")" }
]
Or, alternatively, it can swap in a new gate that permanently marks this user as “bad”:
[
{ "op": "remove", "path": ".gates[] | select(.name == \"IsEmailApproved\")" },
{ "op": "add", "path": ".gates[]", "value": { "name": "UserInvalid", "txn_id": "04084006-a7b6-11e9-82c4-130b6d87b05d" }}
]
The user now has the gate “UserInvalid”, which clearly indicates there’s something a bit funny going on. Another process can watch for users who enter this condition, notify them via email and then delete them.
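The two outcomes can be sketched as a single reconciliation step in Go. This is only an illustration: `ReconcileEmail`, `withoutGate` and the `approved` callback are hypothetical names, and the real check would be whatever validates the email.

```go
package main

import "fmt"

// Gate and Person loosely mirror the protobuf messages above.
type Gate struct {
	Name  string
	TxnID string
}

type Person struct {
	Email string
	Gates []Gate
}

// withoutGate returns a copy of gates with every gate of the given name removed.
func withoutGate(gates []Gate, name string) []Gate {
	kept := make([]Gate, 0, len(gates))
	for _, g := range gates {
		if g.Name != name {
			kept = append(kept, g)
		}
	}
	return kept
}

// ReconcileEmail is one step of a hypothetical control loop. If the person is
// gated by IsEmailApproved it runs the check, then either opens the gate or
// swaps it for UserInvalid. It reports whether it acted on the person.
func ReconcileEmail(p *Person, approved func(email string) bool, txnID string) bool {
	const gate = "IsEmailApproved"
	remaining := withoutGate(p.Gates, gate)
	if len(remaining) == len(p.Gates) {
		return false // not gated by this condition, nothing to do
	}
	if !approved(p.Email) {
		remaining = append(remaining, Gate{Name: "UserInvalid", TxnID: txnID})
	}
	p.Gates = remaining
	return true
}

func main() {
	p := &Person{Email: "my-email@gfgtech.com", Gates: []Gate{{Name: "IsEmailApproved"}}}
	ReconcileEmail(p, func(string) bool { return false }, "04084006-a7b6-11e9-82c4-130b6d87b05d")
	fmt.Println(p.Gates)
}
```

Gates owned by other processes, such as “IsVentureMember”, are left untouched by this step.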
Gate Limitations
All that said, there are some limitations to using the “gate” primitive to track state changes in a system.
Indefinite
Unlike queues or a commit log, which have some notion of expiry, a gate generally does not. This means that if a system creates a gate and then later discards that type of gate, there is nothing there to “clean up” that gate.
To address this, gates should include the notion of “expiry” — that is, when a certain time has passed that gate can be dropped. Further, when a gate opening (removal) is attempted the expiry should be updated — regardless of whether the open was successful.
This must be used carefully, and only depending on the nature of the work. It might be suitable for certain kinds of validation or for third party systems, but not suitable to prevent a user from logging in.
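One way the expiry could look, as a rough Go sketch. The `ExpiresAtUnix` field, the `Touch` and `DropExpired` helpers and the gate names are assumptions for this example and are not part of the Gate message defined earlier.

```go
package main

import (
	"fmt"
	"time"
)

// Gate carries a hypothetical expiry, stored as Unix seconds.
type Gate struct {
	Name          string
	ExpiresAtUnix int64
}

// Touch pushes the expiry forward. It is meant to be called on every attempt
// to open the gate, whether or not the attempt succeeded.
func (g *Gate) Touch(nowUnix, ttlSeconds int64) {
	g.ExpiresAtUnix = nowUnix + ttlSeconds
}

// DropExpired returns only the gates whose expiry is still in the future.
func DropExpired(gates []Gate, nowUnix int64) []Gate {
	kept := make([]Gate, 0, len(gates))
	for _, g := range gates {
		if g.ExpiresAtUnix > nowUnix {
			kept = append(kept, g)
		}
	}
	return kept
}

func main() {
	now := time.Now().Unix()
	gates := []Gate{
		{Name: "IsThirdPartyChecked", ExpiresAtUnix: now - 60}, // nobody handles this any more
		{Name: "IsAddressValidated", ExpiresAtUnix: now - 60},
	}
	// A reconciler just attempted to open this gate, so its expiry moves forward.
	gates[1].Touch(now, 3600)
	fmt.Println(DropExpired(gates, now))
}
```

A gate that some process is still trying to open keeps getting refreshed, while a gate nobody handles any more eventually ages out.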
Undocumented API
Without care, gates can become an undocumented API. While third party systems should be allowed to read and write gates, systems should not depend on gates maintained by a third party unless that gate is described in some official documentation or specification.
Unusual
To my knowledge, the gate primitive has never before been expressed in this way. It is similar to a lock, but multi-dimensional. Accordingly it may take some time to get used to.
Gating all the things
Gates are fairly easy to extend, as they’re essentially a “dirty” indication with a direction.
Given this, they pose a few unique benefits:
Cheap
Gates make for cheap writes. Gates should be extremely quick to read and write, and they capture the idea that there is “processing” happening on a given entity, but that the processing is not yet complete.
This means nice and fast APIs, and a good model for doing work in the background.
Extensible
Gates can be extended over time to include an almost arbitrary number of different processes. Given entities that are complex and might require several layers of validation or transformation (for example, a product) a gate allows each process to maintain its own gate, and for those processes to be executed in parallel.
It even allows third party systems to insert their own gates, and for the application to be “blind” to what the third party system requires.
Reapplicable
Gates can go from being “not applied” to being “applied” as a result of other system events.
For example, when a person is updated they might have a whole set of gates applied again, indicating that the updated person should be put through validation once more with their new data.
Idempotent
Gates are idempotent. They can be applied or deleted in sequence, and there is no harm to “reapplying” an existing gate.
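A small Go sketch of what that idempotency might look like in practice. `ApplyGate` and `OpenGate` are hypothetical helpers; here a reapplied gate keeps the original `TxnID`, which is a design choice of this sketch and not something the model requires.

```go
package main

import "fmt"

// Gate loosely mirrors the protobuf Gate message.
type Gate struct {
	Name  string
	TxnID string
}

// ApplyGate closes a gate. If a gate with the same name is already present
// the slice is returned unchanged, so reapplying is harmless.
func ApplyGate(gates []Gate, g Gate) []Gate {
	for _, existing := range gates {
		if existing.Name == g.Name {
			return gates
		}
	}
	return append(gates, g)
}

// OpenGate removes a gate by name. Opening a gate that is not present is a no-op.
func OpenGate(gates []Gate, name string) []Gate {
	kept := make([]Gate, 0, len(gates))
	for _, g := range gates {
		if g.Name != name {
			kept = append(kept, g)
		}
	}
	return kept
}

func main() {
	var gates []Gate
	gates = ApplyGate(gates, Gate{Name: "IsEmailApproved"})
	gates = ApplyGate(gates, Gate{Name: "IsEmailApproved"}) // reapplied, still one gate
	fmt.Println(len(gates))
	gates = OpenGate(gates, "IsEmailApproved")
	gates = OpenGate(gates, "IsEmailApproved") // already open, nothing happens
	fmt.Println(len(gates))
}
```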
In Conclusion
After joining the new company I have found a set of new and interesting challenges. One of them is giving our partners a quick and easy system that makes their work much easier, and shifting as much of the “heavy lifting” as possible into the background, away from user interactions.
Gates provide one mechanism to do this, and a mechanism that has some unique benefits (as well as some costs). They might be useful to you, and to others in the organisation as a model for entity state management.
Thanks for your time ❤