Decentralised Web Backend
Serverless, typed graph of future internet systems
It sometimes seems like the Web got it all wrong. Centralised services control our data, and traffic is routed through large corporations and walled gardens. The future looks grim, with privacy issues and ever more dependence on closed clouds. However, I'd argue that in the coming years we might see a transition, though not with privacy and anonymity as the main drivers. These could be by-products of other forces — those related to cost reduction, reliability, ease of development and financial incentives.
TL;DR: Jump to summary.
The background
- Product: “Backend, we need this service to get some mapping data!”
  Dev: “OK… we need to create an account and generate an API key. Is there an SDK for <language> or do we have to use HTTP directly? Right, we got some results, now we need to expose them to the other service somehow…”
- Product: “Frontend, we need the user to see a map on the page here!”
  Dev: <pastes maps integration snippet>
This is an exaggerated situation (the frontend integration would also need an API key and some configuration), but it tries to illustrate a point. I think there is currently a bit of a gap in tooling between frontend web development and its backend, server-side counterpart. The UI can take advantage of an amazing content delivery platform and rendering engine that is the modern web browser, together with its decent built-in dev tools. Fast-paced innovation and experimentation in the JS frameworks space is a sign of a thriving platform. Mobile development is in a somewhat different situation, but it could be at least partially absorbed by the Web in the near future.
Meanwhile, the backend development community is fragmented. Many different languages, unable to seamlessly interoperate, try to solve each other's syntax issues while constantly struggling to keep pace with the growth of the internet. Some open, unifying activity is happening in the infrastructure layer (Docker, Kubernetes, some databases and queues), but these components can talk to each other at most via HTTP or TCP. Innovation, for the most part, remains behind the walls of the three major cloud providers.
Additionally, coding itself (be it frontend or backend) feels quite disconnected. IDEs use the web just to fetch dependent modules/packages (obviously every platform must have its own package manager), which are then used offline.
Of course there are quite obvious reasons for this situation. Until recently, there was really no “frontend”. There were monolithic desktop applications, used by one user at a time. Suddenly, the Web was here and the same toolchains were expected to handle millions of internet users. Classic languages found themselves in a connected world. Sure, some of them were created together with the Web (PHP, Java, Golang), but most were built to solve a specific problem earlier.
But I think in the next decade the internet will become an operating system for the backend too (and quite possibly, for society itself). The Decentralised Web is already in the thoughts of many developers, expressed in projects like IPFS and BitTorrent. But the tide might turn thanks to a perhaps unexpected newcomer, the blockchain. It has a unique ability to capture both minds and funding for its own growth.
But this post is not about the blockchain itself. I'm not going to mention any blockchain project names (it's relatively easy to find which projects might be players in this transition), just show a view of what could be possible.
The code
NOTE: this is highly speculative. The following examples and narrative just try to build a picture of a kind of infrastructure likely to emerge.
Let's start with a block of code from the 2020's or 2030's (with a note on timescales), written in some imagined C-family language. It's just one of many, as it would of course compile to some “CloudAssembly” machine language, so that the human-readable code can match anyone's taste.
// get a static module from the IPFS network
import someLogic from 'ipfs://bar.foo/some-logic-v3'

// get type information from github
import User from 'ipp://github.com/api-graphql/objects/User'

module Foo {
  // the map is permanently stored in a distributed key-value database
  persistent people: Map<string, User>

  // Graphql github endpoint
  // The protocol will make sure we have the required roles and permissions
  github: graphql at 'ipp://github.com/api-graphql'

  function syncUser(id: string) {
    // using the graphql api
    user: User = await github.user(id) { name, email }
    await people[id] = user
  }
}
Starting with the imports at the top:
import someLogic from 'ipfs://bar.foo/some-logic-v3'
This allows us to access any immutable CloudAssembly module via IPFS or some other protocol for read access to static code. It also makes our IDE fetch the corresponding data schema, so autocomplete is at hand for any fields or functions from the module. By default, logic imported this way runs on the same infrastructure as the calling function.
RPC
Next, we are going to use dynamic API execution. First, we need to reference a type (schema of fields and methods) managed by Github.
import User from 'ipp://github.com/api-graphql/objects/User'
The Interplanetary Protocol (IPP) will make sure we have the required roles and permissions in Github to access this data schema and some of the corresponding GraphQL methods. A global identity and permission system is one of the most important parts of IPP, strongly inspired by those developed in the first cloud systems. When using IPP, every call on the public web has an identity bound to it, where the receiver can look up its roles, reputation and claims, but only to the extent the caller allows up-front, for a given invocation.
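To make that concrete, here is a purely imagined sketch of how a caller might scope what a single invocation reveals; the `with identity` clause and the `reveal`/`hide` fields are my assumptions, not anything defined by the example above:

// hypothetical syntax: the caller decides up-front what this invocation discloses
user = await github.user(id) with identity {
  reveal: [roles.github.reader, reputation.score]   // the receiver may look these up
  hide: [claims.billing, claims.location]           // never disclosed for this call
}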
A simple persistent state declaration:
persistent people: Map<string, User>
This provides a binding to a persistent key-value store to keep track of User objects. We need to separately configure the storage layer to use one of the supported databases, or simply consume the default one — the public decentralised key-value storage, “GStore” (with G for global). It would make sure it always keeps enough replicas of the data close to the nodes with the most traffic flowing through. The cost of reading and writing to this storage would then depend on the amount stored and the bandwidth and latency settings we choose.
Our data would remain persistent even if we remove this declaration from the code. As long as we declare it in the same namespace again, we can regain access (after making sure we can continue paying for it). However, when deploying a new version of the code, we would be prevented from changing the data types in the declaration, to avoid incompatible changes.
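As a rough illustration, the storage binding could be tuned through something like a manifest entry; every field name and value below is an assumption used only to sketch the idea:

// hypothetical storage configuration for the 'people' declaration
storage people {
  backend: 'gstore://public'     // the default public decentralised key-value store
  replicas: { min: 3 }           // kept close to the nodes with the most traffic
  latency: 'p95 < 50ms'          // tighter latency and more bandwidth cost more
  budget: '5 tokens / month'     // access pauses if we stop paying
}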
A separate blockchain, “GSchema”, keeps track of all API schemas and types. It prevents Github developers from removing any fields from the User type, though adding new ones is permitted. This makes it safe to re-use this type in our code and storage, or even expose it again to the public as part of custom aggregate types. The blockchain layer ensures that no backward-incompatible changes can be introduced.
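A small, imagined example of what GSchema would and would not accept when Github publishes a new revision of the User type (the added field is made up):

type User {
  name: string
  email: string
  avatarUrl: string   // new field: accepted, existing consumers keep working
}
// removing 'email' or changing its type would be rejected by the schema
// blockchain before the new version could ever be published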
Running the module
Modules such as the example above can run on a FaaS-like environment we choose from among different providers, based on the desired cost, speed and security. If the data flowing through the module does not need to be particularly secure and can tolerate failures from time to time (resulting in retries), it would make most sense to run it on “GFunc”, a public infrastructure of millions of nodes where anyone can participate to run the code, in exchange for a small fee on demand.
GFunc is relatively reliable, though random slow-downs or retries need to be expected, as there is no guarantee about what kind of node the function will run on. Speed and reliability can be influenced somewhat by adjusting the fee, so that the network tries to move the execution to better-behaving nodes with a higher rating. Also, if other compatible public services are used, e.g. GStore, the data flow is optimised over time for latency by passing through execution and storage nodes located closer to each other.
Many alternatives exist, of course. If you need more bandwidth, better latency or fewer surprises, use a more expensive network (major corporations’ clouds are most reliable). Essentially free options are there too, run by companies for promotional purposes (or possibly on compromised delivery drones…).
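A deployment could then look roughly like the sketch below; the network names, fee units and fields are all hypothetical:

// imagined deployment settings for module Foo
deploy Foo {
  network: 'gfunc://public'              // cheap, occasionally slow or retried
  fallback: 'corpcloud://eu-west'        // pricier network used when retries run out
  fee: { max: '0.002 tokens / call' }    // a higher fee shifts work to better-rated nodes
  retries: 3
}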
URIs, most of the time, are just identifiers, not locators — the called module can potentially run anywhere, in multiple instances. Thanks to dynamic call topology adjustments, GFunc is usually able to run the invocation on the same physical machine as the caller. Because of this, public API calls are nowadays frequently preferred over bundled libraries (with “Open APIs” being the largest body of community-managed free software, always available on the Web).
State Contexts
Handling stored data has been one of the most difficult problems to tackle in IPP. Making stateless Open APIs reliable is relatively easy, but once state is involved, it becomes ever harder to manage over time. Because of bugs or mishandled edge cases, data can become corrupted or unusable, even if it's an append-only log. And if it's a shared multi-tenant state, then maintenance and manual interventions are unavoidable. This is something that specifically needed to be solved in the Open APIs initiative, as the reliability cost and amount of manual work required would otherwise be too high.
Over time, keeping multi-tenant data in a single table or namespace became a huge anti-pattern. Instead, each piece of stored data is explicitly bound to a context, usually the identity of the caller. GStore does not even allow access to any state outside of a context, to prevent accidental (or purposeful) leaks of data between users.
Contexts can also be dynamically created and widened. If, for example, an API allows you to create an organisation, the caller identity becomes the owner of the data stored in the context of this particular organisation. Later, after another identity becomes a member of this organisation, it may gain read-only access to the shared organisation's data.
The owner of a state context is also able to delete all data stored inside it, if they wish to. This may have a transitive effect on data stored in many separate Open APIs, as contexts may depend on each other.
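A minimal sketch of how such contexts might look in code, assuming a hypothetical `context` API and the same imagined language as before (the Organisation and Identity types are also assumptions):

module Orgs {
  persistent orgs: Map<string, Organisation>

  function createOrg(name: string) {
    // a new context owned by the calling identity; everything written inside it
    // stays bound to that context, across every Open API it touches
    ctx = context.create(owner: caller.identity)
    await orgs[ctx.id] = new Organisation(name)
    return ctx.id
  }

  function addMember(orgId: string, member: Identity) {
    // widening the context: the new member gains read-only access
    context.of(orgs[orgId]).grant(member, access: 'read')
  }
}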
If required, it still remains possible to store data without a context. This is generally frowned upon (unless for test purposes), but it's unavoidable for certain use cases, like training neural network models. This kind of state is usually anonymised and kept to a minimum, with the code being well-audited.
Error handling, delivery guarantees
If you are using RPC via IPP, you get some features out of the box, like a default configuration of retries and timeouts. The lowest bound on the delivery guarantee remains “at least once”, so you may encounter duplicates, but in practice this rarely happens thanks to built-in deduplication checks at module boundaries.
Of course, if it's not the infrastructure that's at fault, but your code, you'll need more explicit error handling. Your favourite language construct will do fine (people have still not agreed on the best error-handling approach; this remains an expensive question on prediction markets…), but some ideas inspired by the actor model are also popular, like creating dynamic modules (with or without data contexts) that get destroyed and recreated if errors occur. The rule of thumb should be to do everything you can to prevent errors from affecting multiple callers at the same time. If some bug ultimately slips through and affects a subset of calling identities, automatic refunds and bonuses can be issued. Some organisations even give out quite large bounties to anyone who surfaces errors, to advertise their reliability.
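To sketch both halves (overriding the infrastructure defaults, and actor-style isolation per caller), something like the following imagined constructs could be used; none of these keywords exist outside this example:

// overriding the default retry/timeout configuration for a single call
user = await github.user(id) with { timeout: '2s', retries: 5 }

// an actor-inspired dynamic module per caller context: if it fails,
// only that caller is affected and the instance is recreated on the next call
supervised module UserSync for context(caller) {
  onError: recreate
}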
Infrastructure bugs and transient errors (those exceeding the retry count) are fortunately rare these days — the people running the infrastructure layer are competing for reputation, so they can't afford them.
Exposing modules
Most modules running on this open infrastructure can be easily exposed to the public. Gone are the days of configuring gateways and public IPs (unless you are a developer for the underlying infrastructure), as the network is dynamic and higher-level tools are used in typical user-facing development. You'll just need to define the required permissions and pricing in the module manifest. Individuals and companies running public APIs usually prefer a fee structure based on the complexity of the executed logic, where a cost per byte can be estimated. Others, especially when in control of more innovative data sources or functionality, prefer more fine-grained billing.
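The manifest entry for exposing our example module might then read something like this; the permission and pricing fields are assumptions made up for illustration:

// imagined public manifest for module Foo
expose Foo {
  permissions: [identity.verified]     // callers must present a verified identity
  pricing: {
    model: 'per-complexity'            // estimated cost per byte of executed logic
    rate: '0.0001 tokens / KB'
  }
}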
You'll still need a domain from GNS (the global name system) — it can be registered via a name blockchain after providing personal or organisation identity details, and usually takes a few minutes. Your identity is not exposed to the public, however; the network just needs to automatically make sure your reputation is high enough and that the name does not try to mislead anyone.
Finally, you will point a subdomain to one of the versions of your module. IPP makes it easy to distribute traffic to multiple versions, based on call metadata: origin, identity, or randomly with weights. This is useful at many stages of testing and for canary releases.
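A hypothetical routing entry could express such a split; the domain, version URIs, weights and the metadata override below are invented for illustration:

// imagined GNS routing entry with a weighted canary and a metadata-based override
route 'api.example.gns' {
  'ipfs://example/foo-v12': 95%                                  // stable version
  'ipfs://example/foo-v13': 5%                                   // canary
  when origin == 'partner.example.gns': 'ipfs://example/foo-v13'
}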
Decentralised governance
Ownership of some of the most popular free APIs is maintained by one of the large decentralised, democratic organisations, built around their users' values. One of the largest, OpenOrg (who have their roots in a merger of several major free software movement players of the past), also governs the operation and development of GFunc, GStore and the Open APIs. Their main principle is strict backward compatibility of all code changes, believing that over time it protects the immense reliability and reputation of the whole infrastructure. Other than that, they don't impose many rules, so the majority of developers choose to participate, while still being members of other, more specialised groups.
Maintaining this open infrastructure and software can be a little tricky from a governance perspective. Changing production code is usually a very slow process (adding new fields or functions to a public schema even more so), as it needs to be accepted by a weighted majority of members. Each member's weight is based on their reputation, participation and other factors, like hierarchy in some cases. However, well-designed testing stages can help speed up this process.
Testing
Thanks to the RPC infrastructure, modules we control are not really different from external ones and serve as functionality boundaries (like microservices in the past), so testing each module individually is usually straightforward. Calls to other modules can be mocked out and the database replaced with a test one. You don’t really run the tests locally anymore — the web has all kinds of testing tools: mock generators working on any public schema, traffic generators for performance testing, temporary databases, test data markets.
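A test for the earlier syncUser function might then be a short module of its own; the mock and temporary-store helpers are assumptions about what such web-hosted test tooling could offer:

test 'syncUser stores the fetched user' {
  // a mock generated from the public Github schema
  github = mock of 'ipp://github.com/api-graphql'
  github.user('42').returns({ name: 'Ada', email: 'ada@example.org' })

  // a throwaway database standing in for the persistent declaration
  people = temporary store for Map<string, User>

  await Foo.syncUser('42')
  assert people['42'].name == 'Ada'
}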
Public testnets are the next step and quite an efficient method of experimenting with new functionality. They are very much like the alpha and beta release channels of the past, but have more direct roots in the first blockchains' test networks. Every user of the Open APIs can switch an invocation to a particular testnet if they wish to. That logic is unstable and subject to change, but the owners of the testnet can easily create bug bounties and other incentives (e.g. pay a small fee per invocation) for anyone to try it out. There are of course markets for discovering the best testing opportunities, code reviews and creating new functionality in the first place.
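Switching a single invocation over to a testnet could be as small as the sketch below; the maps module, the testnet name and the `incentive` flag are all made up:

// imagined per-invocation testnet switch, collecting the fee offered by its owner
result = await maps.geocode(address) with {
  network: 'testnet://maps-next'   // unstable logic, subject to change
  incentive: accept
}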
New functionality on testnets is constantly being rated, so a popular and high-quality change has a pretty good chance of being accepted.
Summary
That was a quite subjective vision of how development on the web could look if we follow the current trends towards increasing reliability and decentralisation:
- Market-based forces and standards developed on blockchains: common identity and role system, payment methods, serverless computing and storage fog, rating and reputation networks.
- Large scale DAO-like code and API governance for free software.
- Higher-level protocols on top of IP and HTTPS for gluing code modules together, with caching, retries and exactly-once delivery.
- Versionless APIs, immutability and data control standards.
Back to reality
The story above makes some potentially controversial assumptions:
- The schema/types of an API must remain unversioned and forever backward compatible, and possibly extensible. This is the approach encouraged in GraphQL and Protobuf, and enforced in Linux kernel development and many APIs. While it has drawbacks and can become painful and messy over time, I think it might be the only way to ensure the whole backend of the internet can depend on itself and function with relatively little maintenance. It's very much in the mindset of reproducible builds, which is becoming ever more popular. Hopefully we can learn from situations like the public module removal disasters.
- Data contexts and callers having control over their data could be hard to implement in practice without relying on messy exceptions from the rules when the stakes are very high. But when thinking about the reliability of the web, this seems to me like the only direction we can take.
- Common data schemas. This is a rather less important assumption, but once we have immutable public data structures that we can rely on, perhaps we can largely avoid creating endless custom ones. Think: a universal Github for data. People frequently find it difficult to agree on the definitions of things, but if this approach gains popularity, we might have yet another solid foundation to build on.
- Stored data can be encrypted, but passing state through a pseudo-anonymous computing fog might be prone to privacy issues and data leaks. Perhaps reputation systems or homomorphic encryption will become mature enough to prevent this.
- Nobody knows yet if decentralised organisations can work in practice. But perhaps we will in time find the right balance of incentives and rules to make this kind of sociocracy feasible.
A note on blockchains
There might be a path where blockchains play only a minor role in this change. This is because of the trend towards virtually zero-cost computing and the fact that, in practice, most developers already trust cloud providers with their (or their customers') data. Most of the ideas I mention in this article could just as well take shape in the ubiquitous infrastructure of faceless, closed clouds, where the world's information flows like anonymous drops of water in underground pipes. After all, we have been trusting the water to be protected, and not poisoned, for quite some time now.
Perhaps new cloud subscription models will emerge where, as a developer, paying fees for on-demand processing power and data will be taken for granted, just like access to the internet itself. Cloud standards will emerge over time and, when every online device or app is finally located in the same environment, the free web of pieces of logic, constantly shared and remixed, will forget about the machines and the trust it will be growing on.
Inspirations
- IAM standard: Identity, permission systems found in cloud providers; Solid; blockchain-based ideas
- Serverless logic: Serverless framework
- Serverless datastores: FaunaDB, Amazon Aurora Serverless, CloudSpanner
- Immutability and decentralization: IPFS, also pub-sub; blockchain counterparts, Beaker browser
- Typed data interfaces: GraphCool, ApolloGraphQL, gRPC
- Free funding: OpenCollective and blockchain-based free software funding