Saturday, December 25, 2021

Building a web 3.0 Twitter clone

I am pretty certain that the whole web 3.0 story is a mind-bogglingly stupid idea, but I like to challenge my own views. I thought one way to do that would be to explore how I’d build a web 3.0 Twitter clone. This post is that exploration. I’ll try to suppress my bias as best I can.

What is Web 3.0?

Some years ago some people had the idea of a semantic web: each site machine readable, interconnected and cross-referenced with the others. They called it web 3.0. Unfortunately that idea died - too little interest and too much effort.

Instead, the term web 3.0 got re-purposed by a different set of people to describe some sort of decentralized data network. It’s still a vague idea. Nobody really knows what exactly it should look like, except that it needs to be decentralized and use at least one blockchain. (Source)

Requirements

What do we need to build a Twitter clone?

Twitter consists of users and tweets. Usually I would store those in a database, but in a Web 3.0 world the tweets have to go into a blockchain. I assume users would be a combination of a wallet plus meta data. I’ll ignore the meta data for now as it’s kinda inconvenient.

Transaction rates and storage

I like to do some napkin math when designing systems to get a sense of what kind of load the system has to handle. According to internetlivestats, people currently write an average of 6000 tweets per second (Actually 9796 in real-time the moment I checked). From a pure write load perspective that’s a number a single server could easily handle.

Looking at the maximum, things start to look different. According to a Twitter blog post from 2013, the record is 143,199 tweets per second. I wouldn’t be surprised if this is outdated and the number is higher now, but for the sake of the post let’s stick to it. This is a number single-node systems can struggle with. The Twitter blog post confirms this - they ended up building a sharding layer on top of MySQL.

Now how would we store 6000 tweets per second on a blockchain? How about the 143199?

Looking at etherchain, the current transaction rate is 15 transactions per second. I found other mentions of 36 transactions per second as the limit with Ethereum 1.0. (And apparently Ethereum 2.0 is going to raise this and make everything better, but it also sounds as if Ethereum 2.0 has been in a “coming soon” state for several years now.)

I suppose this means we need to figure out a way to batch multiple tweets into transactions. How many tweets could we fit into a transaction? Ethereum writes multiple transactions into a block. On average a block seems to be about 70 KB in size. (Source)
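To put a number on the batching, the tweet rates can be divided by the transaction rates to get the minimum batch size per transaction. The sketch below is plain arithmetic on the figures quoted in this post. It generously assumes that every transaction on the chain belongs to our Twitter clone, and the function name is made up for illustration.

```python
import math

AVG_TWEETS_PER_SECOND = 6_000      # average rate quoted in the post
PEAK_TWEETS_PER_SECOND = 143_199   # 2013 record quoted in the post
TX_RATES = {"observed": 15, "cited limit": 36}  # Ethereum transactions per second


def tweets_per_transaction(tweets_per_second: int, tx_per_second: int) -> int:
    """Smallest batch size that lets the chain keep up with the tweet rate.

    Assumption: the whole transaction capacity is available to us.
    """
    return math.ceil(tweets_per_second / tx_per_second)


for label, tx_rate in TX_RATES.items():
    avg = tweets_per_transaction(AVG_TWEETS_PER_SECOND, tx_rate)
    peak = tweets_per_transaction(PEAK_TWEETS_PER_SECOND, tx_rate)
    print(f"{label} ({tx_rate} tx/s): {avg} tweets per transaction on average, {peak} at peak")
```

At 15 transactions per second that is 400 tweets per transaction on average and 9547 at the recorded peak.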

A tweet can contain 280 characters and carries various meta data. Let us trim it to the minimum: we have ASCII text (RIP emojis 💀), a 16-byte user id and an 8-byte timestamp, totalling 304 bytes per tweet.

Could we store it compressed? Maybe, probably? If not, we end up needing roughly 1.8 MB of storage per second going by the average of 6000 tweets per second, with a peak of about 43.5 MB per second. That’s way too much for our poor Ethereum blockchain.
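The napkin math can be written down as a few lines of Python. This is only a sketch of the arithmetic above: it reads "about 70 KB" as 70,000 bytes and pretends a block could be filled with nothing but uncompressed tweets, both of which are simplifying assumptions. The constant and function names are invented for this example.

```python
TWEET_TEXT_BYTES = 280   # ASCII only, one byte per character
USER_ID_BYTES = 16
TIMESTAMP_BYTES = 8
TWEET_BYTES = TWEET_TEXT_BYTES + USER_ID_BYTES + TIMESTAMP_BYTES

AVG_BLOCK_BYTES = 70_000  # assumption: "about 70 KB" read as 70,000 bytes


def storage_per_second(tweets_per_second: int) -> int:
    """Uncompressed bytes written per second at the given tweet rate."""
    return tweets_per_second * TWEET_BYTES


def tweets_per_block(block_bytes: int = AVG_BLOCK_BYTES) -> int:
    """Whole tweets that fit into a block holding nothing else."""
    return block_bytes // TWEET_BYTES


print(f"bytes per tweet: {TWEET_BYTES}")
print(f"average: {storage_per_second(6_000) / 1_000_000:.1f} MB/s")
print(f"peak: {storage_per_second(143_199) / 1_000_000:.1f} MB/s")
print(f"tweets per average block: {tweets_per_block()}")
```

Under these assumptions an average block holds about 230 tweets, which is only a few percent of one second of average traffic.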

Layer 2

Since we cannot store everything on the main blockchain we’d need some kind of layer 2 system. Going by this blog about layer 2 there are various different approaches to that. All the approaches have one element in common: They delegate the majority of the work to another system and only piggyback onto the main blockchain for a subset of work to benefit from the security of the blockchain. (Whatever security means in this context?)

The Web 3.0 Architecture post mentions this problem as well and offers IPFS or Swarm as solutions.

This would allow us to store only a tweet id on the blockchain and off-load the actual content elsewhere. Assuming an id requires 16 bytes, we’d be at 96 KB per second for our 6000 tweets per second. That’s closer to the current block sizes, but it still sounds like too much data for the transaction rate we have. If we consider the 143199 maximum, which amounts to more than 2 MB per second, things look even more bleak. There is no way we can store each tweet id on the blockchain.

A solution is to decouple and batch further. We can store only a reference to a batch of tweets on the blockchain and keep the mapping from batch identifier to tweet identifiers in off-chain storage.

 Ethereum Transaction
    ┌─────────────┐
    │TweetsBatchId│
    │  16 Bytes   │                  Off storage
    └─────────┬───┘        ┌────────────────┐        ┌───────────┐
              │            │  TweetsBatch   │        │  Tweet    ├┐
              └───────────►│[6000 tweet ids]│───────►│  Content  │├┐
                           └────────────────┘        └┬──────────┘││
                                                      └┬──────────┘│
                                                       └───────────┘

Or we store the content together with the batched IDs like this:

 Ethereum Transaction
    ┌─────────────┐
    │TweetsBatchId│
    │  16 Bytes   │                  Off storage
    └─────────┬───┘        ┌────────────────┐
              │            │ 6000 tweet ids │
              └───────────►│    + content   │
                           └────────────────┘
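The two layouts can be sketched as plain dictionaries standing in for the off-chain storage. This is an illustration only: the random 16-byte ids, the dictionary names and the helper functions are all assumptions made for this example, not how IPFS or Swarm actually address data.

```python
import uuid
from typing import Dict, List, Tuple

BatchId = bytes   # 16 bytes, the only value written to the blockchain
TweetId = bytes   # 16 bytes


def new_id() -> bytes:
    """Random 16-byte identifier (hypothetical id scheme)."""
    return uuid.uuid4().bytes


# Layout 1: the batch lists tweet ids, content is stored per tweet.
batch_index: Dict[BatchId, List[TweetId]] = {}
tweet_contents: Dict[TweetId, str] = {}

# Layout 2: the batch holds ids and content together.
batch_blobs: Dict[BatchId, List[Tuple[TweetId, str]]] = {}


def read_layout_1(batch_id: BatchId) -> List[str]:
    # One lookup for the batch plus one lookup per tweet.
    return [tweet_contents[tweet_id] for tweet_id in batch_index[batch_id]]


def read_layout_2(batch_id: BatchId) -> List[str]:
    # A single lookup returns everything in the batch.
    return [text for _tweet_id, text in batch_blobs[batch_id]]


# Store the same three tweets in both layouts.
texts = ["hello", "web 3.0", "napkin math"]
ids = [new_id() for _ in texts]
batch_id = new_id()

batch_index[batch_id] = ids
tweet_contents.update(zip(ids, texts))
batch_blobs[batch_id] = list(zip(ids, texts))

print(read_layout_1(batch_id) == read_layout_2(batch_id))
```

The first layout lets a reader fetch a single tweet without pulling the whole batch, at the price of one extra lookup per tweet. The second needs only one lookup per batch but always transfers all of it.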

But who does the batching? I assume this is not possible or not reasonable to do as a smart contract (please correct me if I’m wrong). This means we need an additional entity.


     ┌────────┐
     │Frontend│
     └────┬───┘
          │
          ▼
     ┌────────────────────┐
     │Backend for batching│
     └────┬───────────────┘
          ▼            │
     ┌──────────┐      └─────────┐
     │Blockchain├──────┐         ▼
     └──────────┘      │       ┌───────────┐
                       └──────►│Off-Storage│
                               └───────────┘
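A minimal sketch of such a batching backend could look like the following. Everything here is hypothetical: the class name, the two injected callables standing in for the off-storage and the blockchain, and the choice of a random UUID as the 16-byte batch id. A real backend would also have to flush on a timer and handle failed writes.

```python
import uuid
from typing import Callable, List


class BatchingBackend:
    """Collects tweets and publishes them in batches (illustrative only)."""

    def __init__(
        self,
        batch_size: int,
        store_off_chain: Callable[[bytes, List[str]], None],
        record_on_chain: Callable[[bytes], None],
    ) -> None:
        self._batch_size = batch_size
        self._store_off_chain = store_off_chain
        self._record_on_chain = record_on_chain
        self._pending: List[str] = []

    def submit(self, tweet: str) -> None:
        """Queue a tweet and flush once the batch is full."""
        self._pending.append(tweet)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write the pending batch off-chain, then record its id on-chain."""
        if not self._pending:
            return
        batch_id = uuid.uuid4().bytes  # 16 random bytes
        # Off-chain first, so the chain never points at a missing batch.
        self._store_off_chain(batch_id, list(self._pending))
        self._record_on_chain(batch_id)
        self._pending.clear()


# In-memory stand-ins for the off-storage and the blockchain.
off_storage: dict = {}
chain: List[bytes] = []

backend = BatchingBackend(3, off_storage.__setitem__, chain.append)
for i in range(7):
    backend.submit(f"tweet {i}")
backend.flush()  # publish the incomplete last batch

print(len(chain), [len(off_storage[batch_id]) for batch_id in chain])
```

Note that the backend is a perfectly ordinary centralized service: whoever runs it decides which tweets make it into a batch.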

I think this could work. But we have some more requirements: Mentions and direct messages.

Mentions may require us to send some notifications and to have some sort of read status. In a database that is an easy problem. You have another record with some sort of flag indicating if a message has been read. Once the mention is read you can update the flag. With systems supporting only write/append operations, this becomes a lot more complicated.
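To illustrate the difference: in an append-only system the read flag cannot be updated in place, so the read status has to be derived by replaying events. The following sketch uses a plain list as the append-only log. The event kinds and all names are invented for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Event:
    kind: str      # "mentioned" or "read" (hypothetical event kinds)
    user: str
    tweet_id: str


def unread_mentions(log: Iterable[Event], user: str) -> Set[str]:
    """Replay the whole log to find the mentions a user has not read yet."""
    unread: Set[str] = set()
    for event in log:
        if event.user != user:
            continue
        if event.kind == "mentioned":
            unread.add(event.tweet_id)
        elif event.kind == "read":
            # Nothing is updated in place: a later event cancels an earlier one.
            unread.discard(event.tweet_id)
    return unread


log = [
    Event("mentioned", "alice", "t1"),
    Event("mentioned", "alice", "t2"),
    Event("read", "alice", "t1"),
    Event("mentioned", "bob", "t3"),
]

print(unread_mentions(log, "alice"))
```

Every query has to walk the log, or some other system has to maintain the derived state, which is exactly the kind of index a database would have given us for free.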

Direct messages add another interesting requirement: They must stay private. If we store them in public systems we need encryption and should think about forward secrecy. Deletes are another concern. Almost trivial with a database, a complete pain with a blockchain where we cannot delete anything.

We’d also need to be able to present user specific timelines and offer search functionality among others. This requires additional indices if we want this whole system to perform at a level that’s acceptable for users. Storing the plain tweet contents is not enough.

We are already way past the point where I’d ask “What’s the point of having a batch id stored on the Blockchain?”

Software is a means to an end

Many programmers like puzzles and challenges. We like to tinker and play around with technology. But we should not forget that in almost all cases we’re paid to build systems to solve some problem. Twitter allows us to connect with people, share impressions and information (or the bleak take: Be angry all the time and shout into a void). The software powering the system is a means to an end, not the goal. The majority of end-users do not care about the underlying implementation. I have a hard time imagining a user opening up Twitter and thinking “I wish they stored my tweets in PostgreSQL instead of MySQL”. Some may care about data ownership, but only insofar as it affects their ability to do what they want to do.

A blockchain as used in this post’s exploration adds nothing of value for the user. Any database with public read access would achieve the same at much lower cost and reduced complexity. If you want to build decentralized systems, I recommend learning about federation and ActivityPub and looking at how Matrix or Mastodon work.

If you want to design good and reliable systems, you should take a good look at the requirements and try to utilize as many limitations as possible. Constraints are your best friend. If you don’t need something you drop it to reduce complexity and cost of implementation. If you follow this approach you’ll have trouble finding a problem where a Blockchain is required.

More on Web 3.0