Hey! Let’s build Facebook. Why? Because social networks are hard. Potentially millions of concurrent users, each making potentially numerous reads and writes per second – considering how to handle that is an exercise in architecting robust information schemas. I’ll use Orchestrate in my examples, but these principles don’t much depend on your underlying infrastructure.

This post discusses information architecture, rather than the libraries, frameworks, and code one might use to build a social network. We’ll cover that in an upcoming post.

Posts, Comments, and Likes

To start, let’s consider a post, its comments, and likes on those comments. Each post is an object in your database, but how about comments? Are they their own objects, or do they live inside the post object? Likewise for likes, are they full-blown objects, or can you represent them as an integer on a comment?

Suppose we made them nested objects, like this:

{
    user_id: '...',
    text: '...',
    link: '...',
    mentions: '...',
    comments: [
        {
            user_id: '...',
            text: '...',
            likes: 5
        }
    ]
}

A post has an array of comments, and each comment has an integer representing how many times it’s been liked. So, when someone posts something, we create a post. When someone comments, we update the post. When someone likes a comment, we update the post. What happens if two users update the post at the same time? Say two people post comments in rapid succession, or one person likes a comment as another person is deleting it. Every one of those writes replaces the whole post, so the order in which they land will dictate whether your users experience a confusing error message, data loss, or both.
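
To make the failure concrete, here is a minimal sketch of that race in Python. The in-memory `db` dictionary and the `read` and `write` helpers are hypothetical stand-ins for a database that overwrites documents without locks or version checks; they are not part of any real client library.

```python
import copy

# Hypothetical in-memory "database" holding one nested post document.
db = {"post1": {"text": "hello", "comments": []}}

def read(key):
    # Return a private copy, as a client would receive over the network.
    return copy.deepcopy(db[key])

def write(key, doc):
    # Blindly overwrite the stored document (no lock, no version check).
    db[key] = doc

# Two users read the same version of the post...
alice_view = read("post1")
bob_view = read("post1")

# ...each appends a comment to their own copy...
alice_view["comments"].append({"user_id": "alice", "text": "First!", "likes": 0})
bob_view["comments"].append({"user_id": "bob", "text": "Nice post", "likes": 0})

# ...and both write back. Bob's write lands last and erases Alice's comment.
write("post1", alice_view)
write("post1", bob_view)

surviving_authors = [c["user_id"] for c in db["post1"]["comments"]]
print(surviving_authors)  # only "bob" is left
```

Neither user did anything wrong here; the schema forced two unrelated changes to compete for the same object.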

Instead, suppose posts, comments, and likes were their own objects, like this:

# post
{
    user_id: '...',     # user who wrote the post
    text: '...',
    link: '...',
    mentions: '...'
}

# comment
{
    user_id: '...',     # user who wrote the comment
    post_id: '...',     # post being commented on
    text: '...'
}

# like on a comment
{
    user_id: '...',     # user who liked the comment
    comment_id: '...'   # comment being liked
}

In SQL, we’d make tables for each of them. In Orchestrate, we make collections for each, but it’s the same principle. When a user posts something, we create a post. When someone comments, we create a comment related to the post. When someone likes a comment, we create a like related to the comment.

Unless your database uses locks, which prevent it from handling multiple writes to the same object simultaneously, making each of these concepts its own object ensures users won’t step on each other’s toes. Storing likes as their own objects also lets us store metadata about the like, such as who it came from. That makes unliking something much easier, because we just delete the like object.
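
As a rough illustration, the sketch below models comments and likes as separate objects in Python. The in-memory dictionaries, the function names, and the choice to key each like by comment and user are all assumptions made for this example, not something Orchestrate requires.

```python
import uuid

# Hypothetical in-memory collections; a real app would use database tables
# or collections instead.
comments = {}
likes = {}

def create_comment(user_id, post_id, text):
    # Each comment is its own object, so writing one never touches the post.
    comment_id = uuid.uuid4().hex
    comments[comment_id] = {"user_id": user_id, "post_id": post_id, "text": text}
    return comment_id

def like_comment(user_id, comment_id):
    # Assumption: keying by comment and user makes a repeated like a no-op.
    likes[f"{comment_id}:{user_id}"] = {"user_id": user_id, "comment_id": comment_id}

def unlike_comment(user_id, comment_id):
    # Unliking is just deleting the like object.
    likes.pop(f"{comment_id}:{user_id}", None)

def like_count(comment_id):
    return sum(1 for like in likes.values() if like["comment_id"] == comment_id)

cid = create_comment("alice", "post1", "First!")
like_comment("bob", cid)
like_comment("bob", cid)    # same user again: still one like
like_comment("carol", cid)
unlike_comment("bob", cid)
```

Two users liking the same comment now create two different objects, so neither write can overwrite the other.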

How do you get counts of each, though? In Orchestrate, we return a total_count field for searches, which you can use to get counts of objects, like this:

GET https://api.orchestrate.io/v0/comments?query=post_id:$a_post_id&limit=0

This will return a body like this:

{
    "count": 0,
    "total_count": 4,
    "results": []
}

Dropping &limit=0 will get you the actual comment objects, but leaving those objects out makes the response smaller, and thus faster to receive and process.
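
Here is a small sketch of how an application might build that request and read the count, using only Python's standard library. The helper names `count_url` and `total_count` are made up for this example, and the code only builds the URL and parses a response body; it does not send the request or handle authentication.

```python
import json
from urllib.parse import urlencode

BASE = "https://api.orchestrate.io/v0"

def count_url(collection, field, value):
    # limit=0 asks for the count without any of the matching objects.
    # urlencode percent-escapes the ":" in the query, which is equivalent.
    params = urlencode({"query": f"{field}:{value}", "limit": 0})
    return f"{BASE}/{collection}?{params}"

def total_count(body):
    # Pull total_count out of a search response body like the one above.
    return json.loads(body)["total_count"]

url = count_url("comments", "post_id", "abc123")
sample_body = '{"count": 0, "total_count": 4, "results": []}'
print(url)
print(total_count(sample_body))  # 4
```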

Speaking of performance, how do you handle a tremendous number of comments? Paging!

If a post has even 100 comments, you don’t want to be sending all 100 comments every time a user examines the post. Not even Facebook does that. Instead, send the most recent 20, or 10, or 5, and then provide a link for “See Previous Comments”. This lets users voluntarily request more information, without bogging down your servers and users’ browsers with every single comment every single time. Orchestrate does this paging automatically, and most popular database systems have equivalents.
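
The idea can be sketched in a few lines of Python. This is an in-memory illustration only: the `created` field and the `latest_comments` helper are assumptions for the example, and a real application would push the sorting and slicing down to the database.

```python
def latest_comments(comments, limit=5, offset=0):
    """Return one page of comments, newest first, plus a 'has more' flag."""
    ordered = sorted(comments, key=lambda c: c["created"], reverse=True)
    page = ordered[offset:offset + limit]
    # If older comments remain, the UI can show "See Previous Comments".
    has_more = offset + limit < len(ordered)
    return page, has_more

# 100 fake comments whose "created" value doubles as a timestamp.
all_comments = [{"text": f"comment {i}", "created": i} for i in range(100)]

first_page, more = latest_comments(all_comments, limit=5)
last_page, more_after_last = latest_comments(all_comments, limit=5, offset=95)
```

The first call returns only the five newest comments and signals that more exist; the last call returns the five oldest and signals that nothing is left to load.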

Activity Streams

When users interact, they want to know about it! How do we leave notifications for users? How do we mark them as read?

On Facebook, individual notifications aren’t marked as read. Instead, a timestamp is set whenever you check your notifications. Notifications from before that timestamp are considered read, while those from after are considered unread. This saves us from performing numerous writes to mark individual notifications as read. Instead, we only update one object: the timestamp indicating when notifications were last checked.
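
A minimal sketch of that scheme in Python, assuming each notification carries a numeric `timestamp` field (a made-up shape for this example):

```python
def split_notifications(notifications, last_checked):
    """Partition notifications into (read, unread) using one stored timestamp."""
    read = [n for n in notifications if n["timestamp"] <= last_checked]
    unread = [n for n in notifications if n["timestamp"] > last_checked]
    return read, unread

notifications = [
    {"text": "Bob liked your comment", "timestamp": 100},
    {"text": "Carol commented on your post", "timestamp": 200},
    {"text": "Dan mentioned you", "timestamp": 300},
]

# The user last checked at t=150, so the two later notifications are unread.
read, unread = split_notifications(notifications, last_checked=150)

# "Marking everything as read" is a single write: move the timestamp forward.
last_checked = max(n["timestamp"] for n in notifications)
read_after, unread_after = split_notifications(notifications, last_checked)
```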

It’s good practice for things like posts, comments, likes, etc., to have timestamps indicating when they’ve been created, updated, etc., but who sets those timestamps? Users can provide falsified timestamps, and the clocks on your application servers can get out of sync, so who do you trust to set the time?

Clocks are evil, so we won’t use POSIX timestamps generated by clients or servers. Instead, we’ll use Orchestrate’s Events. They use timestamps which are necessarily increasing for a given key. So, for every notification, we’ll add an Event object. Creating an Event without a timestamp asks Orchestrate to generate one for you. To make sure system clocks never have a chance to interfere, we’ll always use those Orchestrate-generated timestamps to indicate when a user last read their notifications.

Let’s say we have a notifications collection, where each object in the collection has the same primary key as a user. Each object then looks like this:

{
    "notifications_last_checked": "$timestamp/$ordinal"
}

When a user checks their notifications, we grab that timestamp and ordinal, and list events after it, like this:

GET https://api.orchestrate.io/v0/notifications/$user/events?afterEvent=$timestamp/$ordinal

When the notifications are read, our application should update the notifications_last_checked field with the latest event’s timestamp and ordinal, so that future reads will only return new notifications.
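
The bookkeeping might look something like the Python sketch below. It only builds the URL and computes the next marker; it sends no requests. The helper names are hypothetical, and the event dictionaries with `timestamp` and `ordinal` fields are an assumed shape for illustration, so check the actual response format before relying on it.

```python
BASE = "https://api.orchestrate.io/v0"

def unread_events_url(user, last_checked):
    # last_checked is the stored "$timestamp/$ordinal" string.
    return f"{BASE}/notifications/{user}/events?afterEvent={last_checked}"

def newest_marker(events, current):
    """Return the "$timestamp/$ordinal" marker to store after a read."""
    if not events:
        # Nothing new arrived, so keep the existing marker.
        return current
    newest = max(events, key=lambda e: (e["timestamp"], e["ordinal"]))
    return f'{newest["timestamp"]}/{newest["ordinal"]}'

# Assumed event shape, for illustration only.
events = [
    {"timestamp": 1000, "ordinal": 1},
    {"timestamp": 1000, "ordinal": 3},
    {"timestamp": 900, "ordinal": 9},
]

url = unread_events_url("matt", "900/9")
marker = newest_marker(events, "900/9")
```

Because the marker comes from the events themselves rather than from a server clock, a skewed clock on an application server cannot cause notifications to be skipped.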

News Feed

When you first visit Facebook, what do you see? Your news feed! How do we build that? Two steps:

  1. Get all the user’s friends.
  2. Get all the stuff they post, in order of creation.

Facebook doesn’t actually do “in order of creation”, since they enjoy manipulating what you see when, but we will return results in the order they were created, because:

  1. It’s easier.
  2. Performing scientific research on your users without explicit consent is unethical and gross.

I’ll use Orchestrate’s Graph search as an example, but you could use your own database’s equivalent, or build one if it lacks one. (Or, of course, sign up for Orchestrate.)

When users become friends, we’ll create a relationship between the two, like this:

PUT https://api.orchestrate.io/v0/users/$user_id/relation/friends/users/$other_user_id

This creates a relationship from one user to the other of the type “friends”. To make it reciprocal, just reverse $user_id and $other_user_id and PUT again.
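
As a small sketch, a helper like the one below could produce both URLs so the two PUTs are always issued as a pair. The function name is made up for this example, and the code only builds the URLs; it does not send them.

```python
BASE = "https://api.orchestrate.io/v0"

def friendship_urls(user_id, other_user_id):
    """Return the two URLs to PUT so the friendship goes both ways."""
    def relation(source, target):
        return f"{BASE}/users/{source}/relation/friends/users/{target}"
    return [relation(user_id, other_user_id), relation(other_user_id, user_id)]

urls = friendship_urls("matt", "ana")
for url in urls:
    print("PUT", url)
```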

Then, to create our news feed, we’ll get all our user’s friends using graph search:

GET https://api.orchestrate.io/v0/users/$user_id/relation/friends

That will return all a user’s friends in a body like this:

{
    "count": 1,
    "results": [
        {
            "path": {
                "collection": "users",
                "key": "matt",
                "ref": "0acfe7843316529f"
            },
            "value": {
                "age": 23,
                "name": "Matthew Jones"
            }
        }
    ]
}

Then, we’ll take the key of every user in those results, and map them into a Lucene query, like this:

GET https://api.orchestrate.io/v0/posts?query=user_id:($user_id1 OR $user_id2 OR ...)

This will return a paginated list of posts by each of our user’s friends. Bam, news feed.
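
Putting the last two steps together, here is a hedged Python sketch that turns a graph response like the one above into the posts query. The helper names are hypothetical, and the code builds the URL without sending it. It also assumes user keys contain no characters that need escaping in Lucene syntax.

```python
from urllib.parse import urlencode

BASE = "https://api.orchestrate.io/v0"

def friend_keys(graph_response):
    # Each result's key lives under "path", as in the response above.
    return [result["path"]["key"] for result in graph_response["results"]]

def feed_query(keys):
    # Builds: user_id:(key1 OR key2 OR ...)
    return "user_id:(" + " OR ".join(keys) + ")"

def feed_url(keys):
    if not keys:
        # No friends means no feed; skip the search entirely.
        return None
    return f"{BASE}/posts?" + urlencode({"query": feed_query(keys)})

graph_response = {
    "count": 1,
    "results": [
        {
            "path": {"collection": "users", "key": "matt", "ref": "0acfe7843316529f"},
            "value": {"age": 23, "name": "Matthew Jones"},
        }
    ],
}

keys = friend_keys(graph_response)
query = feed_query(keys + ["ana"])
url = feed_url(keys)
```

For users with very many friends the query string grows with the friend list, so a real application would likely need to batch the keys across several searches.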

Conclusions

When building systems for massively concurrent userbases, like social networks, making sure objects can be created or modified by as few users as possible will prevent users from overlapping and overwriting one another’s data. If the only person who can edit a post is its author, then there are few opportunities for problems to arise.

If you decide to build your own social network, hit us up in our community chat and let us know what you’re up to. We’d love to help!

Happy coding!