I am Abby from the System Development Department at auカブコム証券. I am currently developing an SNS-like (Social Networking Service) news feed application for public users.
An SNS is a multi-publisher, multi-consumer system that is usually busy with heavy writes. When persistence is required, the bottleneck tends to concentrate on the database.
The major choices today are relational database management systems (RDBMS, e.g. MySQL), document-based databases (e.g. MongoDB), and distributed databases (e.g. DynamoDB, Apache Cassandra). Roughly speaking, these three choices sit at different corners of the classic CAP theorem triangle, each trading consistency, availability, and partition tolerance differently.
For the reasons described below, I chose DynamoDB (from AWS) as the storage.
This article explains the thinking behind that choice.
Storage
Why not RDBMS
An RDBMS is attractive because it accepts simple queries through SQL and supports ACID operations. However, there are major problems with using a traditional RDBMS such as PostgreSQL or MySQL here.
- Theoretically limited write rate. The usual clustering approach, master-slave replication, does not increase the write rate at all, because every write still goes through the single master.
- Partitioning workflow. To scale out a traditional RDBMS, we need to define partitions. Partitions are created either at table creation time or later by a stored procedure run from a scheduled event inside the SQL server. It works, but I wanted a neater way.
- Operational cost. Managing custom partitions and cluster backups adds to the DBAs' workload, and so increases future operating costs.
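To make the partitioning burden a little more concrete, here is a minimal sketch of application-level shard routing, assuming we split users across a fixed number of database shards by hashing the user ID. The function name `shardFor` and the shard count are hypothetical and are not taken from any database driver.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a user ID to one of n shards by hashing it.
// Hypothetical helper for illustration; not part of any database driver.
func shardFor(userID string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % n
}

func main() {
	const shards = 4 // assumed shard count
	for _, id := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s -> shard %d\n", id, shardFor(id, shards))
	}
}
```

With plain modulo hashing like this, changing the shard count remaps most keys, so resharding, along with backups per shard, becomes work that someone has to plan and operate.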
Why not MongoDB
Actually, MongoDB was the first choice in my mind because of its flexible queries, index building, and rapid record insertion. However, an SNS needs to store more than just the news itself, for example timelines and indexes for filtering.
MongoDB is known for running large clusters and for providing quick access to a given range of documents/records. However, large indexes in MongoDB do not support partitioning; with this limitation, it is best used as a key-to-document store with local indexing. This may not suit long-term use when timelines and news feeds have to be stored with different indexing attributes.
So in the end, I decided that DynamoDB would be a workable and easy start.
DynamoDB
Similar to Cassandra, DynamoDB is good at availability and partition-tolerance.
It provides a nearly unlimited write rate and eventually consistent reads, which suits a multi-producer SNS well, since SNS consumers do not need to see the very latest data. It also provides an API for strongly consistent reads when they are needed (e.g. for administration).
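The difference between the two read modes can be sketched with a toy model of one leader and one lagging replica. This is only an illustration of the semantics and is not how DynamoDB is implemented; the `store` type and its methods are hypothetical. In the AWS SDK, the corresponding switch is the `ConsistentRead` field on read requests.

```go
package main

import "fmt"

// store is a toy model of one leader and one lagging replica.
// It is NOT how DynamoDB works internally; it only illustrates read semantics.
type store struct {
	leader  map[string]string
	replica map[string]string
}

func newStore() *store {
	return &store{leader: map[string]string{}, replica: map[string]string{}}
}

// put writes to the leader only; the replica catches up on replicate().
func (s *store) put(k, v string) { s.leader[k] = v }

// replicate copies the leader's current state to the replica.
func (s *store) replicate() {
	for k, v := range s.leader {
		s.replica[k] = v
	}
}

// get reads from the leader when consistent is true, otherwise from the replica.
func (s *store) get(k string, consistent bool) (string, bool) {
	if consistent {
		v, ok := s.leader[k]
		return v, ok
	}
	v, ok := s.replica[k]
	return v, ok
}

func main() {
	s := newStore()
	s.put("post:1", "hello")

	_, ok := s.get("post:1", false)
	fmt.Println("eventually consistent read sees the post:", ok)

	v, _ := s.get("post:1", true)
	fmt.Println("consistent read:", v)

	s.replicate()
	v, _ = s.get("post:1", false)
	fmt.Println("after replication:", v)
}
```

For a news feed, a reader missing a post for a moment is acceptable, which is why the cheaper eventually consistent mode fits most of the traffic.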
Most importantly, it is a managed DBaaS, so I don't need to manage instances or scaling groups myself. This is not an advertisement for AWS, but I think the prices of write units, read units, and storage for DynamoDB are reasonable for a managed service.
I do see one major drawback of DynamoDB besides its rigid index settings: the price can grow very quickly, because a GSI (global secondary index) on a table is essentially a duplicate of that table. Although you can limit its WCU (write capacity units), RCU (read capacity units), and projected attributes, the storage cost can grow much faster than with the usual MySQL+Redis combo if you use indexes the way you would in a normal RDBMS.
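A rough back-of-the-envelope estimate shows how indexes multiply storage. The sketch below assumes every item appears in every index and ignores per-item overhead; the `gsi` struct, the index names, and the sizes are made-up inputs, not measurements or AWS prices.

```go
package main

import "fmt"

// gsi describes one global secondary index (hypothetical struct for this estimate).
type gsi struct {
	name            string
	projectedFactor float64 // fraction of each item's size copied into the index (1.0 = all attributes)
}

// totalStorageGB estimates table plus index storage.
// Assumes every item is present in every index and ignores per-item overhead.
func totalStorageGB(baseGB float64, indexes []gsi) float64 {
	total := baseGB
	for _, ix := range indexes {
		total += baseGB * ix.projectedFactor
	}
	return total
}

func main() {
	indexes := []gsi{
		{name: "byAuthor", projectedFactor: 1.0}, // projects all attributes
		{name: "byTag", projectedFactor: 0.25},   // projects a quarter of each item
	}
	fmt.Printf("%.1f GB\n", totalStorageGB(100, indexes)) // prints "225.0 GB"
}
```

In this example a 100 GB table becomes 225 GB of billed storage, which is why narrowing the projection of each index matters.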
Language and Libraries
Golang
We considered Java, Golang, and NodeJS as the server language, because each has a well-known mechanism for I/O-heavy systems (goroutines in Golang, NIO in Java, non-blocking callbacks in NodeJS) and can therefore provide better throughput per instance or process. For ease of management we preferred a more strongly typed language, so Java and Golang were shortlisted. They are very similar for backend development, and I chose Golang for its more lightweight code base and its smaller, faster Docker images (for AWS ECS). (Golang compiles a very small runtime into the binary, so the Docker image does not need a JVM-like runtime.)
In addition, unlike MongoDB, DynamoDB expects a fixed type for key attributes and for the members of a set, so a dynamic language like JavaScript (JS) loses part of its flexibility there.
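As a small illustration of the goroutine model mentioned above, the sketch below runs several simulated I/O-bound calls concurrently. `fetchFeed` is a hypothetical stand-in that only sleeps; a real handler would call the database instead.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchFeed simulates an I/O-bound call (e.g. a database read) for one user.
// Hypothetical stand-in: it just sleeps and returns a string.
func fetchFeed(userID int) string {
	time.Sleep(10 * time.Millisecond)
	return fmt.Sprintf("feed-for-%d", userID)
}

// fetchAll runs fetchFeed for every user concurrently, one goroutine each,
// and returns the results in the same order as userIDs.
func fetchAll(userIDs []int) []string {
	results := make([]string, len(userIDs))
	var wg sync.WaitGroup
	for i, id := range userIDs {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			results[i] = fetchFeed(id) // each goroutine writes only its own slot
		}(i, id)
	}
	wg.Wait()
	return results
}

func main() {
	start := time.Now()
	feeds := fetchAll([]int{1, 2, 3, 4, 5})
	fmt.Println(feeds, "in", time.Since(start).Round(time.Millisecond))
}
```

Because the waits overlap, the total time stays close to one call's latency rather than the sum of all five, which is the throughput benefit I was looking for.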
Guregu
Compared with Cassandra, DynamoDB is often criticized for having a very complex query syntax, whereas Cassandra has the SQL-like CQL. I personally agree with this criticism.
Reading a record with the AWS SDK alone is not very handy, because queries must spell out DynamoDB's own data types rather than the language's native ones:
    tableName := "Movies"
    movieName := "The Big New Movie"
    movieYear := "2015"

    result, err := svc.GetItem(&dynamodb.GetItemInput{
        TableName: aws.String(tableName),
        Key: map[string]*dynamodb.AttributeValue{
            "Year": {
                N: aws.String(movieYear),
            },
            "Title": {
                S: aws.String(movieName),
            },
        },
    })
    if err != nil {
        fmt.Println(err.Error())
        return
    }
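To show what those `N` and `S` wrappers stand for, here is a minimal sketch that prints the typed key in DynamoDB's JSON shape, where a number is sent as a string under `N`. The `attributeValue` struct is a local stand-in covering only two types, not the SDK's real type, and `movieKey` and `movieKeyJSON` are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// attributeValue mimics the shape of a DynamoDB attribute value for two types only.
// Local stand-in for illustration; the real SDK type has many more fields.
type attributeValue struct {
	S *string `json:"S,omitempty"` // string
	N *string `json:"N,omitempty"` // number, carried as a string
}

func str(s string) *string { return &s }

// movieKey builds the typed key for the Movies example.
func movieKey(year, title string) map[string]attributeValue {
	return map[string]attributeValue{
		"Year":  {N: str(year)},
		"Title": {S: str(title)},
	}
}

// movieKeyJSON renders the key as JSON (map keys are emitted in sorted order).
func movieKeyJSON(year, title string) (string, error) {
	b, err := json.Marshal(movieKey(year, title))
	return string(b), err
}

func main() {
	out, err := movieKeyJSON("2015", "The Big New Movie")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"Title":{"S":"The Big New Movie"},"Year":{"N":"2015"}}
}
```

Every value has to be tagged with its DynamoDB type like this, which is the boilerplate a wrapper library can hide.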
So I decided to use Guregu (the guregu/dynamo library), which eases the type-wrapping work needed with the AWS SDK. (The SDK wraps values in pointers so that a null value can be represented.)
With it, the query becomes:
    var result widget
    err = table.Get("UserID", w.UserID).
        Range("Time", dynamo.Equal, w.Time).
        Filter("'Count' = ? AND $ = ?", w.Count, "Message", w.Msg). // placeholders in expressions
        One(&result)
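The query above relies on a `widget` struct that is not shown. A plausible definition, assuming the library's `dynamo` struct-tag convention for renaming and omitting attributes, might look like the sketch below. The field list is inferred from the query, and `attrNames` is a simplified hypothetical helper written only for this illustration, not the library's own logic.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
	"time"
)

// widget mirrors the item used in the query above.
// Field names and tags are assumptions inferred from that query.
type widget struct {
	UserID int       // partition (hash) key
	Time   time.Time // sort (range) key
	Msg    string    `dynamo:"Message"`    // stored under the attribute name "Message"
	Count  int       `dynamo:",omitempty"` // omitted when zero
}

// attrNames maps each Go field name to the attribute name it would be stored
// under: the name in the dynamo tag if present, otherwise the field name.
// v must be a struct value. Simplified illustration only.
func attrNames(v interface{}) map[string]string {
	out := map[string]string{}
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name := strings.Split(f.Tag.Get("dynamo"), ",")[0]
		if name == "" {
			name = f.Name
		}
		out[f.Name] = name
	}
	return out
}

func main() {
	fmt.Println(attrNames(widget{}))
}
```

This is why the filter expression refers to `Message` while the Go code reads `w.Msg`: the struct keeps native Go types, and the tag carries the database-side name.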
I was quite worried about hidden bugs in Guregu, but I have been happy with it so far. The code base of the library is also quite readable, which gives me confidence that I can fix it myself if an issue comes up.
Conclusion
These are some of the choices behind this backend system. Unlike the front end, where the resources are usually static, for the backend I focused on keeping the maintenance cost low so that the system can be kept running for many years. I think cloud-friendly components are our friends these days. There is quite obvious vendor lock-in to AWS, but I see very little reason for AWS to lose its leading role in Japan's IT industry.