Comparison of WebSocket Frameworks for Data Publication

Story

For serving public-facing web services, there are many well-designed web servers like Apache and NGINX, and cloud services like AWS API Gateway. However, to deliver timelier data to more customers, it is common to use WebSockets, eliminating the overhead and repeated handshakes introduced by HTTP polling.

At kabu.com, WebSocket is a good fit for streaming market data to our customers through standard HTML5 features. Since the number of connections is expected to be very high, it is crucial to find a good WebSocket library to minimize the total operational cost.

Scenario Setup

In the end, I ran two different tests: (1) an echo server, and (2) subscribing to a message channel and receiving a large amount of data.

Why not the Autobahn tests? Autobahn aims at correctness checking. Using Autobahn as a stress test only suits comparisons of general usage, whereas this time we aimed at the specific case of publishing to a large number of subscribers.

Choosing Candidates

Gorilla in Golang

Before running the tests, I personally expected Gorilla/websocket to be one of the best performers.

Gorilla is an experienced team developing networking libraries. Thanks to Go's famous goroutine scheduler, and to a popular presentation on serving one million WebSocket connections in Go (Link) with less than 1 GB of RAM, it should be a good fit for a large number of WebSocket connections.

Gorilla/websocket is more popular than Gobwas/ws, so it was chosen as the first candidate.
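
For reference, scenario 1 (the echo server) takes only a few dozen lines with Gorilla. The sketch below is an illustrative assumption rather than the exact server used in the tests; only the /echo path and port 8080 are taken from the Artillery target shown later.

// Minimal Gorilla/websocket echo server (illustrative sketch).
package main

import (
    "log"
    "net/http"

    "github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
    // Accept any origin; acceptable inside a closed load-testing environment.
    CheckOrigin: func(r *http.Request) bool { return true },
}

func echo(w http.ResponseWriter, r *http.Request) {
    c, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Println("upgrade:", err)
        return
    }
    defer c.Close()
    for {
        mt, msg, err := c.ReadMessage()
        if err != nil {
            return
        }
        // Echo the payload back with the same message type.
        if err := c.WriteMessage(mt, msg); err != nil {
            return
        }
    }
}

func main() {
    http.HandleFunc("/echo", echo)
    log.Fatal(http.ListenAndServe(":8080", nil))
}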

Netty

Netty is the fastest (Reference) asynchronous networking library in Java, and the battle-tested JVM should be trustworthy.

The famous asynchronous Java framework Vert.x (https://vertx.io/) is built on Netty, while Netty itself gives more control over binary data (which will suit our later business case better).

Ktor in Kotlin

Kotlin is interoperable with Java and has much richer language features. Ktor runs on top of Netty, so I did not expect much difference between it and Vert.x. However, if it performed more or less like Netty, I would rather adopt it for the cleaner implementation of the final program, thereby reducing potential technical debt.

websockets_ws in JavaScript

In our business case, more computation is performed on each message, so JavaScript is not the preferred choice at first sight. However, V8's asynchronous model is performant, and the event-driven model of Node.js on top of V8 has long been a popular choice for web servers, including WebSocket servers.

So I still wanted to give websockets_ws in JavaScript a basic test first. If it performed similarly to Java/Go, I would drill down deeper. In my experience, JavaScript usually offers a very good development-time-to-concurrency ratio.

Tungstenite (and WebSocket) in Rust

I considered WebSocket (https://github.com/websockets-rs/rust-websocket) first, but the author warns that

Note: Maintainership of this project is slugglish. You may want to use tungstenite or tokio-tungstenite instead.

so I switched to Tungstenite.

The usage of the two is similar, but neither offered a very up-to-date asynchronous model to test with. (Side note: I usually work with Rust and am aware that the async syntax now produces standard futures; however, tokio-tungstenite, the semi-official async release, still uses futures 0.1.x, which is not compatible with the "modern" async executors.)

WebSocket in Java

WebSocket is another implementation, this time in Java.

It is not as famous as Netty, but it builds on java.nio, Java's low-level asynchronous I/O API. It looks minimalist enough, and it is an active project with a large enough user base.

Test Summary

Artillery as the testing tool

I tried Artillery first. It is modern, powerful, and easy to use, and it has a very nice feature for scripting test procedures with JavaScript support. For example (this is also my actual testing script):

config:
  target: "ws://p2:8080/echo"
  processor: "./my-functions.js"
  phases:
    - duration: 5
      arrivalRate: 1000
  ws:
    # Set a custom subprotocol:
    Sec-WebSocket-Protocol: abby-test
scenarios:
  - engine: "ws"
    flow:
      - send: "test"

However, it had very high CPU usage (which triggered its runtime warning), and that could potentially harm the accuracy of the measurements when I tried to push arrivalRate higher.

I am not blaming it, as it is written in JavaScript. Although it may not do benchmarking well, it is still a very good tool for functional tests and for stress tests in production, in my opinion.

Results from Artillery

I completed an echo test (sending the 4-byte message "test") with Artillery. All libraries performed with a 99th-percentile latency of 0.1 or 0.2 ms at 897 requests/s, where I had targeted 1000/s in the settings. The upper limit of Artillery's achievable request rate is relatively low.

Self-developed stress-testing client using Gorilla/websocket

Therefore, for a more realistic test, I wrote my own client in Go on top of the Gorilla library.

The test was done with the following setup:

+-------------------+      +----------------------------------+
|                   |      |                                  |
| AWS EC2 r4.xlarge |2500  |                                  |
|                   +------>                                  |
| (Client)          |      |                                  |
+-------------------+      |                                  |
                           |      AWS EC2 r4.xlarge           |
+-------------------+      |      (Server)                    |
|                   |      |                                  |
| AWS EC2 r4.xlarge |2500  |                                  |
|                   +------>                                  |
| (Client)          |      |                                  |
+-------------------+      +----------------------------------+

(r4.xlarge: 4 vCPUs, 30.5 GiB RAM, EBS storage)
  • 5000 WebSocket connections are opened from the client machines.
  • After receiving a heartbeat signal, the server pushes 10 WebSocket messages back to the sender.
  • Each requesting routine in each client invokes the server every second and checks that 10 × 10,000-byte messages arrive. (A sketch of the server side of this exchange follows the diagram below.)
+--Every Second--------------------+
|                                  |
|  +---------+ seq    +---------+  |
|  |         +-------->         |  |
|  | client  |        | Server  |  |
|  |         <--------+         |  |
|  |         |10KB*10 |         |  |
|  |         <--------+         |  |
|  |         <--------|         |  |
|  |         <--------|         |  |
|  |         <--------|         |  |
|  |         <--------|         |  |
|  |         <--------|         |  |
|  |         <--------|         |  |
|  |         <--------|         |  |
|  |         <--------+         |  |
|  |         |        |         |  |
|  |         |        |         |  |
|  +---------+        +---------+  |
|                                  |
+----------------------------------+
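
To make the server's side of this exchange concrete, here is a hedged sketch in Go with Gorilla (the actual test servers were written once per candidate library and are not reproduced in this post; the /push path is a made-up name): it blocks on each heartbeat and pushes back 10 messages of 10,000 bytes.

// Illustrative sketch of the per-connection server logic for scenario 2.
package main

import (
    "log"
    "net/http"

    "github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
    CheckOrigin: func(r *http.Request) bool { return true },
}

func push(w http.ResponseWriter, r *http.Request) {
    c, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Println("upgrade:", err)
        return
    }
    defer c.Close()
    payload := make([]byte, 10000) // one 10,000-byte message
    for {
        // Block until the client's per-second heartbeat arrives.
        if _, _, err := c.ReadMessage(); err != nil {
            return
        }
        // Push 10 messages back over the same connection.
        for i := 0; i < 10; i++ {
            if err := c.WriteMessage(websocket.BinaryMessage, payload); err != nil {
                return
            }
        }
    }
}

func main() {
    http.HandleFunc("/push", push)
    log.Fatal(http.ListenAndServe(":8080", nil))
}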

Tester code excerpt:

//...
func main() {
    if runtime.GOOS == "linux" {
        // Raise the file-descriptor limit to the hard maximum
        var rLimit syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rLimit); err != nil {
            panic(err)
        }
        rLimit.Cur = rLimit.Max
        if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rLimit); err != nil {
            panic(err)
        }
    }
    var wg sync.WaitGroup
    // Number of connections to simulate, from the first CLI argument.
    v, err := strconv.Atoi(os.Args[1])
    if err != nil {
        panic(err)
    }
    for k := 0; k < v; k++ {
        wg.Add(1)
        go goZilla(&wg, k)
    }
//...
}

func goZilla(wg *sync.WaitGroup, id int) {
//...
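    // Reader goroutine: drain the server's pushed messages until the
    // connection closes or errors.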
    go func() {
        defer close(done)
        for {
            _, _, err := c.ReadMessage()
//...
        }
    }()

    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
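    // This ticker drives the per-second heartbeat (the send itself is elided below).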

    for {
        select {
//...
        case <-interrupt:
            log.Println("interrupt")

            err := c.WriteMessage(websocket.CloseMessage, websocket.FormatCloseMessage(websocket.CloseNormalClosure, ""))
            if err != nil {
                log.Println("write close:", err)
                return
            }
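            // Wait for the reader goroutine to observe the server's close frame, or give up after one second.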
            select {
            case <-done:
            case <-time.After(time.Second):
            }
            fmt.Printf("Worker %v: Finished\n", id)
            return
        }
    }

}

A little hint here: be sure to raise syscall.RLIMIT_NOFILE on Linux so the process can use the maximum number of file descriptors the OS provides. The same practice is applied to the Gorilla server as well.

Result

I recorded the upper limit of clients that each server could support without runtime errors.

I ran the test with go run tester.go 2500 500 on the two client machines, simulating 2500 × 2 connections. The results below list the completion time reported by each client, followed by the server's CPU and RAM usage:

  • Gorilla (Go): 65.5 s / 64.1 s, 293% CPU, 1.5% RAM
  • Netty (Java): 57.54 s / 57.57 s, 184% CPU, 6.8% RAM
  • Ktor (Kotlin): N/A
  • websockets_ws (Node.js): 400% CPU, 3% RAM -> timeout
  • Tungstenite (Rust): TCP connection error on the client
  • WebSocket (Rust): N/A

Details and Comments

To summarize, in order to serve 5000 concurrent connections,

  • Gorilla (Go): 65.5 s / 64.1 s, 293% CPU, 1.5% RAM
  • Netty (Java): 57.54 s / 57.57 s, 184% CPU, 6.8% RAM

are the only viable choices on this list.

Analyzing the failed cases:

websockets_ws (Node.js)

It ran at 400% CPU and 3% RAM, but connections were being cut due to internal timeout issues.

Since V8 uses only one CPU, I scaled the server out with the cluster package (Link), but it still could not keep up with 5000 concurrent connections per node.

It showed timeout issues even when concurrency was lowered to 2500 + 1000 = 3000.

It did finish the test at 2500 concurrency, using 400% CPU and 3% RAM.

Comment: usable behind an auto-scaling group, but far less efficient than Gorilla and Netty.

Tungstenite (Rust)

A "too many open files" error was shown:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', src/libcore/result.rs:1165:5

Running out of file descriptors on Linux is a clear sign of exhausting the TCP connection limit.

I configured the FD limit using the Rust crate rlimit:

const SOFT: rlim = 8 * 1024 * 1024;
const HARD: rlim = 8 * 1024 * 1024;
// NOFILE (not FSIZE) is the resource that caps open file descriptors.
assert!(setrlimit(Resource::NOFILE, SOFT, HARD).is_ok());

I tried setting the limit to 8 * 1024 * 1024, 0, and 8000 * 1024 * 1024 * 1000 for both the soft and hard limits; none of them let me pass even a 2500-connections-per-client test.

When running the test with 1000 concurrent connections, it finished in 149 seconds (abnormally slow, so I count it as a failure), although it maintained resource usage of 400% CPU and 0.1% RAM.

(The stackful coroutine library may was used for asynchronous, non-blocking concurrency.)

Comment: a potential choice for small nodes after deeper investigation. Error handling for exceeding the TCP connection limit, plus careful "check, wait, reuse" logic for connections and green threads, would also need to be programmed.

WebSocket (Rust)

As this library's API is based on iterating over a single server object, extra work on the program architecture is required to make it concurrent.

In short, it works correctly, but it serves very slowly because no concurrency is introduced.

I introduced the stackful coroutine library may, and it turned out to handle only 4 connections (it looks like a scheduler misjudgment left tasks occupying the logical threads). It would have taken a very long time to finish the whole test.

Comment: I cannot draw a conclusion from this result, but because the repository is stale, I do not recommend it.

Conclusions

websockets_ws in Node.js is totally usable for small nodes behind auto-scaling groups, but the server running cost would clearly be higher than with Netty or Gorilla.

Gorilla in Go and Netty in Java are better for larger nodes; Gorilla has a smaller memory footprint, while Netty has lower CPU usage.

Although it is a bit subjective, Go has better code readability, and its compiled binary ships with the runtime built in, which is also friendlier to cloud-native environments. Thus I give the win to Gorilla/websocket for this specific case, with Netty a very close runner-up.

Gorilla!? (Creative Commons: https://commons.wikimedia.org/wiki/File:Gorilla_gorilla_gorilla8.jpg)

Last but not least, memory usage could be much higher when reading from real message queues, and production servers will also need extra TCP I/O for calls such as token authentication, data deserialization, and so on. A real-world application is not always purely I/O-heavy, even for a WebSocket gateway.