概要

いろいろとベンチマークを取った時のメモ

参考：

An Overview of Go's Tooling - Alex Edwards ← 超イイ
High Performance Go Workshop ← 神
Effective Go - The Go Programming Language ← 公式
Go slice ベストプラクティス - Qiita

【題材１】sliceのappend

参考： Bad Go: not sizing slices. Why is it better to set the capacity on… | by Phil Pearl | The Startup | Medium

以下（大体）処理速度が遅い順番に記載

BenchmarkTest1: 単純なappend
BenchmarkTest2: 事前に追加するデータの個数分長さを確保してからappend
BenchmarkTest3: 事前に追加するデータの個数分容量を確保してからappend
BenchmarkTest4: 事前に追加するデータの個数分長さを確保してから要素を指定して代入

package main

import (
    "fmt"
    "testing"
)

var N = 1000000

func BenchmarkTest1(b *testing.B) {
    b.ResetTimer()
    list := []string{}
    for i := 0; i < N; i++ {
        list = append(list, fmt.Sprintf("%d", i))
    }
}

func BenchmarkTest2(b *testing.B) {
    b.ResetTimer()
    list := make([]string, N)
    for i := 0; i < N; i++ {
        list = append(list, fmt.Sprintf("%d", i))
    }
}

func BenchmarkTest3(b *testing.B) {
    b.ResetTimer()
    list := make([]string, 0, N)
    for i := 0; i < N; i++ {
        list = append(list, fmt.Sprintf("%d", i))
    }
}

func BenchmarkTest4(b *testing.B) {
    b.ResetTimer()
    list := make([]string, N)
    for i := 0; i < N; i++ {
        list[i] = fmt.Sprintf("%d", i)
    }
}

結果

Ubuntuでやったとき

結果の見方

左から順に、
ループが実行された回数
１ループごとの所要時間
１ループごとのアロケーションされたバイト数
１ループごとのアロケーション回数

引数に.を指定してカレントディレクトリ以下のベンチマークを実行している

$go test -bench . -benchmem -count=4
goos: linux
goarch: amd64

BenchmarkTest1  1000000000               0.269 ns/op           0 B/op          0 allocs/op
BenchmarkTest1  1000000000               0.277 ns/op           0 B/op          0 allocs/op
BenchmarkTest1  1000000000               0.294 ns/op           0 B/op          0 allocs/op
BenchmarkTest1  1000000000               0.297 ns/op           0 B/op          0 allocs/op
BenchmarkTest2  1000000000               0.240 ns/op           0 B/op          0 allocs/op
BenchmarkTest2  1000000000               0.272 ns/op           0 B/op          0 allocs/op
BenchmarkTest2  1000000000               0.251 ns/op           0 B/op          0 allocs/op
BenchmarkTest2  1000000000               0.266 ns/op           0 B/op          0 allocs/op
BenchmarkTest3  1000000000               0.129 ns/op           0 B/op          0 allocs/op
BenchmarkTest3  1000000000               0.101 ns/op           0 B/op          0 allocs/op
BenchmarkTest3  1000000000               0.101 ns/op           0 B/op          0 allocs/op
BenchmarkTest3  1000000000               0.0998 ns/op          0 B/op          0 allocs/op
BenchmarkTest4  1000000000               0.0965 ns/op          0 B/op          0 allocs/op
BenchmarkTest4  1000000000               0.103 ns/op           0 B/op          0 allocs/op
BenchmarkTest4  1000000000               0.102 ns/op           0 B/op          0 allocs/op
BenchmarkTest4  1000000000               0.101 ns/op           0 B/op          0 allocs/op
PASS
ok           43.708s

Macでやったとき

----result----
$go test -bench . -benchmem -count=4
goos: darwin
goarch: amd64

BenchmarkTest1-8        1000000000               0.159 ns/op           0 B/op          0 allocs/op
BenchmarkTest1-8        1000000000               0.174 ns/op           0 B/op          0 allocs/op
BenchmarkTest1-8        1000000000               0.174 ns/op           0 B/op          0 allocs/op
BenchmarkTest1-8        1000000000               0.178 ns/op           0 B/op          0 allocs/op
BenchmarkTest2-8        1000000000               0.195 ns/op           0 B/op          0 allocs/op
BenchmarkTest2-8        1000000000               0.197 ns/op           0 B/op          0 allocs/op
BenchmarkTest2-8        1000000000               0.198 ns/op           0 B/op          0 allocs/op
BenchmarkTest2-8        1000000000               0.195 ns/op           0 B/op          0 allocs/op
BenchmarkTest3-8        1000000000               0.108 ns/op           0 B/op          0 allocs/op
BenchmarkTest3-8        1000000000               0.102 ns/op           0 B/op          0 allocs/op
BenchmarkTest3-8        1000000000               0.108 ns/op           0 B/op          0 allocs/op
BenchmarkTest3-8        1000000000               0.102 ns/op           0 B/op          0 allocs/op
BenchmarkTest4-8        1000000000               0.104 ns/op           0 B/op          0 allocs/op
BenchmarkTest4-8        1000000000               0.105 ns/op           0 B/op          0 allocs/op
BenchmarkTest4-8        1000000000               0.104 ns/op           0 B/op          0 allocs/op
BenchmarkTest4-8        1000000000               0.102 ns/op           0 B/op          0 allocs/op
PASS

【題材２】文字列の二次元Sliceをinterfaceの二次元Sliceに変換する

文字列の二次元Sliceをinterfaceの二次元Sliceにする

やっていることは上の話と同じ部分が多い

性能比較用の関数定義

以下のような関数を用意して性能を比較する

convertStringSlicesToInterfaceSlices1
- なんの工夫もないappend
convertStringSlicesToInterfaceSlices2
- 事前にmake(interface{}, 0, len(ss))で格納するスライスの長さ分容量を確保してからappend
convertStringSlicesToInterfaceSlices3
- 事前にmake(interface{}, len(ss))で格納するスライスの長さ分のスライスを用意して、要素を指定して格納
convertStringSlicesToInterfaceSlices4
- string から interfaceのの変換部分をreflectで行う関数interfaceSliceを使うように、convertStringSlicesToInterfaceSlices3を一部書き換え
- reflectは遅いと聞いているので本当か確認する
- 参考：go - Type converting slices of interfaces - Stack Overflow

func convertStringSlicesToInterfaceSlices1(sss [][]string) [][]interface{} {
    var iss [][]interface{}
    for _, ss := range sss {
        var is []interface{}
        for _, s := range ss {
            is = append(is, s)
        }
        iss = append(iss, is)
    }
    return iss
}

func convertStringSlicesToInterfaceSlices2(sss [][]string) [][]interface{} {
    iss := make([][]interface{}, 0, len(sss))
    for _, ss := range sss {
        is := make([]interface{}, 0, len(ss))
        for _, s := range ss {
            is = append(is, s)
        }
        iss = append(iss, is)
    }
    return iss
}

func convertStringSlicesToInterfaceSlices3(sss [][]string) [][]interface{} {
    iss := make([][]interface{}, len(sss))
    for i, ss := range sss {
        is := make([]interface{}, len(ss))
        for j, s := range ss {
            is[j] = s
        }
        iss[i] = is
    }
    return iss
}

func convertStringSlicesToInterfaceSlices4(sss [][]string) [][]interface{} {
    iss := make([][]interface{}, len(sss))
    for i, ss := range sss {
        iss[i] = interfaceSlice(ss)
    }
    return iss
}

func interfaceSlice(slice interface{}) []interface{} {
    s := reflect.ValueOf(slice)
    if s.Kind() != reflect.Slice {
        panic("interfaceSlice() given a non-slice type")
    }

    ret := make([]interface{}, s.Len())
    for i := 0; i < s.Len(); i++ {
        ret[i] = s.Index(i).Interface()
    }
    return ret
}

関数のテスト

念のため、全部の関数の機能がすべて想定通りかテストをしておく

以下では、makeStringSlicesで試験データを用意して、それを用意した関数に与えて、makeInterfaceSlicesと同じかどうかを比較する

ちなみにSssはStringSlices、IssはInterfaceSlicesの略

func makeStringSlices(n, m int) [][]string {
    var ress [][]string
    for i := 0; i < n; i++ {
        var res []string
        for j := 0; j < m; j++ {
            res = append(res, fmt.Sprintf("%d", j))
        }
        ress = append(ress, res)
    }
    return ress
}

func makeInterfaceSlices(n, m int) [][]interface{} {
    var ress [][]interface{}
    for i := 0; i < n; i++ {
        var res []interface{}
        for j := 0; j < m; j++ {
            res = append(res, fmt.Sprintf("%d", j))
        }
        ress = append(ress, res)
    }
    return ress
}

func TestConvertStringSlicesToInterfaceSlices(t *testing.T) {
    inputSss := makeStringSlices(3, 3)
    wantIss := makeInterfaceSlices(3, 3)
    t.Log("inputSss:", inputSss)
    t.Log("wantIss:", wantIss)
    if !reflect.DeepEqual(convertStringSlicesToInterfaceSlices1(inputSss), wantIss) {
        t.Error("failed to convertStringSlicesToInterfaceSlices1")
    }
    if !reflect.DeepEqual(convertStringSlicesToInterfaceSlices2(inputSss), wantIss) {
        t.Error("failed to convertStringSlicesToInterfaceSlices2")
    }
    if !reflect.DeepEqual(convertStringSlicesToInterfaceSlices3(inputSss), wantIss) {
        t.Error("failed to convertStringSlicesToInterfaceSlices3")
    }
    if !reflect.DeepEqual(convertStringSlicesToInterfaceSlices4(inputSss), wantIss) {
        t.Error("failed to convertStringSlicesToInterfaceSlices4")
    }
}

test実行結果 - inputSssとwantIssが同じことが確認できた

$go test -v sample/sample_test.go --run TestConvertStringSlicesToInterfaceSlices              
=== RUN   TestConvertStringSlicesToInterfaceSlices
--- PASS: TestConvertStringSlicesToInterfaceSlices (0.00s)
    sample_test.go:93: inputSss: [[0 1 2] [0 1 2] [0 1 2]]
    sample_test.go:94: wantIss: [[0 1 2] [0 1 2] [0 1 2]]
PASS
ok      command-line-arguments  0.001s

Benchmarkを計測

各関数の働きがすべて同じだということがわかったので、性能を確認する

以下で、var result interface{}を定義して、result = resと代入しているのには理由があって、for文でループをいっぱい回しても、forの外に影響を及ぼさないとコンパイラがfor文の中身を無視してしまってBenchmarkが想定通りに測れないという話を聞いたのでそれを参考にしている

参考：How to write benchmarks in Go | Dave Cheney

var N = 10000
var result [][]interface{}

func BenchmarkTestConvertSlices1(b *testing.B) {
    inputSss := makeStringSlices(10, 10)
    b.ResetTimer()
    var res [][]interface{}
    for i := 0; i < N; i++ {
        res = convertStringSlicesToInterfaceSlices1(inputSss)
    }
    result = res
}

func BenchmarkTestConvertSlices2(b *testing.B) {
    inputSss := makeStringSlices(10, 10)
    b.ResetTimer()
    var res [][]interface{}
    for i := 0; i < N; i++ {
        res = convertStringSlicesToInterfaceSlices2(inputSss)
    }
    result = res
}

func BenchmarkTestConvertSlices3(b *testing.B) {
    inputSss := makeStringSlices(10, 10)
    b.ResetTimer()
    var res [][]interface{}
    for i := 0; i < N; i++ {
        res = convertStringSlicesToInterfaceSlices3(inputSss)
    }
    result = res
}

func BenchmarkTestConvertSlices4(b *testing.B) {
    inputSss := makeStringSlices(10, 10)
    b.ResetTimer()
    var res [][]interface{}
    for i := 0; i < N; i++ {
        res = convertStringSlicesToInterfaceSlices4(inputSss)
    }
    result = res
}

count=3で各Benchmarkを３回ずつ実行した結果は以下の通り

[~/go/src/github.com/ludwig125_gosample] $go test -benchmem -run=^$ github.com/ludwig125_gosample/convertSlice -bench . --count=3
goos: linux
goarch: amd64
pkg: github.com/ludwig125_gosample/convertSlice
BenchmarkTestConvertSlices1     1000000000               0.0644 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices1     1000000000               0.0650 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices1     1000000000               0.0720 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices2     1000000000               0.0385 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices2     1000000000               0.0363 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices2     1000000000               0.0358 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices3     1000000000               0.0342 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices3     1000000000               0.0349 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices3     1000000000               0.0435 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices4     1000000000               0.0629 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices4     1000000000               0.0662 ns/op          0 B/op          0 allocs/op
BenchmarkTestConvertSlices4     1000000000               0.0655 ns/op          0 B/op          0 allocs/op
PASS
ok      github.com/ludwig125_gosample/convertSlice      5.747s
[~/go/src/github.com/ludwig125_gosample] $

結果

sliceの容量を確保してからappendするBenchmarkTestConvertSlices2か、長さを確保してから要素に代入するBenchmarkTestConvertSlices3が速いことが確認できた
reflectは遅いので使わないほうがいいと再認識した
実験につかった関数名が長くて後悔した

【題材３】スライスの前方に追加する（Prepend）

スライスの後ろに要素を追加するAppendと異なり、スライスの前方に要素を追加するPrependはちょっと難しい

簡単な例はこれ

const size = 32

func prependSimple() {
    data := make([]int, 0, size)
    for i := 0; i < size; i++ {
        data = append([]int{i}, data...)
    }
    // data: [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
}

ただこれはBenchmarkを取ってみると毎回新しいスライスを作るため性能が悪いことがわかる

ちょっとわかりにくいけど、以下の方がはるかに速い

func prependWithCopy() {
    data := make([]int, 0, size)
    for i := 0; i < size; i++ {
        data = append(data, 0)
        copy(data[1:], data)
        data[0] = i
    }
    result = data
}

copy関数はbuildinのもので、copy(b, a) でスライスaをスライスbにコピーすることができる

Benchmark測定

注意点

グローバルのresult変数に代入することで関数を実行すると外部に影響を与えるようにできる（side effectと呼ぶ）
side effectがない関数は、実行時にコンパイラが判断してfor文for i := 0; i < b.N; i++ { }内の処理を無視してしまって正しくベンチマークを測定できなくなってしまう
- https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html#watch_out_for_compiler_optimisations が知る限り最も詳しい説明
b.Nを指定することで、性能を測定する上で十分だと判断されるまで自動で計測回数を増やしてくれる
- https://golang.org/pkg/testing/#hdr-Benchmarks

The benchmark function must run the target code b.N times. During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably.

package main

import (
    "fmt"
    "testing"
)

var result []int

const size = 32

func prependSimple() {
    data := make([]int, 0, size)
    for i := 0; i < size; i++ {
        data = append([]int{i}, data...)
    }
    result = data
}

func prependWithCopy() {
    data := make([]int, 0, size)
    for i := 0; i < size; i++ {
        data = append(data, 0)
        copy(data[1:], data)
        data[0] = i
    }
    result = data
}

func TestPrependSimple(t *testing.T) {
    prependSimple()
    fmt.Println("prependSimple:  ", result)
}

func TestPrependWithCopy(t *testing.T) {
    prependWithCopy()
    fmt.Println("prependWithCopy:", result)
}
func BenchmarkPrependSimple(b *testing.B) {
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        prependSimple()
    }
}

func BenchmarkPrependWithCopy(b *testing.B) {
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        prependWithCopy()
    }
}

実行結果

$go test -bench . -benchmem -count=4
prependSimple:   [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
prependWithCopy: [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
prependSimple:   [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
prependWithCopy: [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
prependSimple:   [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
prependWithCopy: [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
prependSimple:   [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
prependWithCopy: [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
goos: linux
goarch: amd64
BenchmarkPrependSimple-8          453279              2575 ns/op            4848 B/op         64 allocs/op
BenchmarkPrependSimple-8          408256              2610 ns/op            4848 B/op         64 allocs/op
BenchmarkPrependSimple-8          454341              3956 ns/op            4848 B/op         64 allocs/op
BenchmarkPrependSimple-8          441591              2704 ns/op            4848 B/op         64 allocs/op
BenchmarkPrependWithCopy-8       3420060               331 ns/op             256 B/op          1 allocs/op
BenchmarkPrependWithCopy-8       3519646               332 ns/op             256 B/op          1 allocs/op
BenchmarkPrependWithCopy-8       3484824               332 ns/op             256 B/op          1 allocs/op
BenchmarkPrependWithCopy-8       3457867               333 ns/op             256 B/op          1 allocs/op
PASS

copyを使った関数の方がはるかに性能がいいことがわかる

１ループごとの所要時間は 3000 ns/op -> 300 ns/opと約10倍の速さ
アロケーションされるバイト数も5000 B/op 弱-> 250 B/op くらいまで下がっている

ludwig125のブログ

頑張りすぎずに頑張る父

go言語のBenchmarkTestメモ

概要