Foreword:
Colleagues wrote an API gateway service and asked me to run concurrency and stability tests against it. When people hear "stress test" they usually think of the ab and wrk tools. Apache's ab is a bit underwhelming: although it is also epoll-based, it is single-threaded and cannot saturate the CPU. wrk is a fine tool, built on the ae event loop borrowed from redis, with a multi-threaded mode and lua scripting. But once the load-generation logic gets complicated, lua becomes awkward, especially when third-party modules are involved. Having been a gopher for two or three years, I naturally reached for golang to write the stress test scripts.
This article is still being updated and revised; please refer to the original address http://xiaorui.cc/?p=5577
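For reference, here is a minimal sketch of what such a golang stress-test client can look like. This is an illustration only, not the actual script used for the tests below; the worker count, duration, and target URL are placeholders.

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		workers  = 100                          // number of concurrent goroutines (placeholder)
		duration = 10 * time.Second             // test duration (placeholder)
		target   = "http://127.0.0.1:8080/ping" // target URL (placeholder)
	)

	var total int64
	deadline := time.Now().Add(duration)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := http.Get(target)
				if err != nil {
					continue
				}
				// Drain and close the body so the connection can be reused.
				io.Copy(ioutil.Discard, resp.Body)
				resp.Body.Close()
				atomic.AddInt64(&total, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("requests: %d, qps: %.0f\n", total, float64(total)/duration.Seconds())
}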

While running the stress tests we hit a Go performance problem: whether on the HTTP stress-test client or on the API server, CPU utilization would not go up. No matter how many goroutines you spawn, the CPU never gets saturated. top shows plenty of idle on every core, soft interrupts look fine, there are no errors in the kernel log, the TCP accept and SYN queues show nothing abnormal, and network bandwidth is not the bottleneck.
Note: with 5,000 goroutines or with 10,000 goroutines, CPU behavior is the same under this HTTP stress-test scenario.

The process of analyzing the problem
When we turned off the API forwarding function on the server and kept only the plain web handler, a wrk stress test could drive the server's CPU to full. The stress-test client issues HTTP requests, and the API gateway forwards HTTP requests as well, so they have that in common. Could the bottleneck be in go's net/http client path?
Below is our analysis with go tool pprof. It shows that net/http's Transport takes a surprising amount of time. Transport is just net/http's connection pool, so it ought to be fast. Two methods stand out as relatively expensive: tryPutIdleConn, which puts a connection back into the pool, and roundTrip, which obtains a connection. Let's walk through the net/http Transport source.
First, the data structure of the net/http Transport connection pool. The most striking thing is how many locks it contains.


// xiaorui.cc
type Transport struct {
	idleMu     sync.Mutex
	wantIdle   bool // user has requested to close all idle conns
	idleConn   map[connectMethodKey][]*persistConn
	idleConnCh map[connectMethodKey]chan *persistConn

	reqMu       sync.Mutex
	reqCanceler map[*Request]func()

	altMu    sync.RWMutex
	altProto map[string]RoundTripper // nil or map of URI scheme => RoundTripper

	// Dial obtains a TCP connection, i.e. a net.Conn; just remember that the
	// request is written into it and the response is read back out of it.
	Dial func(network, addr string) (net.Conn, error)
}
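The excerpt above only shows the unexported pool internals; the exported knobs that control this pool (MaxIdleConns, MaxIdleConnsPerHost, IdleConnTimeout, which show up again in tryPutIdleConn further below) are what user code normally tunes when constructing a client. A minimal sketch with purely illustrative numbers, not the configuration used in this article:

package sketch

import (
	"net/http"
	"time"
)

// newTunedClient builds an *http.Client whose Transport pool limits are set
// explicitly rather than left at the defaults.
func newTunedClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        1000,             // cap on all idle connections in the pool
			MaxIdleConnsPerHost: 200,              // cap on idle connections per target host
			IdleConnTimeout:     90 * time.Second, // idle connections are closed after this
		},
		Timeout: 5 * time.Second, // overall per-request timeout
	}
}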

Now back to the source: let's see how golang net/http obtains an available connection from the connection pool. The entry point is the RoundTrip method.


// xiaorui.cc
func (t *Transport) RoundTrip(req *Request) (resp *Response, err error) {
	...
	pconn, err := t.getConn(req, cm)
	if err != nil {
		t.setReqCanceler(req, nil)
		req.closeBody()
		return nil, err
	}
	return pconn.roundTrip(treq)
}
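As an aside: since RoundTrip is the single entry point for every request, wrapping the Transport in a custom RoundTripper is a cheap way to measure from user code how long this path takes, which lines up with what pprof reported above. A sketch, not part of the article's tooling:

package main

import (
	"log"
	"net/http"
	"time"
)

// timedTransport wraps another RoundTripper and logs how long each
// RoundTrip call takes.
type timedTransport struct {
	base http.RoundTripper
}

func (t *timedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.base.RoundTrip(req)
	log.Printf("%s %s took %v (err=%v)", req.Method, req.URL, time.Since(start), err)
	return resp, err
}

func main() {
	client := &http.Client{Transport: &timedTransport{base: http.DefaultTransport}}
	resp, err := client.Get("http://127.0.0.1:8080/ping") // placeholder URL
	if err == nil {
		resp.Body.Close()
	}
}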

Transport.RoundTrip first calls getConn to obtain a connection and then calls persistConn's roundTrip method, which selects over several channels. getConn looks like this:


// xiaorui.cc
func (t *Transport) getConn(req *Request, cm connectMethod) (*persistConn, error) {
	if pc := t.getIdleConn(cm); pc != nil {
		t.setReqCanceler(req, func() {})
		return pc, nil
	}

	type dialRes struct {
		pc  *persistConn
		err error
	}
	dialc := make(chan dialRes)

	prePendingDial := prePendingDial
	postPendingDial := postPendingDial
	....

	cancelc := make(chan struct{})
	t.setReqCanceler(req, func() { close(cancelc) })

	// Start a goroutine that calls dialConn to obtain a persistConn,
	// then sends it into the dialc channel created above.
	go func() {
		pc, err := t.dialConn(cm)
		dialc <- dialRes{pc, err}
	}()

	idleConnCh := t.getIdleConnCh(cm)
	select {
	case v := <-dialc:
		// Our own dial finished first; the connection arrived through dialc.
		return v.pc, v.err
	case pc := <-idleConnCh:
		// Another http request finished and returned its persistConn,
		// which arrives here through the idleConnCh channel.
		handlePendingDial()
		return pc, nil
	case <-req.Cancel:
		handlePendingDial()
		return nil, errors.New("net/http: request canceled while waiting for connection")
	case <-cancelc:
		handlePendingDial()
		return nil, errors.New("net/http: request canceled while waiting for connection")
	}
}
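One way to watch this race between the fresh dial (dialc) and a returned idle connection (idleConnCh) from the outside is net/http/httptrace, which reports whether a request got a reused connection. A sketch; the URL is a placeholder:

package main

import (
	"log"
	"net/http"
	"net/http/httptrace"
)

func main() {
	req, err := http.NewRequest("GET", "http://127.0.0.1:8080/ping", nil)
	if err != nil {
		log.Fatal(err)
	}

	// GotConn fires once the request has obtained a connection, telling us
	// whether it was reused from the idle pool and how long it sat idle.
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("reused=%v wasIdle=%v idleTime=%v",
				info.Reused, info.WasIdle, info.IdleTime)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}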

Finally, let's look at tryPutIdleConn, the method that returns a connection. When a request finishes, various conditions decide whether the connection is put back into the idle pool or closed directly.


// xiaorui.cc
func (t *Transport) tryPutIdleConn(pconn *persistConn) error {
	...
	t.idleMu.Lock()
	defer t.idleMu.Unlock()

	waitingDialer := t.idleConnCh[key]
	select {
	case waitingDialer <- pconn:
		return nil
	default:
		if waitingDialer != nil {
			delete(t.idleConnCh, key)
		}
	}
	if t.wantIdle {
		return errWantIdle
	}
	if t.idleConn == nil {
		t.idleConn = make(map[connectMethodKey][]*persistConn)
	}
	idles := t.idleConn[key]
	if len(idles) >= t.maxIdleConnsPerHost() {
		return errTooManyIdleHost
	}
	for _, exist := range idles {
		if exist == pconn {
			log.Fatalf("dup idle pconn %p in freelist", pconn)
		}
	}
	t.idleConn[key] = append(idles, pconn)
	t.idleLRU.add(pconn)
	if t.MaxIdleConns != 0 && t.idleLRU.len() > t.MaxIdleConns {
		oldest := t.idleLRU.removeOldest()
		oldest.close(errTooManyIdle)
		t.removeIdleConnLocked(oldest)
	}

	if t.IdleConnTimeout > 0 {
		if pconn.idleTimer != nil {
			pconn.idleTimer.Reset(t.IdleConnTimeout)
		} else {
			pconn.idleTimer = time.AfterFunc(t.IdleConnTimeout, pconn.closeConnIfStillIdle)
		}
	}
	pconn.idleAt = time.Now()
	return nil
}
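Note that tryPutIdleConn only gets a chance to hand the connection back if the caller has finished with the response body: if the body is not read to the end and closed, the connection generally cannot be reused and the next request dials a new one. A sketch of the usual read-drain-close pattern (client and url are assumed to be provided by the caller):

package sketch

import (
	"io"
	"io/ioutil"
	"net/http"
)

// fetchAndDiscard issues a GET and fully drains the body so the underlying
// connection can go back into the Transport's idle pool.
func fetchAndDiscard(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(ioutil.Discard, resp.Body)
	return err
}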

Why doesn't CPU utilization go up?
In the syscall statistics, futex and pselect6 stood out. futex is the syscall behind locks, and pselect6 is used here as a high-precision sleep, down to microseconds or nanoseconds. And however precise the sleep, it still parks the calling thread.
Reading the net/http Transport source, we found shared channels and mutexes everywhere, and a channel itself contains a lock internally. I have written before about the problems caused by golang lock contention: on the one hand it generates a lot of syscalls, on the other hand it leaves the CPU unsaturated and utilization low.
As for why the CPU is not saturated: threads that are sleeping are not running on the CPU, and because nothing triggers handoffp, no extra threads are started; the existing threads are sitting in the pselect6 and futex syscalls. Here is the relevant runtime code.
Note: a friend asked me why sysmon does not retake() while the runtime is sleeping like this. The sysmon code only issues a preemption when more than 10ms has passed, and only then do handoffp and startm kick in. A futexsleep only lasts a few microseconds, so it never triggers preemptive scheduling; when the lock still cannot be acquired after several rounds of the spin loop, the thread simply yields.


// xiaorui.cc
func lock(l *mutex) {
	...
	// wait is either MUTEX_LOCKED or MUTEX_SLEEPING
	// depending on whether there is a thread sleeping
	// on this mutex. If we ever change l->key from
	// MUTEX_SLEEPING to some other value, we must be
	// careful to change it back to MUTEX_SLEEPING before
	// returning, to ensure that the sleeping thread gets
	// its wakeup call.
	wait := v

	for {
		// Try for lock, spinning.
		for i := 0; i < spin; i++ {
			for l.key == mutex_unlocked {
				if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
					return
				}
			}
			procyield(active_spin_cnt)
		}
		...
		futexsleep(key32(&l.key), mutex_sleeping, -1)
	}
}

// xiaorui.cc
func futexsleep(addr *uint32, val uint32, ns int64) {
	var ts timespec

	// Some Linux kernels have a bug where futex of
	// FUTEX_WAIT returns an internal error code
	// as an errno. Libpthread ignores the return value
	// here, and so can we: as it says a few lines up,
	// spurious wakeups are allowed.
	if ns < 0 {
		futex(unsafe.Pointer(addr), _FUTEX_WAIT_PRIVATE, val, nil, nil, 0)
		return
	}

	if sys.PtrSize == 8 {
		ts.set_sec(ns / 1000000000)
		ts.set_nsec(int32(ns % 1000000000))
	} else {
		ts.tv_nsec = 0
		ts.set_sec(int64(timediv(ns, 1000000000, (*int32)(unsafe.Pointer(&ts.tv_nsec)))))
	}
	futex(unsafe.Pointer(addr), _FUTEX_WAIT_PRIVATE, val, unsafe.Pointer(&ts), nil, 0)
}
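The runtime-internal locks above (the same kind that protects every channel) do not appear in Go's mutex profile, but contention on sync.Mutex values such as the Transport's idleMu does. A sketch of turning that profile on, assuming the process exposes the pprof HTTP endpoints; the address is a placeholder, and a fraction of 1 records every contention event, so it is for debugging only:

package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers
	"runtime"
)

func main() {
	// Sample every mutex contention event (expensive; debugging only).
	runtime.SetMutexProfileFraction(1)

	go func() {
		// Expose the pprof endpoints on a local port.
		_ = http.ListenAndServe("127.0.0.1:6060", nil)
	}()

	// Block forever in this sketch; a real service or stress client
	// would do its work here instead.
	select {}
}

Then go tool pprof http://127.0.0.1:6060/debug/pprof/mutex shows which locks the goroutines are fighting over.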

How do we fix go net/http not saturating the CPU?
The root cause is lock contention, so how do we reduce the lock contention inside net/http? The fix is simply to open several net/http Transport connection pools instead of one, and then pick between them with a polling algorithm. Do not put a lock inside that polling algorithm! Adding a lock just creates new contention. The same tuning idea applies to both the stress-test client and the api gateway.
So what are the downsides of multiple Transport pools? The total number of connections grows noticeably. Also, during the warm-up phase there are fresh connections and TCP three-way handshakes, so the first requests are slightly slower; after that everything is fine. In addition, the HTTP connections also take part in TCP keep-alive probing, but that interaction happens in the kernel and the upper layer does not need to care.


// xiaorui.cc
var (
	clientList = []*http.Client{}
	...
)

// newClient (not shown here) builds an *http.Client with its own Transport.
func makeClientList(count int) []*http.Client {
	clientList := make([]*http.Client, count, count)
	for index := 0; index < count; index++ {
		clientList[index] = newClient()
	}
	return clientList
}

// getClient picks one of the clients at random, spreading requests
// over several Transport pools.
func getClient() *http.Client {
	return clientList[rand.Int()%len(clientList)]
}
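One detail worth noting: rand.Int() from math/rand takes a mutex on the shared global source internally (at least in the Go versions of that era), so the selection above is not completely lock-free. A variant that polls with an atomic counter avoids even that lock; this is a sketch, and clientList is the slice built by makeClientList above:

package sketch

import (
	"net/http"
	"sync/atomic"
)

var (
	clientList []*http.Client // filled by makeClientList at startup
	clientIdx  uint64
)

// getClientRR round-robins over clientList with an atomic counter,
// so picking a client takes no lock at all.
func getClientRR() *http.Client {
	n := atomic.AddUint64(&clientIdx, 1)
	return clientList[n%uint64(len(clientList))]
}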

Looking at the client's CPU with multiple Transports, utilization clearly comes up, and QPS throughput reaches about 60,000.

Looking again at the go pprof CPU profile, the time spent in Transport has dropped a lot, both in roundTrip (getting a connection) and in tryPutIdleConn (returning one).
In the flame graph, parts of net/http other than Transport now stand out; readLoop and writeLoop, for example, take more time. Reading the source of those two methods, channels are flying everywhere, but I have not optimized them for now: they are the core read/write loops of net/http, and that CPU cost is acceptable. There is also the cost of io/ioutil.ReadAll, which keeps growing slices through makeSlice and therefore adds GC pressure; a sync.Pool buffer pool can be added later, as sketched below.
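A sketch of that sync.Pool idea: reuse a bytes.Buffer per read instead of letting ioutil.ReadAll grow a fresh slice for every response. This is illustrative; how much it helps depends on how the body bytes are consumed afterwards.

package sketch

import (
	"bytes"
	"net/http"
	"sync"
)

var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// readBody drains resp.Body into a pooled buffer and returns a copy of the
// bytes; the buffer itself goes back into the pool for the next request.
func readBody(resp *http.Response) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	if _, err := buf.ReadFrom(resp.Body); err != nil {
		return nil, err
	}
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}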
How to analyze the performance bottleneck of a golang service?
Use pprof to look at the flame graph and the CPU time statistics, find the suspicious functions, and then read the source of the relevant library directly. When you see a lot of futex and pselect6 calls, consider whether there is lock contention.
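For a short-lived stress client, the CPU profile can also be written straight to a file with runtime/pprof and inspected with go tool pprof afterwards; counting syscalls such as futex and pselect6 is easiest with strace -c -f -p <pid> on the running process. A minimal sketch of the file-based profile, with a placeholder file name and workload:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.pprof") // output file (placeholder name)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runLoad() // placeholder: whatever load the stress client generates
}

func runLoad() {}

go tool pprof cpu.pprof then gives the same top / cumulative view used above.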
To sum up:
This is the third time I have run into a problem where golang lock scheduling kept CPU utilization from rising. At first I thought it was a bottleneck in golang's goroutine scheduler. Last year, while writing a CDN service gateway, I hit a similarly strange problem: CPU utilization would not go up, but sys time in top was relatively high. Sampling with strace showed a high number of futex calls, and the cause turned out to be contention on a map lock, which was fixed by switching to a sharded (segmented) map lock.
