From time to time, we're experiencing some weird behavior with our NSQ consumers. It might be related to load pike, we haven't found a way to reproduce it 100% sure, but the issue appears once every week at least in our infrastructure.
2018-05-28 10:18:33.363289812 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - starting
2018-05-28 10:18:33.383662927 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - done - 0.023424769s
2018-05-28 10:19:16.799691171 +0200 CEST [nsqworker-2] W, [2018-05-28T08:19:16.793795 #15] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 10:19:17.069883413 +0200 CEST [nsqworker-2] E, [2018-05-28T08:19:17.065925 #15] ERROR -- : [host: <host1> port: 4150] Error received: E_FIN_FAILED FIN 09b9e490041f8000 failed client does not own message
So we can see here that between the execution and the FIN message, an error occurs concerning the NSQd connection, then the FIN fails.
It's what's happening in a loop, as after this, the message (with the same ID) is executed again by the process, and fails to FIN again, etc.
When I look at the logging history about <host1>, I get this:
2018-05-28 05:19:58.395189938 +0200 CEST [nsqworker-2] W, [2018-05-28T03:19:58.386943 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:20:58.460665543 +0200 CEST [nsqworker-2] W, [2018-05-28T03:20:58.457603 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:21:58.546145107 +0200 CEST [nsqworker-2] W, [2018-05-28T03:21:58.545515 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:22:58.625993464 +0200 CEST [nsqworker-2] W, [2018-05-28T03:22:58.621311 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:23:58.741414566 +0200 CEST [nsqworker-2] W, [2018-05-28T03:23:58.733896 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:24:58.835791494 +0200 CEST [nsqworker-2] W, [2018-05-28T03:24:58.829582 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:25:58.932305585 +0200 CEST [nsqworker-2] W, [2018-05-28T03:25:58.921822 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:26:59.021718841 +0200 CEST [nsqworker-2] W, [2018-05-28T03:26:59.017355 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:27:59.121631901 +0200 CEST [nsqworker-2] W, [2018-05-28T03:27:59.113451 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:28:59.221418436 +0200 CEST [nsqworker-2] W, [2018-05-28T03:28:59.209866 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:29:59.330347765 +0200 CEST [nsqworker-2] W, [2018-05-28T03:29:59.327889 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:30:59.410571465 +0200 CEST [nsqworker-2] W, [2018-05-28T03:30:59.409541 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:31:59.545008079 +0200 CEST [nsqworker-2] W, [2018-05-28T03:31:59.543487 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:32:59.773488847 +0200 CEST [nsqworker-2] W, [2018-05-28T03:32:59.770228 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
2018-05-28 05:33:59.877373181 +0200 CEST [nsqworker-2] W, [2018-05-28T03:33:59.871865 #16] WARN -- : [host: <host1> port: 4150] Died from: No data from socket
So as you can see it's like if an old broken connection was kept in the pool.
On the nsqd part I have:
[nsqd] 2018/05/28 03:26:59.033853 [10.127.0.116:56708] IDENTIFY: {ShortID: LongID: ClientID:nsqworker-2 Hostname:nsqworker-2 HeartbeatInterval:30000 OutputBufferSize:0 OutputBufferTimeout:250 FeatureNegotiation:true TLSv1:true Deflate:false DeflateLevel:0 Snappy:false SampleRate:0 UserAgent:nsq-ruby/2.2.0 MsgTimeout:60000}
[nsqd] 2018/05/28 03:27:59.101820 PROTOCOL(V2): [10.127.0.116:56708] exiting ioloop
[nsqd] 2018/05/28 03:27:59.101925 ERROR: client(10.127.0.116:56708) - failed to read command - read tcp 172.17.0.3:4150->10.127.0.116:56708: i/o timeout
[nsqd] 2018/05/28 03:27:59.101940 PROTOCOL(V2): [10.127.0.116:56708] exiting messagePump
Is the timeout coming from the consumer not responding? But whatever is the issue, I don't sure that executing a message in a loop is a good behavior.
Thanks a lot for your insights
From time to time, we're experiencing some weird behavior with our NSQ consumers. It might be related to load pike, we haven't found a way to reproduce it 100% sure, but the issue appears once every week at least in our infrastructure.
So we can see here that between the execution and the
FINmessage, an error occurs concerning the NSQd connection, then theFINfails.It's what's happening in a loop, as after this, the message (with the same ID) is executed again by the process, and fails to
FINagain, etc.When I look at the logging history about
<host1>, I get this:So as you can see it's like if an old broken connection was kept in the pool.
On the
nsqdpart I have:Is the timeout coming from the consumer not responding? But whatever is the issue, I don't sure that executing a message in a loop is a good behavior.
Thanks a lot for your insights