Skip to content

Consumer is executing message infinitely if for some reason it handles a message it's not the owner #47

@Soulou

Description

@Soulou

From time to time, we're experiencing some weird behavior with our NSQ consumers. It might be related to load pike, we haven't found a way to reproduce it 100% sure, but the issue appears once every week at least in our infrastructure.

2018-05-28 10:18:33.363289812 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - starting
2018-05-28 10:18:33.383662927 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - done  - 0.023424769s
2018-05-28 10:19:16.799691171 +0200 CEST [nsqworker-2] W, [2018-05-28T08:19:16.793795 #15]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 10:19:17.069883413 +0200 CEST [nsqworker-2] E, [2018-05-28T08:19:17.065925 #15] ERROR -- : [host: <host1> port: 4150]  Error received: E_FIN_FAILED FIN 09b9e490041f8000 failed client does not own message

So we can see here that between the execution and the FIN message, an error occurs concerning the NSQd connection, then the FIN fails.

It's what's happening in a loop, as after this, the message (with the same ID) is executed again by the process, and fails to FIN again, etc.

When I look at the logging history about <host1>, I get this:

2018-05-28 05:19:58.395189938 +0200 CEST [nsqworker-2] W, [2018-05-28T03:19:58.386943 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:20:58.460665543 +0200 CEST [nsqworker-2] W, [2018-05-28T03:20:58.457603 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:21:58.546145107 +0200 CEST [nsqworker-2] W, [2018-05-28T03:21:58.545515 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:22:58.625993464 +0200 CEST [nsqworker-2] W, [2018-05-28T03:22:58.621311 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:23:58.741414566 +0200 CEST [nsqworker-2] W, [2018-05-28T03:23:58.733896 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:24:58.835791494 +0200 CEST [nsqworker-2] W, [2018-05-28T03:24:58.829582 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:25:58.932305585 +0200 CEST [nsqworker-2] W, [2018-05-28T03:25:58.921822 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:26:59.021718841 +0200 CEST [nsqworker-2] W, [2018-05-28T03:26:59.017355 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:27:59.121631901 +0200 CEST [nsqworker-2] W, [2018-05-28T03:27:59.113451 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:28:59.221418436 +0200 CEST [nsqworker-2] W, [2018-05-28T03:28:59.209866 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:29:59.330347765 +0200 CEST [nsqworker-2] W, [2018-05-28T03:29:59.327889 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:30:59.410571465 +0200 CEST [nsqworker-2] W, [2018-05-28T03:30:59.409541 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:31:59.545008079 +0200 CEST [nsqworker-2] W, [2018-05-28T03:31:59.543487 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:32:59.773488847 +0200 CEST [nsqworker-2] W, [2018-05-28T03:32:59.770228 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:33:59.877373181 +0200 CEST [nsqworker-2] W, [2018-05-28T03:33:59.871865 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket

So as you can see it's like if an old broken connection was kept in the pool.

On the nsqd part I have:

[nsqd] 2018/05/28 03:26:59.033853 [10.127.0.116:56708] IDENTIFY: {ShortID: LongID: ClientID:nsqworker-2 Hostname:nsqworker-2 HeartbeatInterval:30000 OutputBufferSize:0 OutputBufferTimeout:250 FeatureNegotiation:true TLSv1:true Deflate:false DeflateLevel:0 Snappy:false SampleRate:0 UserAgent:nsq-ruby/2.2.0 MsgTimeout:60000}

[nsqd] 2018/05/28 03:27:59.101820 PROTOCOL(V2): [10.127.0.116:56708] exiting ioloop
[nsqd] 2018/05/28 03:27:59.101925 ERROR: client(10.127.0.116:56708) - failed to read command - read tcp 172.17.0.3:4150->10.127.0.116:56708: i/o timeout
[nsqd] 2018/05/28 03:27:59.101940 PROTOCOL(V2): [10.127.0.116:56708] exiting messagePump

Is the timeout coming from the consumer not responding? But whatever is the issue, I don't sure that executing a message in a loop is a good behavior.

Thanks a lot for your insights

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions