Consumer is executing message infinitely if for some reason it handles a message it's not the owner

From time to time, we're experiencing some weird behavior with our NSQ consumers. It might be related to load pike, we haven't found a way to reproduce it 100% sure, but the issue appears once every week at least in our infrastructure.

```
2018-05-28 10:18:33.363289812 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - starting
2018-05-28 10:18:33.383662927 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - done  - 0.023424769s
2018-05-28 10:19:16.799691171 +0200 CEST [nsqworker-2] W, [2018-05-28T08:19:16.793795 #15]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 10:19:17.069883413 +0200 CEST [nsqworker-2] E, [2018-05-28T08:19:17.065925 #15] ERROR -- : [host: <host1> port: 4150]  Error received: E_FIN_FAILED FIN 09b9e490041f8000 failed client does not own message
```

So we can see here that between the execution and the `FIN` message, an error occurs concerning the NSQd connection, then the `FIN` fails.

It's what's happening in a loop, as after this, the message  (with the same ID) is executed again by the process, and fails to `FIN` again, etc.

When I look at the logging history about `<host1>`, I get this:

```
2018-05-28 05:19:58.395189938 +0200 CEST [nsqworker-2] W, [2018-05-28T03:19:58.386943 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:20:58.460665543 +0200 CEST [nsqworker-2] W, [2018-05-28T03:20:58.457603 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:21:58.546145107 +0200 CEST [nsqworker-2] W, [2018-05-28T03:21:58.545515 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:22:58.625993464 +0200 CEST [nsqworker-2] W, [2018-05-28T03:22:58.621311 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:23:58.741414566 +0200 CEST [nsqworker-2] W, [2018-05-28T03:23:58.733896 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:24:58.835791494 +0200 CEST [nsqworker-2] W, [2018-05-28T03:24:58.829582 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:25:58.932305585 +0200 CEST [nsqworker-2] W, [2018-05-28T03:25:58.921822 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:26:59.021718841 +0200 CEST [nsqworker-2] W, [2018-05-28T03:26:59.017355 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:27:59.121631901 +0200 CEST [nsqworker-2] W, [2018-05-28T03:27:59.113451 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:28:59.221418436 +0200 CEST [nsqworker-2] W, [2018-05-28T03:28:59.209866 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:29:59.330347765 +0200 CEST [nsqworker-2] W, [2018-05-28T03:29:59.327889 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:30:59.410571465 +0200 CEST [nsqworker-2] W, [2018-05-28T03:30:59.409541 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:31:59.545008079 +0200 CEST [nsqworker-2] W, [2018-05-28T03:31:59.543487 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:32:59.773488847 +0200 CEST [nsqworker-2] W, [2018-05-28T03:32:59.770228 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:33:59.877373181 +0200 CEST [nsqworker-2] W, [2018-05-28T03:33:59.871865 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
```

So as you can see it's like if an old broken connection was kept in the pool.

On the `nsqd` part I have:

```
[nsqd] 2018/05/28 03:26:59.033853 [10.127.0.116:56708] IDENTIFY: {ShortID: LongID: ClientID:nsqworker-2 Hostname:nsqworker-2 HeartbeatInterval:30000 OutputBufferSize:0 OutputBufferTimeout:250 FeatureNegotiation:true TLSv1:true Deflate:false DeflateLevel:0 Snappy:false SampleRate:0 UserAgent:nsq-ruby/2.2.0 MsgTimeout:60000}

[nsqd] 2018/05/28 03:27:59.101820 PROTOCOL(V2): [10.127.0.116:56708] exiting ioloop
[nsqd] 2018/05/28 03:27:59.101925 ERROR: client(10.127.0.116:56708) - failed to read command - read tcp 172.17.0.3:4150->10.127.0.116:56708: i/o timeout
[nsqd] 2018/05/28 03:27:59.101940 PROTOCOL(V2): [10.127.0.116:56708] exiting messagePump
```

Is the timeout coming from the consumer not responding? But whatever is the issue, I don't sure that executing a message in a loop is a good behavior.

Thanks a lot for your insights

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consumer is executing message infinitely if for some reason it handles a message it's not the owner #47

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Consumer is executing message infinitely if for some reason it handles a message it's not the owner #47

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions