That interval was exactly the same every time it happened - regardless of the
time of day, regardless of the type of requests, and regardless of server
load. It happened just as often with low traffic in the middle of the night
as with peak traffic around mid-day.

Other observations:

- No signs of CPU/RAM saturation at the back-end (no signs of any back-end activity at all during the blackout)
- Instant recovery: the server is immediately back to normal after the blackout (no secondary delays)
- All front-end workers get stuck at the same time, and also recover at exactly the same time
- No irregularities in any of the logs - no errors, no strange requests/responses

So, four things were puzzling me here:
1) always exactly the same length of delay (900sec) - like a preset timeout
2) it never hit any software-side timeout; requests were processed normally,
with proper responses and no errors
3) the delay was independent of current server load and request complexity
4) all four workers got stuck at exactly the same time

---

So, to unmask the problem, I've reduced all software-side timeouts
(connection, harakiri, idle) to values well below those 900sec - in the hope
that if it hit any of those timeouts, it would produce an error message
telling me where to look.
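
In uWSGI terms, that means something along these lines (just a sketch - the
option names are standard uWSGI settings, the values here are only
illustrative, not the exact ones I used):

    [uwsgi]
    # worker-side limit: kill any request that takes longer than 60 seconds
    harakiri = 60
    # timeout for uWSGI's internal socket operations
    socket-timeout = 60
    # timeout for the built-in HTTP router (only relevant if it is used)
    http-timeout = 60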

But it just... doesn't. None of the timeouts is ever hit.

Not even harakiri - because in fact, the worker returns with a result in less
than a second. But /then/ it's stuck, and I can't see why.

The 900sec timeout is most likely tcp_retries2, which is a low-level network
setting that tells Linux to retry sending a packet over a TCP socket 15 times
if no ACK is received within 60sec. That's 15x60=900sec.
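
The current value of that knob can be checked via sysctl/procfs (the stock
kernel default is 15 retries):

    # how many times an unacknowledged segment is retransmitted before
    # the kernel gives up and drops the connection
    sysctl net.ipv4.tcp_retries2
    cat /proc/sys/net/ipv4/tcp_retries2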

But TCP retries would require a connection to be established in the first
place - and equally, for the worker to receive and process the request /and/
produce a response, there must have been a proper incoming request. So the
only place where it could occur is when sending the response, i.e. HTTP WAIT.

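If that theory is right, it should be observable: while a blackout is in
progress, the stuck connections should show an active retransmission timer.
A generic way to check (standard Linux tooling, nothing specific to this
setup):

    # -t TCP sockets, -n numeric output, -o show kernel timer info;
    # a connection stuck in retransmission shows a timer like
    # timer:(on,<time-left>,<retransmit-count>) with the count slowly rising
    ss -tno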

Some additional observations:

- Blackout length is constant (900sec), and doesn't seem to depend on request complexity
- Terminating the uWSGI worker that got stuck first ("kill") resolves the problem, and lets all other workers recover instantly
- Typically occurs after a period of lower traffic, at the moment when traffic increases again
- At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request