Discussion:
ATS slow after ERR_CONNECT_FAIL
Veiko Kukk
2018-03-02 15:18:01 UTC
Permalink
Hi,

ATS 7.1.2, CentOS 7.4.

We have regular requests via ATS in reverse mode to get response time
statistics and decide whether source is available. This request is excluded
from caching, so it always results in TCP_MISS/200 in normal conditions
with response time fluctuation depending on the distance of ATS server from
source.

Sometimes there are network errors that get logged as ERR_CONNECT_FAIL/502,
after which there are again TCP_MISS/200 but with very high latency!
This happens on all ATS servers we have deployed.

For example, when normally before network error response times fluctuate
around 150ms, then after they fluctuate around 1300 ms. After restarting
ATS, response times are back to normal.

Response times from log:
168
12979
2755
117256
56442
40104
1043
30955
30972
18418
26542
4172
8587
59259
4674
37166
16123
67019
41723
3497
6957
18684
17663
14634
20036
1305
14815
3526
10542
62519
22025
40556
36821
2342
2644
3695
7059
45581
1706
30947
16001
136383
1345
72770
1087
1311
1074
1080
1101
1033
1045
1057
1109
1078
1104
1059
1066
1062
1143
1074
1095
1207
1059
1067
1101
1110
1069
1394
1278
1323
1319
1315
1317
1317
1343
1270
1284
1362
1446
1363
1302
1321
1330
1317
1335
1346
1312

and after restart:

593
362
129
381
361
380
394
394
117
384
341
112
88
363
397
93
385
356
371
360
411
120
338
497
449
377
131
458
161
372
112
364
87
129
340
93
114
338
377
454
424
381
381
348
129
336
116
404
108
340
114

Best regards,
Veiko
Veiko Kukk
2018-03-06 13:14:19 UTC
Permalink
Hi again,

This list seems really quiet.
Meanwhile, I've enabled slow log which brings in strange results that I
can't explain. Maybe somebody else can explain it.

status: 200 unique id: redirection_tries: 0 bytes: 20 fd: 0 client state:
0 server state: 9 ua_begin: 0.000 ua_first_read: 0.000 ua_read_header_done:
0.000 cache_open_read_begin: 0.006 cache_open_read_end: 0.920
dns_lookup_begin: 0.920 dns_lookup_end: 0.920 server_connect: 4.204
server_connect_end: 4.204 server_first_read: 43.955
server_read_header_done: 43.955 server_close: 43.957 ua_write: 43.957
ua_close: 43.957 sm_finish: 43.957 plugin_active: 0.000 plugin_total: 0.000
status: 200 unique id: redirection_tries: 0 bytes: 20 fd: 0 client state:
0 server state: 9 ua_begin: 0.000 ua_first_read: 0.000 ua_read_header_done:
0.000 cache_open_read_begin: 0.000 cache_open_read_end: 0.000
dns_lookup_begin: 0.000 dns_lookup_end: 0.000 server_connect: 2.384
server_connect_end: 2.384 server_first_read: 76.550
server_read_header_done: 76.550 server_close: 76.553 ua_write: 76.553
ua_close: 76.553 sm_finish: 76.553 plugin_active: 0.000 plugin_total: 0.000

1) Why there is cache read when cache.config has url_regex=that_url_regex
action=never-cache If the url matches never cache action, why is it looked
up at all in the beginning?
2) What's going on between dns_lookup_end and server_connect? Where is the
time spent? What timeout setting should apply to server_connect? Shouldn't
it be proxy.config.http.connect_attempts_timeout which is set to 2 seconds.
3) What timeout applies to time between server_connect_end
and server_first_read? Isn't
it proxy.config.http.transaction_no_activity_timeout_out which on our case
is set to 6 seconds.

Current config timeouts:

proxy.config.http.transaction_no_activity_timeout_out: 6
proxy.config.http.connect_attempts_timeout: 2
proxy.config.lm.pserver_timeout_secs: 1
proxy.config.lm.pserver_timeout_msecs: 0
proxy.config.cluster.peer_timeout: 30
proxy.config.cluster.mc_poll_timeout: 5
proxy.config.cluster.startup_timeout: 10
proxy.config.process_manager.timeout: 5
proxy.config.vmap.down_up_timeout: 10
proxy.config.http.congestion_control.default.live_os_conn_timeout: 60
proxy.config.http.congestion_control.default.dead_os_conn_timeout: 15
proxy.config.http.parent_proxy.connect_attempts_timeout: 30
proxy.config.http.keep_alive_no_activity_timeout_in: 120
proxy.config.http.keep_alive_no_activity_timeout_out: 120
proxy.config.websocket.no_activity_timeout: 600
proxy.config.websocket.active_timeout: 3600
proxy.config.http.transaction_no_activity_timeout_in: 30
proxy.config.http.transaction_active_timeout_in: 900
proxy.config.http.transaction_active_timeout_out: 0
proxy.config.http.accept_no_activity_timeout: 120
proxy.config.http.background_fill_active_timeout: 0
proxy.config.http.post_connect_attempts_timeout: 1800
proxy.config.socks.socks_timeout: 100
proxy.config.socks.server_connect_timeout: 10
proxy.config.socks.server_retry_timeout: 300
proxy.config.net.poll_timeout: 10
proxy.config.net.default_inactivity_timeout: 86400
proxy.config.dns.lookup_timeout: 20
proxy.config.hostdb.lookup_timeout: 30
proxy.config.hostdb.timeout: 86400
proxy.config.hostdb.fail.timeout: 0
proxy.config.log.collation_host_timeout: 86390
proxy.config.log.collation_client_timeout: 86400
proxy.config.ssl.session_cache.timeout: 0
proxy.config.ssl.handshake_timeout_in: 0
proxy.config.ssl.ocsp.cache_timeout: 3600
proxy.config.ssl.ocsp.request_timeout: 10
proxy.config.http2.accept_no_activity_timeout: 120
proxy.config.http2.no_activity_timeout_in: 120
proxy.config.http2.active_timeout_in: 900

Best regards,
Veiko
Post by Veiko Kukk
Hi,
ATS 7.1.2, CentOS 7.4.
We have regular requests via ATS in reverse mode to get response time
statistics and decide whether source is available. This request is excluded
from caching, so it always results in TCP_MISS/200 in normal conditions
with response time fluctuation depending on the distance of ATS server from
source.
Sometimes there are network errors that get logged
as ERR_CONNECT_FAIL/502, after which there are again TCP_MISS/200 but with
very high latency!
This happens on all ATS servers we have deployed.
For example, when normally before network error response times fluctuate
around 150ms, then after they fluctuate around 1300 ms. After restarting
ATS, response times are back to normal.
168
12979
2755
117256
56442
40104
1043
30955
30972
18418
26542
4172
8587
59259
4674
37166
16123
67019
41723
3497
6957
18684
17663
14634
20036
1305
14815
3526
10542
62519
22025
40556
36821
2342
2644
3695
7059
45581
1706
30947
16001
136383
1345
72770
1087
1311
1074
1080
1101
1033
1045
1057
1109
1078
1104
1059
1066
1062
1143
1074
1095
1207
1059
1067
1101
1110
1069
1394
1278
1323
1319
1315
1317
1317
1343
1270
1284
1362
1446
1363
1302
1321
1330
1317
1335
1346
1312
593
362
129
381
361
380
394
394
117
384
341
112
88
363
397
93
385
356
371
360
411
120
338
497
449
377
131
458
161
372
112
364
87
129
340
93
114
338
377
454
424
381
381
348
129
336
116
404
108
340
114
Best regards,
Veiko
Steven Hunter
2018-03-12 23:10:47 UTC
Permalink
exit any private browsing
Post by Veiko Kukk
Hi again,
This list seems really quiet.
Meanwhile, I've enabled slow log which brings in strange results that I
can't explain. Maybe somebody else can explain it.
0.000 cache_open_read_begin: 0.006 cache_open_read_end: 0.920
dns_lookup_begin: 0.920 dns_lookup_end: 0.920 server_connect: 4.204
server_connect_end: 4.204 server_first_read: 43.955
server_read_header_done: 43.955 server_close: 43.957 ua_write: 43.957
ua_close: 43.957 sm_finish: 43.957 plugin_active: 0.000 plugin_total: 0.000
0.000 cache_open_read_begin: 0.000 cache_open_read_end: 0.000
dns_lookup_begin: 0.000 dns_lookup_end: 0.000 server_connect: 2.384
server_connect_end: 2.384 server_first_read: 76.550
server_read_header_done: 76.550 server_close: 76.553 ua_write: 76.553
ua_close: 76.553 sm_finish: 76.553 plugin_active: 0.000 plugin_total: 0.000
1) Why there is cache read when cache.config has url_regex=that_url_regex
action=never-cache If the url matches never cache action, why is it looked
up at all in the beginning?
2) What's going on between dns_lookup_end and server_connect? Where is the
time spent? What timeout setting should apply to server_connect? Shouldn't
it be proxy.config.http.connect_attempts_timeout which is set to 2 seconds.
3) What timeout applies to time between server_connect_end
and server_first_read? Isn't it proxy.config.http.transaction_no_activity_timeout_out
which on our case is set to 6 seconds.
proxy.config.http.transaction_no_activity_timeout_out: 6
proxy.config.http.connect_attempts_timeout: 2
proxy.config.lm.pserver_timeout_secs: 1
proxy.config.lm.pserver_timeout_msecs: 0
proxy.config.cluster.peer_timeout: 30
proxy.config.cluster.mc_poll_timeout: 5
proxy.config.cluster.startup_timeout: 10
proxy.config.process_manager.timeout: 5
proxy.config.vmap.down_up_timeout: 10
proxy.config.http.congestion_control.default.live_os_conn_timeout: 60
proxy.config.http.congestion_control.default.dead_os_conn_timeout: 15
proxy.config.http.parent_proxy.connect_attempts_timeout: 30
proxy.config.http.keep_alive_no_activity_timeout_in: 120
proxy.config.http.keep_alive_no_activity_timeout_out: 120
proxy.config.websocket.no_activity_timeout: 600
proxy.config.websocket.active_timeout: 3600
proxy.config.http.transaction_no_activity_timeout_in: 30
proxy.config.http.transaction_active_timeout_in: 900
proxy.config.http.transaction_active_timeout_out: 0
proxy.config.http.accept_no_activity_timeout: 120
proxy.config.http.background_fill_active_timeout: 0
proxy.config.http.post_connect_attempts_timeout: 1800
proxy.config.socks.socks_timeout: 100
proxy.config.socks.server_connect_timeout: 10
proxy.config.socks.server_retry_timeout: 300
proxy.config.net.poll_timeout: 10
proxy.config.net.default_inactivity_timeout: 86400
proxy.config.dns.lookup_timeout: 20
proxy.config.hostdb.lookup_timeout: 30
proxy.config.hostdb.timeout: 86400
proxy.config.hostdb.fail.timeout: 0
proxy.config.log.collation_host_timeout: 86390
proxy.config.log.collation_client_timeout: 86400
proxy.config.ssl.session_cache.timeout: 0
proxy.config.ssl.handshake_timeout_in: 0
proxy.config.ssl.ocsp.cache_timeout: 3600
proxy.config.ssl.ocsp.request_timeout: 10
proxy.config.http2.accept_no_activity_timeout: 120
proxy.config.http2.no_activity_timeout_in: 120
proxy.config.http2.active_timeout_in: 900
Best regards,
Veiko
Post by Veiko Kukk
Hi,
ATS 7.1.2, CentOS 7.4.
We have regular requests via ATS in reverse mode to get response time
statistics and decide whether source is available. This request is excluded
from caching, so it always results in TCP_MISS/200 in normal conditions
with response time fluctuation depending on the distance of ATS server from
source.
Sometimes there are network errors that get logged
as ERR_CONNECT_FAIL/502, after which there are again TCP_MISS/200 but with
very high latency!
This happens on all ATS servers we have deployed.
For example, when normally before network error response times fluctuate
around 150ms, then after they fluctuate around 1300 ms. After restarting
ATS, response times are back to normal.
168
12979
2755
117256
56442
40104
1043
30955
30972
18418
26542
4172
8587
59259
4674
37166
16123
67019
41723
3497
6957
18684
17663
14634
20036
1305
14815
3526
10542
62519
22025
40556
36821
2342
2644
3695
7059
45581
1706
30947
16001
136383
1345
72770
1087
1311
1074
1080
1101
1033
1045
1057
1109
1078
1104
1059
1066
1062
1143
1074
1095
1207
1059
1067
1101
1110
1069
1394
1278
1323
1319
1315
1317
1317
1343
1270
1284
1362
1446
1363
1302
1321
1330
1317
1335
1346
1312
593
362
129
381
361
380
394
394
117
384
341
112
88
363
397
93
385
356
371
360
411
120
338
497
449
377
131
458
161
372
112
364
87
129
340
93
114
338
377
454
424
381
381
348
129
336
116
404
108
340
114
Best regards,
Veiko
Loading...