Discussion:
Ramdisk - 6.2.1 vs 7.0.0
Jeremy Payne
2017-06-06 13:32:17 UTC
Permalink
ATS versions - 6.2.1 and 7.0.0
Cache disk - 18G ramdisk - /dev/ram0
System memory - 32G
Traffic type - HLS Live

While testing live HLS traffic, I noticed that ATS 6.2.1 continued to use
ramdisk until eventually traffic_server was restarted (via traffic_cop) as a
result of the 'oom killer'.

Testing with ATS 7.0.0, I saw memory use remain stable once the ramdisk
reached its 'configured' capacity.

I looked through the 7.0 changelog and didn't see anything obvious, so maybe
someone is aware of an undocumented change which impacts ATS 'honoring'
configured ramdisk boundaries.
Understood that ATS may not know the difference between /dev/sdX and /dev/ramX,
but I am putting this out to the community in case I am missing something.

Jeremy
Jeremy Payne
2017-06-08 22:35:20 UTC
Permalink
ATS versions - 6.2.1
Cache disk - 18G ramdisk - /dev/ram0
storage.config configured to use 14G of ramdisk
System memory - 32G
Traffic type - HLS Live
Connections - sustained 4k
Traffic - 5-6 Gbps
Hit ratio - 97%

So it looks like I misread/misinterpreted a few things during the testing
referenced in my previous email.
It looks like my issue was never runaway memory use from the ramdisk; it looks
like something internal to ATS that I can't seem to isolate.

Since I am not seeing this type of memory consumption with my large-file/VOD
cache servers, and the connection profile is much different, I am guessing the
issue may be the long-lived connections. The ATS active timeout is currently
set to 3600. This is live traffic, so connections stay open for an hour and are
then closed, whereas with my large-file/VOD cache servers connections only
remain open for as long as a file/title is being retrieved: 10-15 minutes max.

Has anyone seen any issues with this type of connection profile and ATS
6.2.1?

This doesn't seem to be an issue with 7.0, so I'm not sure what may have changed
between releases.

Thanks!
Bryan Call
2017-06-09 16:02:18 UTC
Permalink
What is your proxy.config.net.default_inactivity_timeout set to? You can also do a dump of the memory by doing:

$ sudo kill -SIGUSR1 $(pidof traffic_server)  # and look at traffic.out

That will give a pretty good idea on where the memory is being used.

-Bryan
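
For anyone scripting this, the same signal can be sent from Python. This is a minimal sketch, not part of ATS: the helper names `find_pids` and `request_memory_dump` are made up for illustration, it assumes `pidof` is on the PATH, and it needs enough privilege to signal the traffic_server process.

```python
import os
import signal
import subprocess


def find_pids(process_name="traffic_server"):
    """Return the PIDs that `pidof` reports for process_name (empty list if none)."""
    try:
        result = subprocess.run(
            ["pidof", process_name], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # `pidof` is not installed on this system.
        return []
    return [int(pid) for pid in result.stdout.split()]


def request_memory_dump(pid):
    """Send SIGUSR1 to pid; per the advice above, the dump then appears in traffic.out."""
    os.kill(pid, signal.SIGUSR1)
```

Calling `request_memory_dump(pid)` for each PID returned by `find_pids()` mirrors the shell command above; the dump itself still has to be read from traffic.out.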
Chou, Peter
2017-06-09 18:50:24 UTC
Permalink
Bryan,

Our default_inactivity_timeout is set to 300s. We have looked at the memory dump in traffic.out as well, but we're stumped: the memory tracked in the dump comes to only about 8GB allocated and 6GB used in total, while pmap -x <pid> reports around 26GB allocated to the process.

From pmap -- 26127468 11646528 11646094 8773892 11644512 12341772 == [ size rss pss referenced anonymous swap ] (KB)
From traffic.out -- 7895071920 | 5956998576 | | TOTAL (B)

Is this similar to the runaway anon region allocation seen before? Appreciate any insights.

Thanks,
Peter
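
To put a number on the gap described above, the two quoted lines can be compared directly. This is only a sketch of the arithmetic: it assumes pmap's "KB" are 1024-byte units and that the first two fields of the traffic.out TOTAL line are allocated and used bytes, as the post states. The function names are illustrative.

```python
KIB = 1024  # assumption: pmap's "KB" column values are 1024-byte units

# Column names and figures exactly as quoted in the post.
PMAP_COLUMNS = ("size", "rss", "pss", "referenced", "anonymous", "swap")
PMAP_LINE = "26127468 11646528 11646094 8773892 11644512 12341772"
TRAFFIC_OUT_TOTAL_LINE = "7895071920 | 5956998576 | | TOTAL"


def parse_pmap_totals(line):
    """Map each pmap column name to its value converted to bytes."""
    values = [int(field) for field in line.split()]
    return {name: value * KIB for name, value in zip(PMAP_COLUMNS, values)}


def parse_traffic_out_total(line):
    """Return (allocated_bytes, used_bytes) from the TOTAL line of the dump."""
    fields = [field.strip() for field in line.split("|")]
    return int(fields[0]), int(fields[1])


pmap_bytes = parse_pmap_totals(PMAP_LINE)
allocated, used = parse_traffic_out_total(TRAFFIC_OUT_TOTAL_LINE)
untracked = pmap_bytes["size"] - allocated

print(f"pmap size:           {pmap_bytes['size'] / 1e9:.2f} GB")
print(f"tracked allocated:   {allocated / 1e9:.2f} GB")
print(f"tracked in use:      {used / 1e9:.2f} GB")
print(f"not tracked by dump: {untracked / 1e9:.2f} GB")
```

Under those assumptions, roughly 18.9 GB of the process's mapped size is not accounted for by the allocator dump, which is why the dump alone could not explain the growth.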


Bryan Call
2017-06-10 15:11:52 UTC
Permalink
Do you have TLS connections (https)? You can use tcmalloc or jemalloc to profile the memory allocation.

-Bryan
Jeremy Payne
2017-06-10 15:37:33 UTC
Permalink
ATS is configured to listen on 443:ssl, but no certificate is referenced in
ssl_multicert or elsewhere, so https requests will fail. We are also not
sending https requests during these test runs.
Chou, Peter
2017-06-16 18:13:25 UTC
Permalink
Bryan,

Turns out that this memory leak is the result of configuring a log collation host in “logs_xml.config” and then having the host be down during the traffic run. We verified that this occurs with vanilla 6.2.1. We also noticed that there was a memory leak fix in this function patched in TS-4872, which is present in 6.2.x, 7.1.x, and master. So we tried the latest 6.2.x (current) from yesterday, but the memory leak still occurred. We have also seen that bringing up the log host during the test run stabilizes the memory allocation so we’re pretty sure that this is the trigger for the memory leak.

Some heap-check output from ATS compiled with tcmalloc --
Total: 3014.2 MB
  2902.3  96.3%  96.3%   2902.3  96.3% ats_malloc
    76.0   2.5%  98.8%     76.0   2.5% ats_memalign
    35.8   1.2% 100.0%   2938.1  97.5% LogObject::_checkout_write
     0.1   0.0% 100.0%      0.1   0.0% BaseLogFile::open_file
     0.0   0.0% 100.0%      0.0   0.0% ResourceTracker::lookup (inline)

Thanks,
Peter
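
The heap-check listing above can be ranked by cumulative size to see which caller is responsible for the allocations. This is a small sketch that assumes the usual pprof text column order (flat MB, flat %, running sum %, cumulative MB, cumulative %, function name); `parse_profile` is an illustrative helper, not a gperftools API.

```python
HEAP_PROFILE_TEXT = """\
Total: 3014.2 MB
2902.3 96.3% 96.3% 2902.3 96.3% ats_malloc
76.0 2.5% 98.8% 76.0 2.5% ats_memalign
35.8 1.2% 100.0% 2938.1 97.5% LogObject::_checkout_write
0.1 0.0% 100.0% 0.1 0.0% BaseLogFile::open_file
0.0 0.0% 100.0% 0.0 0.0% ResourceTracker::lookup (inline)
"""


def parse_profile(text):
    """Parse pprof-style text rows into dicts with flat and cumulative MB."""
    rows = []
    for line in text.splitlines():
        # Split into at most six fields so names containing spaces stay whole.
        fields = line.split(None, 5)
        if len(fields) != 6:
            continue  # skips the "Total:" header and blank lines
        flat_mb, _flat_pct, _sum_pct, cum_mb, _cum_pct, name = fields
        rows.append(
            {"name": name, "flat_mb": float(flat_mb), "cum_mb": float(cum_mb)}
        )
    return rows


rows = parse_profile(HEAP_PROFILE_TEXT)
by_cumulative = sorted(rows, key=lambda row: row["cum_mb"], reverse=True)
for row in by_cumulative:
    print(
        f"{row['cum_mb']:8.1f} MB cumulative  "
        f"{row['flat_mb']:8.1f} MB flat  {row['name']}"
    )
```

Sorted this way, LogObject::_checkout_write comes first: it allocates little itself, but almost all of the ats_malloc memory is reached through it, which is consistent with the logging path being the source of the leak.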


Yongming Zhao
2017-06-17 10:23:35 UTC
Permalink
Good job. A log collation host being down should only cause a switch to orphan logging; that memory leak is not expected. Please continue your investigation and file an issue on GitHub with some details.

Patches welcome, as always.

- Yongming Zhao 赵永明