Sizing post #3: Understanding percentiles

Hello, here we are now on post #3 on my favorite topic: “sizing”.

So far, we have learned how to standardize data coming from ASH so we can use it for sizing (Sizing post #1: How to standardize ASH data for sizing).

In our previous post, Sizing post #2: The not so good mean, the bad median, and the ugly minimum and maximum, we talked about the average (mean), median, minimum, and maximum, and we realized that those statistics are not the right way to determine the number of CPUs required for our database workload, because CPU utilization can spike drastically and leave us without enough processing power. Nobody wants their system to come to a halt, right?

Here we are going to explain percentiles and see how we can use them to come up with the number of CPUs required to support our example workload.

A short and good explanation is in the article “How to Calculate Percentiles in Statistics” by Deborah J. Rumsey.

Basically, a percentile is … well … instead of writing the definition here, let’s just try to come up with it ourselves with an example.

Let’s start with a data set, which can be our own cpu_data column values.

2, 3, 1, 5, 10, 25, 2, 1, 9, 3, 2, 2, 2, 1, 2, 3, 5, 2, 2, 2

We have 20 values in our data set.

We order them ascending and we get this list:

1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 5, 9, 10, 25

Let’s decide that we want to work with the 90th percentile.

What does that mean? It means that we want to find the value in the list that has 10% of the values above it (to its right) and 90% of the values below it (to its left).

If I have 20 values and that is 100% of my values, then 90% of the values is 18 values. If I assign positional numbers to my list, position 18 holds the value 9.

Value:    1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 5, 9, 10, 25
Position: 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18  19  20

Because the 90th percentile falls between positions 18 and 19, I take the average of the values at those two positions, which are 9 and 10: ( 9 + 10 ) / 2 = 9.5

By formula:

To get the position:
Value # = ( Percentile / 100 ) * (n + 1) = ( 90 / 100 ) * ( 21 ) = 18.9

There is no position 18.9, so I just take the average of the values at positions 18 and 19: ( 9 + 10 ) / 2 = 9.5
Note that this method takes a simple average, but I could also interpolate linearly (take the value 0.9 of the way between positions 18 and 19) to be more accurate.

For this data set the values represent CPUs, so a fractional value does not apply and I would round up to 10.

With this data set we conclude that:

The 90th percentile is 9.5, rounded up to 10
 90% of the values are below 10
 10% of the values are above 10

To conclude our little section explaining percentiles, here is the definition:

Percentile (in our case the 9.5): the value below which a given percentage (in our case 90%) of the data falls.
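
If we want to double-check this small example in the database itself, here is a minimal sketch, assuming an Oracle session (sys.odcinumberlist is just a convenient built-in collection type for inlining the 20 values):

-- Inline the 20 sample values and compute the 90th percentile both ways.
WITH sample_data AS (
  SELECT COLUMN_VALUE cpu_data
  FROM   TABLE(sys.odcinumberlist(2,3,1,5,10,25,2,1,9,3,2,2,2,1,2,3,5,2,2,2))
)
SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY cpu_data) pct_cont
      ,PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY cpu_data) pct_disc
FROM   sample_data;

A small caveat: Oracle’s PERCENTILE_CONT interpolates at position 1 + p * (n - 1) = 1 + 0.9 * 19 = 18.1, so it returns 9 + 0.1 * (10 - 9) = 9.1 instead of the 9.5 produced by the (n + 1) textbook method above, while PERCENTILE_DISC returns 9, a value that actually exists in the list. Different tools use slightly different conventions, but for sizing purposes they all land in the same ballpark once we round up.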

This data set is too small to be of practical use for deciding the CPU requirement. When we calculate percentiles for a real data set, like our example with 69,120 values, we get a better approximation of the number of CPUs needed for our workload.

For our convenience we use the Oracle function PERCENTILE_DISC(0.9) to get the 90th percentile.

We run this against our data set of 69,120 data points:

SELECT PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY cpu_data) "Percentile_Disc"
FROM   cpu_standard;

Percentile_Disc
---------------
              5

SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY cpu_data) "Percentile_Cont"
FROM   cpu_standard;

Percentile_Cont
---------------
              5

In this case we got the same value of 5, however, there is a difference between PERCENTILE_DISC (discrete) and PERCENTILE_CONT (continuous). The discrete function returns a value that exists in the data set and the continuous function returns an interpolated value that is not necessarily in the data set.

How do we interpret our result of 5?

The 90th percentile is 5
90% of the values are below 5
10% of the values are above 5
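
We can sanity-check that interpretation with a quick count (a sketch over the same cpu_standard table): by definition of the 90th percentile, no more than 10% of the samples should exceed 5.

-- Count how many of the 69,120 samples are above the 90th percentile value of 5.
SELECT COUNT(*) total_samples
      ,SUM(CASE WHEN cpu_data > 5 THEN 1 ELSE 0 END) above_5
      ,ROUND(100 * SUM(CASE WHEN cpu_data > 5 THEN 1 ELSE 0 END) / COUNT(*), 1) pct_above
FROM   cpu_standard;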

For the 97th percentile:

SELECT ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "Percentile_Cont"
FROM   cpu_standard;

Percentile_Cont
---------------
              9

The 97th percentile is 9
97% of the values are below 9
 3% of the values are above 9

I am using PERCENTILE_CONT(0.97) instead of PERCENTILE_DISC(0.97) because I want a more accurate value even if that value does not exist in the data set; I can then round it with the ROUND function.
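
A small caveat: ROUND rounds to the nearest integer, so 8.2 would come back as 8. If we wanted to always round up, which is the more conservative choice for sizing, we could use CEIL instead (a sketch):

-- CEIL always rounds up: CEIL(8.2) = 9, while ROUND(8.2) = 8.
SELECT CEIL(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "Percentile_Cont"
FROM   cpu_standard;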

The interpretation in terms of CPU utilization

We can say that for the whole data set of 69,120 data points the 97th percentile is 9, and also that 3% of the values in the data set are above 9. In this case, if I were to move my database workload to a system with 9 CPUs, then 3% of the time I would be demanding more than those 9 CPUs and there would be waits. The problem is that we are using the whole data set and we do not know the details of the distribution of the CPU utilization, meaning we do not know whether that 3% above 9 CPUs is condensed into just a few hours or spread across the 8 days of collected data.

Why is it important to know how the utilization is distributed?

We have 8 days of data, which in hours is (8 * 24) = 192 hours; 3% of 192 hours is 5.76, roughly 6 hours.
Imagine that the 3% above 9 CPUs of utilization is condensed into only 6 hours instead of spread across the 8 days. The system would come to a halt as soon as that first hour started demanding more than 9 CPUs, which is the total capacity of our imaginary new system.

Using percentiles over the whole data set is not a good way to determine the CPUs required for our database workload. However, we can still use percentiles, can’t we? What if we calculate percentiles by the hour and thereby overcome the problem of not knowing the distribution of the CPU utilization?

Let’s see…

How can we get the 97th percentile for each hour?

SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "97th PERCENTILE"
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 1;

THE_HOUR      97th PERCENTILE
------------- ---------------
2016-09-06T19               6
2016-09-06T20               5
2016-09-06T21               4
2016-09-06T22               5
2016-09-06T23               4
...
...
2016-09-14T06               6
2016-09-14T07               9
2016-09-14T08              10
2016-09-14T09              14
2016-09-14T10              12
2016-09-14T11               6
2016-09-14T12               6
2016-09-14T13               3
2016-09-14T14              10
2016-09-14T15               7
2016-09-14T16               2
2016-09-14T17               3
2016-09-14T18               0

192 rows selected.

Let’s materialize these results in a table and create a couple of charts:

CREATE TABLE cpu_standard_hour_perc
AS SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "97th PERCENTILE"
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 1;

DESC cpu_standard_hour_perc

Name                  Null?    Type
--------------------- -------- ----------------------
THE_HOUR                       VARCHAR2(13)
97th PERCENTILE                NUMBER

[Chart: 97th percentile of CPU per hour]
The chart above shows time on the horizontal axis, with a label every 3 hours; the chart below is a close-up of a specific section of it, showing every hour.
[Chart: close-up of the 97th percentile per hour]
How do we interpret these results?

We got one row per hour. We have 192 hours in our data set and we have calculated the 97th percentile for each hour.

Let’s grab the first row shown in the output:

THE_HOUR      97th PERCENTILE
------------- ---------------
2016-09-06T19               6

During the hour from 7pm to 8pm on 06-SEP-2016:

The 97th percentile is 6
97% of the values are below 6
 3% of the values are above 6

Here is where we get into more detail on what is going on within the hour. Remember that our samples are taken every 10 seconds, which means that in one hour (3,600 seconds) we have 360 samples, each 10 seconds apart. If 3% of the values in one hour are above 6, then I had CPU utilization above 6 for about 10.8 samples, that is 108 seconds, or 1.8 minutes. If those samples were contiguous, which would be the worst case, I would have the system waiting for 1.8 minutes.
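
We can verify that each hour really holds 360 samples with a quick count (a sketch; 69,120 samples / 192 hours = 360):

-- Count the 10-second samples in each hour; every hour should show 360.
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,COUNT(*) SAMPLES_IN_HOUR
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 1;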

Now that I have one 97th percentile value for each hour, I can find the maximum of those hourly 97th percentile values across the whole data set.

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "97th PERCENTILE"
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      97th PERCENTILE
------------- ---------------
2016-09-07T10              15

In terms of the 97th percentile, the result of this last query is the worst hour, or in other words the busiest hour, in which:

The 97th percentile is 15
97% of the values are below 15
 3% of the values are above 15

In order to fulfill this requirement we would need 15 CPUs, and we could still have up to 1.8 minutes of waits during this particular hour.
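
As a side note, since we already materialized the hourly percentiles in cpu_standard_hour_perc, the busiest hour can also be read straight from that table (the column must be quoted because it was created as the quoted identifier "97th PERCENTILE"):

SELECT * FROM
(
SELECT the_hour, "97th PERCENTILE"
FROM   cpu_standard_hour_perc
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;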

It is a good time to mention here that when we size, we do not size to 100%: if we need 15 CPUs, the system must still leave extra room for unexpected spikes in utilization. To protect the system with a cushion of, let’s say, 30%, we would need 21.4 CPUs,

because if 15 CPUs is 70% of capacity, then (15 * 100) / 70 = 21.4 CPUs.

In this case we have come up with 15 as the CPU requirement for our database workload, because we do not want waits longer than 1.8 minutes within the hour, using the 97th percentile for each hour and taking the busiest hour.

What we are really doing here is cropping spikes in the CPU utilization. Remember that we had a maximum of 27 CPUs:

SELECT * FROM 
(
SELECT the_hour, cpu_data
FROM cpu_standard_hourly
ORDER BY 2 DESC
)
WHERE rownum <=1;

THE_HOUR         CPU_DATA
------------- -----------
2016-09-14T13          27

and we are saying that 15 CPUs would be enough. This is because the 27 was a one-time sample of 27 active sessions; the CPU utilization dropped right after, and there were no long waits. We do not really need 27 CPUs: if the system is asked for 27 CPUs and does not have them, it will queue the tasks, and there could be waits depending on how long the CPU resource is required, which in this case was no longer than the 10-second sample interval.

We have shown that using percentiles is a good method to measure the CPUs required, and it can help us identify the busiest hours in our database workload.

If we were willing to tolerate longer waits, we could use the 95th percentile instead of the 97th. The 95th percentile gives us a CPU value below which 95% of the values fall and above which 5% fall. This means we could have waits during 5% of the hour; 5% of an hour is 3 minutes.

We can build a table to show how long the waits could be according to each percentile value:

97th percentile -> longest wait is 1.8 min.
95th percentile -> longest wait is  3  min.
90th percentile -> longest wait is  6  min.
80th percentile -> longest wait is 12  min.
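
If we want to compare several percentile levels side by side before choosing one, we can compute them in a single pass, following the same hourly query we used above (a sketch):

SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.80) WITHIN GROUP (ORDER BY cpu_data)) "80th"
      ,ROUND(PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY cpu_data)) "90th"
      ,ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_data)) "95th"
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "97th"
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 1;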

The question is: how do we pick the percentile that is appropriate for our database workload?

Answer: It depends on how much pain (waits) we are willing to tolerate…

Stay tuned for the next blog post on sizing matters: “Sizing post #4: How much pain are we willing to tolerate?”

Author: Jorge Barba
