Sizing post #5: How to map our existing processing power to new hardware.

In our last blog post, “Sizing post #4: How much pain are we willing to tolerate?”, we talked about how much pain we can tolerate so we can decide on which percentile to use to size our database workload. It is not one-size-fits-all because it depends on the database workload profile: some databases would be fine with the 90th percentile, while others would need the 97th, 98th, or even 99th percentile.

Once we have decided the number of CPUs required for our database workload, we know its footprint and we have a better picture of how the resources are being used. Using this information we can move our database to other hardware and be confident that the target hardware is going to be able to support our database workload.

When we move from one system to another with a different processor, or even the same processor from a different vendor, the processing power may not be the same. When we move to a different vendor but the same processor, the difference may not be relevant and we could map 1 to 1; that is, if we know our CPU requirement is 32, then a target system from a different vendor with the same processor will most likely be fine with 32 CPUs as well. However, if we move to a system with a different processor, we most likely need a way to map the number of CPUs required to the new hardware, as the processing power may differ greatly.

It could be that the new server is more powerful than the existing server. In this case we would need to map the processing power. For example, if the new server is twice as powerful, then we map 2 to 1; in other words, if our existing server required 32 CPUs, then the new server would need only 16 CPUs.

On the other hand, if the new server has half the processing power of the existing server, we map 1 to 2, now requiring 64 CPUs for our imaginary example database workload.

The question is: how do we map processing power between the existing hardware and the new hardware?

Answer: One way to map the existing processing power to a new server’s processing power is to use the RESULT value from SPEC.

“The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC develops benchmark suites and also reviews and publishes submitted results from our member organizations and other benchmark licensees.”

Reference: https://www.spec.org

SPEC has several benchmarks to test servers. We are using the results from the benchmark suite SPEC CPU2006 CINT2006.

“Designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems, SPEC CPU2006 contains two benchmark suites: CINT2006 for measuring and comparing compute-intensive integer performance, and CFP2006 for measuring and comparing compute-intensive floating point performance.”

Let’s use an example to explain the mapping.

For our example we are using a BL460c Gen9 server as our existing hardware and a DL360 Gen9 as our new hardware.

Source (existing) server: BL460c Gen9
Target (new)      server: DL360  Gen9

SYSTEM                                                 CORES THREADS CPUs RESULT RP_CORE RP_CPU
------------------------------------------------------ ----- ------- ---- ------ ------- ------
ProLiant DL360 Gen9 (2.40 GHz, Intel Xeon E5-2640 v3)     16       2   32    749    46.8   23.4
ProLiant BL460c Gen9 (2.30 GHz, Intel Xeon E5-2699 v3)    18       2   36    678    37.7   18.8

Assume we identified that we required 24 CPUs for our database workload when running our database on a BL460c Gen9 server with the Intel Xeon E5-2699 v3 processor. RP_CPU is the RESULT divided by the number of CPU threads (likewise, RP_CORE is the RESULT divided by the number of cores), so this server has a “result per cpu thread” of 678 / 36 = 18.8.

How many CPU threads are needed in the new DL360 Gen9 with Intel Xeon E5-2640 v3 to cover for those 24 CPUs?

Because the “result per cpu thread” of the new hardware is higher than that of the existing hardware, we can already expect that we will need fewer CPUs in the new hardware.

Let’s start by answering: how much more powerful is the new hardware compared to the existing hardware?

We can answer that by dividing the RP_CPU of the new hardware by the RP_CPU of the existing hardware:

  23.4 / 18.8 = 1.24

The new hardware is 1.24 times more powerful than the existing hardware.

In order to get the ratio, we divide the RP_CPU of the existing hardware by the RP_CPU of the new hardware:

  18.8 / 23.4 = 0.8034

Then we multiply the number of required CPUs (24) by the ratio:

  24 * (18.8 / 23.4) = 19.28

There is our answer: we need 19.28 CPUs in the new DL360 server to cover for the 24 CPUs in the existing BL460c server.

Obviously we can’t have fractions of a CPU, so we round up to 20 CPUs.

24 CPUs in a BL460c    ~    20 CPUs in a DL360
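To summarize the arithmetic, here is the mapping as a quick query against DUAL (a minimal sketch; the RP_CPU values come from the SPEC table above, and CEIL rounds the fraction of a CPU up):

SELECT CEIL(24 * (18.8 / 23.4)) TARGET_CPUS
FROM   dual;

TARGET_CPUS
-----------
         20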

When doing this type of conversion, make sure you have the correct processor, because the BL460c and DL360 servers can be configured with several processor choices, and each choice makes a difference in the “result per cpu thread” of the server.

The RESULT value, which is the number that comes out of the SPEC benchmark, is a comparative value among servers. We can divide it by the number of CPU threads to get the “result per cpu thread” and come up with an equivalent number of CPUs from one server to another.

Remember that we have been working with required CPUs, which is the number of CPUs we found by analyzing the standardized ASH/AWR data. That value represents the database workload; however, we still need to account for headroom for unexpected spikes and potential growth. For example, if we require 20 CPUs in our target DL360 server and we want 30% headroom for unexpected spikes and growth, our 20 CPUs represent 70% of the total, so we calculate the total like this:

 Total num CPUs =  ( 20 * 100) / 70 = 28.57 ~ 29 CPUs

For our example we would need 29 CPUs.
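The headroom calculation can be sketched the same way:

SELECT CEIL((20 * 100) / 70) TOTAL_CPUS
FROM   dual;

TOTAL_CPUS
----------
        29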

When doing a sizing for the cloud, we need the specifications of the server used in the cloud. If we know the required number of CPUs for our database workload but we don’t know the server model and processor of the hardware used in the cloud, it is not possible to use the SPEC data to map processing power between the source and target servers.

Let’s assume that the cloud is using a DL360 Gen9 with the Intel Xeon E5-2640 v3 processor. In this case we would look for a service that provides at least the 29 CPUs that we need.

Stay tuned for more on sizing…

Author: Jorge Barba

Sizing post #4: How much pain are we willing to tolerate?

We have gone through 3 posts already and have learned how to standardize ASH data for sizing, reviewed some basic statistics like mean, median, maximum, and minimum, and used percentiles to calculate the CPU requirement for a single instance database.

In this post we want to show a way to figure out what is the appropriate percentile that we can use to size our database workload.

We have been using a single instance Oracle Database for our examples, and we are going to continue using a single instance database workload to keep things simple for now. The example we are going to use is a different database workload with far fewer active sessions, to make it even simpler to understand.

Table: CPU_STANDARD_EX2 (EX2 stands for example 2)

Let’s grab some statistics on this database workload.

SELECT MIN(sample_time) START_TIME, MAX(sample_time) END_TIME, ROUND(MAX(sample_time) - MIN(sample_time)) Retention
FROM cpu_standard_ex2;

START_TIM END_TIME    RETENTION
--------- --------- -----------
04-AUG-16 19-AUG-16          15

We had our retention period set to 15 days, thus we have data collected for 15 days, starting Aug 4th and ending Aug 19th.

SELECT MIN(cpu_data) MINIMUM, MAX(cpu_data) MAXIMUM
FROM   cpu_standard_ex2;

    MINIMUM     MAXIMUM
----------- -----------
          0           9

This database has a minimum CPU utilization of 0 and a maximum of 9 active sessions.

SELECT AVG(cpu_data) AVERAGE, MEDIAN(cpu_data) MEDIAN
FROM   cpu_standard_ex2;

    AVERAGE      MEDIAN
----------- -----------
.3778628907           0

The average is not even 1 active session and the median is 0. The median is the 50th percentile, so at least half of the values in this database workload are 0.

To have a visualization we can summarize the data into hourly data points with the maximum value per hour.

SELECT   TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM     cpu_standard_ex2
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

THE_HOUR         CPU_DATA
------------- -----------
2016-08-04T06           1
2016-08-04T07           1
2016-08-04T08           1
2016-08-04T09           1
...
...
2016-08-18T21           2
2016-08-18T22           3
2016-08-18T23           2
2016-08-19T00           2
2016-08-19T01           2
2016-08-19T02           1
2016-08-19T03           1
2016-08-19T04           1

359 rows selected.

We create a table with the hourly cpu data to build a line chart in Excel, as explained in Sizing post #2: The not so good mean, the bad median, and the ugly minimum and maximum.

CREATE TABLE cpu_standard_ex2_hourly AS
SELECT   TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM     cpu_standard_ex2
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

Here is the chart for our example 2 database workload:
[Chart: hourly_ex2 — hourly maximum CPU data for the example 2 workload]
Because we are summarizing with the maximum, most of the data points will not be 0 but at least 1; however, remember that this does not mean the whole hour was at 1 active session. Let’s take a look at the detail of a specific hour.

CREATE TABLE cpu_standard_ex2_aug_16 AS
SELECT   TO_CHAR(sample_time, 'YYYY-MM-DD"T"HH24:MI:SS') THE_HOUR, cpu_data
FROM     cpu_standard_ex2
WHERE    TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') = '2016-08-16T05' 
ORDER BY 1;

[Chart: example2_10s — 10-second samples for the hour 2016-08-16T05]
See how the majority of our data points are 0 with some at 1.

Let’s get the hour where we had our maximum of 9.

SELECT * FROM
(
SELECT   TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM     cpu_standard_ex2
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <=1;

THE_HOUR         CPU_DATA
------------- -----------
2016-08-06T05           9

Generating table and chart:

CREATE TABLE cpu_standard_ex2_aug_06 AS
SELECT   TO_CHAR(sample_time, 'YYYY-MM-DD"T"HH24:MI:SS') THE_HOUR, cpu_data
FROM     cpu_standard_ex2
WHERE    TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') = '2016-08-06T05' 
ORDER BY 1;

[Chart: example2_spike — 10-second samples for the hour 2016-08-06T05]
There is a spike in CPU utilization up to 9 active sessions on 2016-08-06 at 05:31:20. This is a single 10-second sample that had 9 active sessions; around that time the CPU utilization was at 1 and 2.

It would not be realistic to size with 9, because the workload is usually at 0, 1, or 2, and sizing to the maximum of 9 would allocate too much resource.

If we remember from our previous post:

97th percentile -> longest wait is 1.8 min.
95th percentile -> longest wait is  3  min.
90th percentile -> longest wait is  6  min.
80th percentile -> longest wait is 12  min.

If we use the 95th percentile because we are willing to tolerate a maximum wait of 3 minutes, the calculation is:

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_data)) "95th PERCENTILE"
FROM   cpu_standard_ex2
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      95th PERCENTILE
------------- ---------------
2016-08-04T23               2

A 90th percentile would result in 1.

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY cpu_data)) "90th PERCENTILE"
FROM   cpu_standard_ex2
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      90th PERCENTILE
------------- ---------------
2016-08-04T22               1

In fact, using the 97th percentile we also get 2 CPUs, the same as with the 95th percentile. If we get the results for various percentile values, we can build a table like the following:

100th percentile ->          no wait         -> 9 CPUs
 99th percentile -> longest wait is  36 sec. -> 4 CPUs
 97th percentile -> longest wait is 1.8 min. -> 2 CPUs
 95th percentile -> longest wait is  3  min. -> 2 CPUs
 90th percentile -> longest wait is  6  min. -> 1 CPU
 80th percentile -> longest wait is 12  min. -> 1 CPU
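As a side note, instead of running the percentile query once per percentile value, the CPU column of this table can be built in a single pass; a sketch (each column takes the busiest hourly value for that percentile):

SELECT MAX(p99) "99th", MAX(p97) "97th", MAX(p95) "95th", MAX(p90) "90th", MAX(p80) "80th"
FROM (
SELECT ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY cpu_data)) p99
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) p97
      ,ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_data)) p95
      ,ROUND(PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY cpu_data)) p90
      ,ROUND(PERCENTILE_CONT(0.80) WITHIN GROUP (ORDER BY cpu_data)) p80
FROM   cpu_standard_ex2
GROUP BY TRUNC(sample_time,'HH24')
);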

Here is where we come back to our question of what percentile to use.

It depends on how much pain (wait) we are willing to tolerate.

If we choose 4 CPUs for our system, our potential longest wait in our worst hour would be 36 seconds; if we are willing to tolerate a wait of 1.8 minutes in our worst hour, we can choose 2 CPUs. For this particular example, going from 36 seconds to 1.8 minutes saves 2 of the 4 CPUs; in other words, we would have to double the amount of resources to get the extra coverage.

Let’s say we pick the 2 CPUs, and now we want to know how many times the CPU utilization went above those 2 CPUs. We can build a table to hold the number of samples above a specific number of CPUs. We need a procedure to populate that table and a query to show the result:

CREATE TABLE samples_above
(cpu_required      NUMBER,
 end_date          DATE,
 samples_above     NUMBER
);

CREATE OR REPLACE PROCEDURE get_samples_above
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
BEGIN
   INSERT INTO samples_above
   SELECT p_cpu_req, TRUNC(sample_time, 'HH24'), COUNT(cpu_data) samples_above
   FROM cpu_standard_ex2
   WHERE cpu_data > p_cpu_req
   GROUP BY TRUNC(sample_time, 'HH24');
   DBMS_OUTPUT.PUT_LINE(' '); 
   DBMS_OUTPUT.PUT_LINE('get_samples_above: CPU req ' || p_cpu_req);
   COMMIT;
END;
/

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(2)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           2 2016-08-16T09             6
           2 2016-08-17T22             4
           2 2016-08-16T23             3
           2 2016-08-08T22             3
           2 2016-08-07T00             3
           2 2016-08-12T22             2
           2 2016-08-18T22             2
           2 2016-08-15T00             2
           2 2016-08-13T17             1
           2 2016-08-14T00             1
           2 2016-08-10T23             1
           2 2016-08-10T22             1
           2 2016-08-14T17             1
           2 2016-08-09T22             1
           2 2016-08-07T11             1
           2 2016-08-06T06             1
           2 2016-08-06T16             1
           2 2016-08-06T05             1
           2 2016-08-06T14             1
           2 2016-08-05T22             1
           2 2016-08-12T23             1
           2 2016-08-08T01             1
           2 2016-08-08T07             1
           2 2016-08-06T09             1
           2 2016-08-06T02             1
           2 2016-08-15T22             1
           2 2016-08-07T09             1
           2 2016-08-13T18             1

28 rows selected.

How do we interpret the output?

The busiest hour is 2016-08-16T09 with 6 samples above 2. We are referring to standardized ASH/AWR samples, the ones that are 10 seconds equidistant. There are 360 samples in one hour. If we have 6 samples above 2 CPUs and we assume those samples are contiguous, our longest wait in that hour would be 60 seconds.

If the samples above are not contiguous, then the waits would be shorter. Let’s find out whether our samples above were contiguous or spread within the hour.

CREATE TABLE how_long
(cpu_required    NUMBER,
 end_date        DATE,
 how_long_sec    NUMBER,
 how_long_min    NUMBER,
 max_sample      NUMBER
);

CREATE OR REPLACE PROCEDURE get_how_long
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
  CURSOR c_10s_t IS
    SELECT sample_time, cpu_data 
    FROM cpu_standard_ex2
    ORDER BY sample_time; 
  v_end_date            cpu_standard_ex2.sample_time%TYPE;  -- to fetch date
  v_store_end_date      cpu_standard_ex2.sample_time%TYPE;  -- to store the final list of dates where utilization went above #CPUs
  v_on_cpu              cpu_standard_ex2.cpu_data%TYPE;     -- to fetch on_cpu
  v_count               NUMBER := 0;  -- to count how many samples the CPU utilization stayed above #CPUs
  v_max_sample          NUMBER := 0;  -- to hold the max sample in each entry of how_long
BEGIN
  OPEN c_10s_t;
  LOOP
    FETCH c_10s_t into v_end_date, v_on_cpu;
    EXIT WHEN c_10s_t%NOTFOUND;
    IF v_on_cpu > p_cpu_req THEN
      v_store_end_date := v_end_date;
      v_count      := 1;
      v_max_sample := v_on_cpu;  
      LOOP
        FETCH c_10s_t into v_end_date, v_on_cpu;
        IF c_10s_t%NOTFOUND THEN
          -- end of data: record the run that was still in progress
          INSERT INTO how_long VALUES(p_cpu_req, v_store_end_date, v_count*10, (v_count*10)/60, v_max_sample);
          EXIT;
        END IF;
        IF v_on_cpu > p_cpu_req THEN
          v_count := v_count + 1;
          IF v_on_cpu > v_max_sample THEN
            v_max_sample := v_on_cpu;
          END IF;
        ELSE
          INSERT INTO how_long VALUES(p_cpu_req, v_store_end_date, v_count*10, (v_count*10)/60, v_max_sample);
          EXIT;
        END IF;
      END LOOP;
    END IF;
  END LOOP;
  CLOSE c_10s_t; 
  DBMS_OUTPUT.PUT_LINE(' ');
  DBMS_OUTPUT.PUT_LINE('get_how_long: CPU req ' || p_cpu_req);
  DBMS_OUTPUT.PUT_LINE('get_how_long: Data in how_long');
  COMMIT;
END;
/

TRUNCATE TABLE how_long;
EXEC get_how_long(2);

SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME", how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           2 2016-08-16T09:47:40          20         .33           5
           2 2016-08-17T22:00:30          20         .33           4
           2 2016-08-12T22:00:40          20         .33           4
           2 2016-08-06T05:31:20          10         .17           9
           2 2016-08-16T09:49:30          10         .17           5
           2 2016-08-16T09:50:30          10         .17           4
           2 2016-08-16T09:48:10          10         .17           4
           2 2016-08-16T09:51:20          10         .17           4
           2 2016-08-07T00:58:10          10         .17           3
           2 2016-08-07T09:16:50          10         .17           3
...
...

Each row in the table tells us the time at which the samples went above the number of CPUs and how long it took for the utilization to go back down below that number of CPUs.

Our assumption was not true: the samples above 2 were not contiguous, otherwise we would have seen 60 seconds at the top, but we see 20 seconds instead. At most we had 2 contiguous samples.

We got the number of samples above 2 CPUs and how long the utilization stays above those 2 CPUs.
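As an aside, the same “how long” information can also be obtained without PL/SQL, using a gaps-and-islands query; a sketch: since the samples are 10 seconds equidistant, subtracting 10 seconds per row number from each above-threshold sample yields a constant key per contiguous run.

SELECT TO_CHAR(MIN(sample_time), 'YYYY-MM-DD"T"HH24:MI:SS') "TIME"
      ,COUNT(*) * 10                                        "SECONDS"
      ,MAX(cpu_data)                                        "MAX_SAMPLE"
FROM (
  SELECT sample_time, cpu_data
        ,sample_time - NUMTODSINTERVAL(ROW_NUMBER() OVER (ORDER BY sample_time) * 10, 'SECOND') grp
  FROM   cpu_standard_ex2
  WHERE  cpu_data > 2
)
GROUP BY grp
ORDER BY 2 DESC, 3 DESC;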

We can get the same information for 1 CPU:

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(1)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           1 2016-08-09T22            35
           1 2016-08-17T23            26
           1 2016-08-12T22            25
           1 2016-08-06T06            24
           1 2016-08-05T23            23
           1 2016-08-08T22            23
           1 2016-08-08T01            22
           1 2016-08-05T01            22
...
...
           1 2016-08-12T03             1
           1 2016-08-18T06             1
           1 2016-08-14T04             1
           1 2016-08-06T02             1

154 rows selected.

The first row shows that there are 35 samples above 1 CPU in that hour. There are 360 samples in one hour, thus 35 samples represent 9.7%, very close to the 10% that corresponds to 6 minutes.

Tying back to our percentiles table:

100th percentile ->          no wait         -> 9 CPUs
 99th percentile -> longest wait is  36 sec. -> 4 CPUs
 97th percentile -> longest wait is 1.8 min. -> 2 CPUs
 95th percentile -> longest wait is  3  min. -> 2 CPUs
 90th percentile -> longest wait is  6  min. -> 1 CPU   <<<<<<
 80th percentile -> longest wait is 12  min. -> 1 CPU
The 1 CPU is what we get when we use the 90th percentile, and potentially our longest wait could be 6 minutes.

Having the system unable to provide enough CPU resource for 6 minutes seems too long (we are not saying that a session would have to wait 6 minutes before it gets connected or gets a response back, just that we are requesting more CPUs than are available), therefore we would not say that 1 CPU is enough to support this database workload. We can investigate whether the samples above 1 are contiguous.

TRUNCATE TABLE how_long;
EXEC get_how_long(1);

SELECT * FROM
(
SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME", how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC
)
WHERE ROWNUM <=10;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           1 2016-08-09T22:00:30         180           3           3
           1 2016-08-17T22:00:10         100        1.67           4
           1 2016-08-17T23:03:00          80        1.33           2
           1 2016-08-13T06:02:00          70        1.17           2
           1 2016-08-14T06:08:50          60           1           2
           1 2016-08-17T23:33:50          60           1           2
           1 2016-08-05T01:20:40          60           1           2
           1 2016-08-12T22:00:30          50         .83           4
           1 2016-08-06T16:00:50          50         .83           3
           1 2016-08-05T01:52:00          50         .83           2

10 rows selected.

The 35 samples above are not contiguous; the longest run of contiguous CPU utilization above 1 is 18 samples (180 seconds).

In real life we obviously do not have a physical server with 1 or 2 CPUs to run an Oracle Database instance, but the example helps us understand that the percentile used depends on how much pain (wait) we are willing to tolerate.

If we had a VMware cluster with virtual machines, we could certainly accommodate this example database workload. Because we want to leave some headroom for unexpected spikes in CPU utilization, we could start by assigning 3 vCPUs to this database workload, or even 4 to also account for the background processes of VM management. The advantage of using VMs is that the resources can be reallocated.

Let’s try another example.

The data for the next example has been placed in table CPU_STANDARD_EX3.

Let’s grab some statistics on this database workload.

SELECT MIN(sample_time) START_TIME, MAX(sample_time) END_TIME, ROUND(MAX(sample_time) - MIN(sample_time)) Retention
FROM cpu_standard_ex3;

START_TIM END_TIME    RETENTION
--------- --------- -----------
03-AUG-16 18-AUG-16          15

We had our retention period set to 15 days, thus we have data collected for 15 days, starting Aug 3rd and ending Aug 18th.

SELECT MIN(cpu_data) MINIMUM, MAX(cpu_data) MAXIMUM
FROM cpu_standard_ex3;

    MINIMUM     MAXIMUM
----------- -----------
          0          21

This database has a minimum CPU utilization of 0 and a maximum of 21 active sessions.

SELECT AVG(cpu_data) AVERAGE, MEDIAN(cpu_data) MEDIAN
FROM cpu_standard_ex3;

    AVERAGE      MEDIAN
----------- -----------
.3324691358           0

CREATE TABLE cpu_standard_ex3_hourly
AS SELECT TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM cpu_standard_ex3
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

Using the hourly data to create a line chart:
[Chart: example3_hourly — hourly maximum CPU data for the example 3 workload]
In order to find out the 95th percentile:

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_data)) "95th PERCENTILE"
FROM   cpu_standard_ex3
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      95th PERCENTILE
------------- ---------------
2016-08-14T07               4

Using that query we can build the following table:

100th percentile ->          no wait         -> 21 CPUs
 99th percentile -> longest wait is  36 sec. ->  7 CPUs
 97th percentile -> longest wait is 1.8 min. ->  5 CPUs
 95th percentile -> longest wait is  3  min. ->  4 CPUs
 90th percentile -> longest wait is  6  min. ->  2 CPUs
 80th percentile -> longest wait is 12  min. ->  2 CPUs

Let’s pick the 95th percentile value of 4 and get the samples above 4 CPUs.

Obtaining the samples above:

CREATE OR REPLACE PROCEDURE get_samples_above
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
BEGIN
   INSERT INTO samples_above
   SELECT p_cpu_req, TRUNC(sample_time, 'HH24'), COUNT(cpu_data) samples_above
   FROM cpu_standard_ex3
   WHERE cpu_data > p_cpu_req
   GROUP BY TRUNC(sample_time, 'HH24');
   DBMS_OUTPUT.PUT_LINE(' '); 
   DBMS_OUTPUT.PUT_LINE('get_samples_above: CPU req ' || p_cpu_req);
   COMMIT;
END;
/

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(4)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           4 2016-08-13T07            12
           4 2016-08-06T07             8
           4 2016-08-09T13             6
           4 2016-08-07T07             5
           4 2016-08-10T07             5
           4 2016-08-08T19             5
           4 2016-08-14T07             5
           4 2016-08-16T07             5
           4 2016-08-05T07             5
           4 2016-08-17T07             5
           4 2016-08-10T05             4
           4 2016-08-11T01             4
...
...

How about 90th percentile?

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(2)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           2 2016-08-14T07            33
           2 2016-08-06T07            31
           2 2016-08-11T01            27
           2 2016-08-07T07            27
           2 2016-08-08T19            25
           2 2016-08-12T01            25
           2 2016-08-13T07            25
           2 2016-08-05T07            24
           2 2016-08-09T01            23
           2 2016-08-08T07            22
...
...

It seems too busy; let’s try 3 CPUs.

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(3)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           3 2016-08-14T07            20
           3 2016-08-05T07            17
           3 2016-08-13T07            16
           3 2016-08-06T07            15
           3 2016-08-07T07            15
           3 2016-08-08T07            14
           3 2016-08-10T07            12
           3 2016-08-04T05            11
           3 2016-08-11T01            10
           3 2016-08-12T05            10
           3 2016-08-17T07            10
...
...

The 3 CPUs correspond to the 93rd percentile.

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.93) WITHIN GROUP (ORDER BY cpu_data)) "93rd PERCENTILE"
FROM   cpu_standard_ex3
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      93rd PERCENTILE
------------- ---------------
2016-08-06T07               3

Let’s see what we got here. The busiest hour has 20 samples above 3 CPUs; if all 20 samples were contiguous, the longest wait would be 200 seconds, about 3 minutes and 20 seconds.

In order to see the “how long” results we run the following:

CREATE OR REPLACE PROCEDURE get_how_long
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
  CURSOR c_10s_t IS
    SELECT sample_time, cpu_data 
    FROM cpu_standard_ex3
    ORDER BY sample_time; 
  v_end_date            cpu_standard_ex3.sample_time%TYPE;  -- to fetch date
  v_store_end_date      cpu_standard_ex3.sample_time%TYPE;  -- to store the final list of dates where utilization went above #CPUs
  v_on_cpu              cpu_standard_ex3.cpu_data%TYPE;     -- to fetch on_cpu
  v_count               NUMBER := 0;  -- to count how many samples the CPU utilization stayed above #CPUs
  v_max_sample          NUMBER := 0;  -- to hold the max sample in each entry of how_long
BEGIN
  OPEN c_10s_t;
  LOOP
    FETCH c_10s_t into v_end_date, v_on_cpu;
    EXIT WHEN c_10s_t%NOTFOUND;
    IF v_on_cpu > p_cpu_req THEN
      v_store_end_date := v_end_date;
      v_count      := 1;
      v_max_sample := v_on_cpu;  
      LOOP
        FETCH c_10s_t into v_end_date, v_on_cpu;
        IF c_10s_t%NOTFOUND THEN
          -- end of data: record the run that was still in progress
          INSERT INTO how_long VALUES(p_cpu_req, v_store_end_date, v_count*10, (v_count*10)/60, v_max_sample);
          EXIT;
        END IF;
        IF v_on_cpu > p_cpu_req THEN
          v_count := v_count + 1;
          IF v_on_cpu > v_max_sample THEN
            v_max_sample := v_on_cpu;
          END IF;
        ELSE
          INSERT INTO how_long VALUES(p_cpu_req, v_store_end_date, v_count*10, (v_count*10)/60, v_max_sample);
          EXIT;
        END IF;
      END LOOP;
    END IF;
  END LOOP;
  CLOSE c_10s_t; 
  DBMS_OUTPUT.PUT_LINE(' ');
  DBMS_OUTPUT.PUT_LINE('get_how_long: CPU req ' || p_cpu_req);
  DBMS_OUTPUT.PUT_LINE('get_how_long: Data in how_long');
  COMMIT;
END;
/

TRUNCATE TABLE how_long;
EXEC get_how_long(3);

SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME", how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           3 2016-08-05T07:31:40          50         .83           7
           3 2016-08-07T07:31:40          40         .67           9
           3 2016-08-16T00:41:00          40         .67           7
           3 2016-08-11T05:46:30          40         .67           5
           3 2016-08-13T07:32:10          30          .5           7
           3 2016-08-16T03:31:50          30          .5           7
           3 2016-08-11T07:31:50          30          .5           7
           3 2016-08-10T07:31:50          30          .5           7
           3 2016-08-17T00:41:10          30          .5           6
           3 2016-08-13T07:37:20          30          .5           6
           3 2016-08-16T07:35:50          30          .5           5
           3 2016-08-14T07:44:10          30          .5           5
           3 2016-08-06T01:02:40          30          .5           5
           3 2016-08-04T05:33:10          30          .5           4
...
...
           3 2016-08-18T07:32:20          10         .17           4
           3 2016-08-18T07:35:20          10         .17           4
           3 2016-08-18T07:38:00          10         .17           4
           3 2016-08-18T07:42:50          10         .17           4

574 rows selected.

How about with 4 CPUs?

TRUNCATE TABLE how_long;
EXEC get_how_long(4);

SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME", how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           4 2016-08-13T07:32:10          30          .5           7
           4 2016-08-07T07:31:40          20         .33           9
           4 2016-08-13T07:32:50          20         .33           9
           4 2016-08-06T07:32:40          20         .33           9
           4 2016-08-10T05:46:40          20         .33           7
           4 2016-08-17T00:41:10          20         .33           6
           4 2016-08-08T07:31:40          20         .33           6
           4 2016-08-12T03:44:50          20         .33           6
           4 2016-08-13T07:37:30          20         .33           6
           4 2016-08-16T07:36:00          20         .33           5
           4 2016-08-14T05:46:50          20         .33           5
           4 2016-08-10T07:37:20          20         .33           5
           4 2016-08-13T05:33:00          20         .33           5
...
...
           4 2016-08-11T05:46:50          10         .17           5
           4 2016-08-11T07:31:50          10         .17           5
           4 2016-08-12T00:41:30          10         .17           5
           4 2016-08-12T05:32:10          10         .17           5

306 rows selected.

The 93rd percentile would be a good option because the longest wait is no more than 50 seconds; however, there are many times when the system goes above 3 CPUs (574 times), and even with 4 CPUs the system goes above 4 CPUs 306 times.

We could use 4 CPUs as the number of CPUs for this database. If we wanted this system to have 50% headroom, we would add 4 more CPUs for a total of 8 CPUs. If we wanted this as the starting value of a VM for this database workload, we could start with 5 (4 + 1 for VM overhead) and monitor to see if more are needed.

As a side note, I am using this space to recommend the use of Oracle Resource Manager to control the spikes in CPU utilization.

“High database loads destabilize the server. Many operating system vulnerabilities and bugs are exposed at excessive CPU loads. In addition, excessive CPU loads can starve critical applications and monitoring tools running on the server.”

“One key to a healthy database is maintaining a healthy CPU load. Excessive CPU load can destabilize the server and expose operating system bugs. Excessive CPU load can also prevent critical Oracle background processes from running in a timely manner, resulting in failures such as database instance evictions on a RAC database. At such times, your response time may be so poor that you may be unable to debug the source of the excessive CPU load, for example by identifying and fixing a faulty SQL plan.
Using Oracle Database Resource Manager, you can ensure that your database’s CPU load is always healthy, thus avoiding all of these problems. Resource Manager manages your CPU load so that all CPUs are fully utilized yet there is no thrashing since no (or very few) Oracle processes are waiting in the operating system run queue. Using a database feature called Instance Caging, you can configure Resource Manager to manage the CPU load to even lower levels.”

Reference: Using Oracle Database Resource Manager

In this blog post we have learned how to choose a percentile to decide how many CPUs are required for our database workload, and we reviewed how many times the CPU utilization goes above our chosen number of CPUs and how long it stays above that number.

With this information we are able to decide how many CPUs we want to allocate for our database.

It is important to size correctly: if we undersubscribe we risk the system coming to a halt, and if we oversubscribe we spend more money than needed on hardware and licenses.

If you are migrating to a new system and need to know the number of CPUs in a target system that is different from the current one, you need to translate the processing power of the current server to the new server. If you want to learn how to do that, stay tuned for the next blog post: “Sizing post #5: How to map our existing processing power to new hardware.”

Author: Jorge Barba

Sizing post #3: Understanding percentiles

Hello, here we are now on post #3 on my favorite topic: “sizing”.

So far, we have learned how to standardize data coming from ASH to use it for sizing. Sizing post #1: How to standardize ASH data for sizing.

In our previous post, “Sizing post #2: The not so good mean, the bad median, and the ugly minimum and maximum”, we talked about the average (mean), median, minimum, and maximum, and we realized that those statistics are not the way to determine the number of CPUs required for our database workload, because the CPU utilization can go up drastically and we may not have enough processing power. Nobody wants their system to come to a halt, right?

Here we are going to explain percentiles and see how we can use them to come up with the number of CPUs required to support our example workload.

A short and good explanation is in the article “How to Calculate Percentiles in Statistics” by Deborah J. Rumsey.

Basically the percentile is… well, instead of writing the definition here, let’s just try to come up with it ourselves through an example.

Let’s start with a data set, which can be our own cpu_data column values.

2, 3, 1, 5, 10, 25, 2, 1, 9, 3, 2, 2, 2, 1, 2, 3, 5, 2, 2, 2

We have 20 values in our data set.

We order them ascending and we get this list:

1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 5, 9, 10, 25

Let’s decide that we want to work with the 90th percentile.

What does that mean? It means that we want to find the value in the list that has 10% of the values above it (to its right) and 90% of the values below it (to its left).

If I have 20 values and that is 100% of my values, then 90% of the values is 18 values; assigning positional numbers to my list, position 18 holds the value 9.

Value:    1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 5, 9, 10, 25
Position: 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18  19  20

Because I need the value between positions 18 and 19, I take the average of the values at positions 18 and 19, which are 9 and 10; their average is ( 9 + 10 ) / 2 = 9.5.

By formula:

To get the position:
Value # = ( Percentile / 100 ) * (n + 1) = ( 90 / 100 ) * ( 21 ) = 18.9

There is no position 18.9, thus I just take the average of the values at positions 18 and 19: ( 9 + 10 ) / 2 = 9.5.
Note that the method I am using takes the average, but I could also interpolate to be more accurate.

For this data set the values represent CPUs, therefore the decimal does not apply and I would round up to 10.

With this data set we conclude that:

The 90th percentile is 9.5 rounding it up to 10
 90% of the values are below 10
 10% of the values are above 10

To conclude our little section explaining percentiles, here is the definition of percentile:

Percentile (in our case the 9.5): the value below which a percentage (in our case 90%) of the data falls.

This data set is too small to be practical to decide what cpu requirement is needed. When we get to calculate percentiles for a real data set like our example with 69,120 values we would have a better approximation of what we need in terms of number of CPUs for our workload.
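Before moving to the real data set, we can check our hand calculation in SQL; here is a minimal sketch that loads the 20 values into an inline collection (sys.odcinumberlist is used only for illustration). Note that PERCENTILE_CONT places the 90th percentile at row 1 + 0.9 * (n - 1) = 18.1 and interpolates, so it returns 9.1 rather than the 9.5 of the (n + 1) method above, while PERCENTILE_DISC returns the actual data value 9:

SELECT PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY COLUMN_VALUE) "Percentile_Disc"
      ,PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY COLUMN_VALUE) "Percentile_Cont"
FROM   TABLE(sys.odcinumberlist(2,3,1,5,10,25,2,1,9,3,2,2,2,1,2,3,5,2,2,2));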

For our convenience we use the Oracle function PERCENTILE_DISC(0.9) to get the 90th percentile.

We are running this against our data set of 69,120 data points:

SELECT PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY cpu_data) "Percentile_Disc"
FROM   cpu_standard;

Percentile_Disc
---------------
              5

SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY cpu_data) "Percentile_Cont"
FROM   cpu_standard;

Percentile_Cont
---------------
              5

In this case we got the same value of 5; however, there is a difference between PERCENTILE_DISC (discrete) and PERCENTILE_CONT (continuous). The discrete function returns a value that exists in the data set, while the continuous function returns an interpolated value that is not necessarily in the data set.

How do we interpret our result of 5?

The 90th percentile is 5
90% of the values are below 5
10% of the values are above 5

For the 97th percentile:

SELECT ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "Percentile_Cont"
FROM   cpu_standard;

Percentile_Cont
---------------
              9

The 97th percentile is 9
97% of the values are below 9
 3% of the values are above 9

I am using PERCENTILE_CONT(0.97) instead of PERCENTILE_DISC(0.97) because I want a more accurate value even if that value does not exist in the data set; I can then round it up with the ROUND function.

The interpretation in terms of CPU utilization:

We can say that for our whole data set of 69,120 data points the 97th percentile is 9, and also that 3% of the values in the data set are above 9. In this case, if we were to move the database workload to a system with 9 CPUs, then 3% of the time we would be asking for more than those 9 CPUs and there would be waits. The problem here is that we are using the whole data set and we do not know the details of the distribution of the CPU utilization, meaning that we do not know whether that 3% above 9 CPUs is condensed in just a few hours or spread across the 8 days of collected data.

Why is it important to know how the utilization is distributed?

We have 8 days of data, which is (8 * 24) = 192 hours; 3% of 192 hours is 5.76, about 6 hours.
Imagine that the 3% above 9 CPUs of utilization is condensed in only 6 hours instead of spread across the 8 days. The system would come to a halt in that first hour in which the CPU utilization starts going above 9 CPUs, which is the total capacity of our imaginary new system.

Using the percentiles for the whole data set is not a good option to determine the CPU required for our database workload; however, we can still use percentiles, can’t we? What if we compute percentiles by the hour and somehow overcome the problem of not knowing the distribution of the CPU utilization?

Let’s see…

How can we get the 97th percentile for each hour?

SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "97th PERCENTILE"
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 1;

THE_HOUR      97th PERCENTILE
------------- ---------------
2016-09-06T19               6
2016-09-06T20               5
2016-09-06T21               4
2016-09-06T22               5
2016-09-06T23               4
...
...
2016-09-14T06               6
2016-09-14T07               9
2016-09-14T08              10
2016-09-14T09              14
2016-09-14T10              12
2016-09-14T11               6
2016-09-14T12               6
2016-09-14T13               3
2016-09-14T14              10
2016-09-14T15               7
2016-09-14T16               2
2016-09-14T17               3
2016-09-14T18               0

192 rows selected.

Let’s materialize these results in a table and create a couple of charts.

CREATE TABLE cpu_standard_hour_perc
AS SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "97th PERCENTILE"
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 1;

DESC cpu_standard_hour_perc

Name                  Null?    Type
--------------------- -------- ----------------------
THE_HOUR                       VARCHAR2(13)
97th PERCENTILE                NUMBER

[Chart: percentile97th — 97th percentile of CPU utilization per hour]
The chart above shows time on the horizontal axis, with one tick every 3 hours; the chart below is a close-up that shows every hour in a specific section of the chart above.
[Chart: percentile97th-closeup — close-up of a section, one tick per hour]
How do we interpret these results?

We got one row per hour. We have 192 hours in our data set and we have calculated the 97th percentile for each hour.

Let’s grab the first row shown in the output:

THE_HOUR      97th PERCENTILE
------------- ---------------
2016-09-06T19               6

During the hour from 7pm to 8pm on 06-SEP-2016:

The 97th percentile is 6
97% of the values are below 6
 3% of the values are above 6

Here is where we get into more detail on what is going on within the hour. Remember that our samples are taken every 10 seconds, which means that in one hour (3,600 seconds) we have 360 samples, equidistant 10 seconds from each other. If 3% of the values are above 6 in one hour, that means the CPU utilization was above 6 for 1.8 minutes (3% of 3,600 seconds is 108 seconds). If those samples were contiguous, which would be the worst case, the system would be waiting for 1.8 minutes.

Now that we have one 97th percentile value for each hour, we can find the maximum hourly 97th percentile value across the whole data set.

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.97) WITHIN GROUP (ORDER BY cpu_data)) "97th PERCENTILE"
FROM   cpu_standard
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      97th PERCENTILE
------------- ---------------
2016-09-07T10              15

In terms of the 97th percentile, the result from this last query is the worst hour, in other words the busiest hour, in which:

The 97th percentile is 15
97% of the values are below 15
 3% of the values are above 15

In order to fulfill this requirement we would need 15 CPUs, and we would still have up to 1.8 minutes of wait during this particular hour.

It is a good time to mention that when we size, we do not size to 100%. In this case, if we need 15 CPUs, the system needs to leave extra room for unexpected spikes in utilization. In order to protect the system with a cushion of, let’s say, 30%, we would need 21.4 CPUs,

because if 15 CPUs is 70%, then (15 * 100) / 70 = 21.4 CPUs.

In this case we have come up with 15 as the CPU requirement for our database workload, because we don’t want waits over 1.8 minutes within the hour, using the 97th percentile for each hour and taking the busiest hour.

What we are really doing here is cropping the spikes in CPU utilization. If we remember, we had a maximum of 27 CPUs:

SELECT * FROM 
(
SELECT the_hour, cpu_data
FROM cpu_standard_hourly
ORDER BY 2 DESC
)
WHERE rownum <=1;

THE_HOUR         CPU_DATA
------------- -----------
2016-09-14T13          27

and we are saying that 15 CPUs would be enough. This is because the 27 was a single sample with 27 active sessions; the CPU utilization went down suddenly afterwards and there were no long waits. We don’t really need 27 CPUs: if the system is asked for 27 CPUs and does not have them, it will queue the tasks, and there could be waits depending on how long the CPU resource is required, which in this case was no longer than the 10-second sample interval.

We have shown that using percentiles is a good method to measure the CPU required, and it can help us identify the busiest hours in our database workload.

If we were willing to have longer waits, we could use the 95th percentile instead of the 97th. The 95th percentile gives us a CPU value below which 95% of the values fall, with 5% above. This means we can have waits for 5% of the hour; 5% of each hour is 3 minutes.

We can build a table to show how long the waits could be according to each percentile value:

97th percentile -> longest wait is 1.8 min.
95th percentile -> longest wait is  3  min.
90th percentile -> longest wait is  6  min.
80th percentile -> longest wait is 12  min.
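The “longest wait” column follows directly from the percentile: each hour has 360 equidistant 10-second samples, so the fraction of the hour that can be above the percentile value is (1 - percentile), and the worst-case wait is (1 - percentile) * 60 minutes. A quick sketch of the arithmetic (sys.odcinumberlist is used only to hold the list of percentiles):

SELECT COLUMN_VALUE "PERCENTILE", (1 - COLUMN_VALUE) * 60 "LONGEST_WAIT_MIN"
FROM   TABLE(sys.odcinumberlist(0.97, 0.95, 0.90, 0.80));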

The question is: how do we pick the percentile that is appropriate for our database workload?

Answer: It depends on how much pain (waits) we are willing to tolerate…

Stay tuned for the next blog post on sizing matters: “Sizing post #4: How much pain are we willing to tolerate?”

Author: Jorge Barba

Sizing post #2: The not so good mean, the bad median, and the ugly minimum and maximum

This is the second blog entry for a series of posts related to the topic of “sizing.”

This time we are going to chart the cpu usage and calculate the average (mean), median, minimum, and maximum from the ASH data we standardized in the previous post, “Sizing post #1: How to standardize ASH data for sizing.”

In our previous post we ended up with a table called CPU_STANDARD with two columns, SAMPLE_TIME and CPU_DATA. The SAMPLE_TIME column contains 10-second equidistant times and the CPU_DATA column contains the number of active sessions for each sample time.

DESC cpu_standard

 Name					   Null?    Type
 ----------------------------------------- -------- ----------------------------
 SAMPLE_TIME					    DATE
 CPU_DATA					    NUMBER

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS'), cpu_data 
FROM cpu_standard
ORDER BY 1;

SAMPLE_TI    CPU_DATA
--------- -----------
...
...
2016-09-14T16:48:50           1
2016-09-14T16:49:00           1
2016-09-14T16:49:10           4
2016-09-14T16:49:20           1
2016-09-14T16:49:30          10
2016-09-14T16:49:40           1
2016-09-14T16:49:50           3
2016-09-14T16:50:00           1
2016-09-14T16:50:10           1
...
...

69120 rows selected.

The data in table CPU_STANDARD was collected from a Workload Repository with a retention period of only 8 days.

We can figure out the number of days using two methods.

1. Number of rows.

There are 69,120 rows in the CPU_STANDARD table. Each row represents a 10-second period because we have one sample every 10 seconds. There are 3,600 seconds in one hour, which means there are 360 samples (one every 10 seconds) in each hour. Divide 69,120 by 360 to get the number of hours: 69,120 / 360 = 192. Divide 192 by 24 hours to get the number of days: 192 / 24 = 8. The retention period was only 8 days.
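We can also let the database do this arithmetic for us; a quick sketch:

SELECT COUNT(*) / 360 / 24 RETENTION_DAYS
FROM   cpu_standard;

RETENTION_DAYS
--------------
             8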

2. Min and Max sample time.

SELECT MIN(sample_time) min_st, MAX(sample_time) max_st 
FROM cpu_standard;

MIN_ST    MAX_ST
--------- ---------
06-SEP-16 14-SEP-16

Our cpu data is from September 6th to September 14th which is a period of 8 days.

I recommend having a 45-day retention period, or at least 30 days, to make sure we capture the activity that cycles every month, for example a monthly backup or the month-end closing accounting processes; however, for the purposes of this blog post we can work with the 8 days of cpu data.

Visualization

How about we chart our cpu data to visualize how much CPU our database is consuming? How can we chart 69,120 data points? Trying to chart 69,120 points in Excel may not work, and most likely we will get an “Application not responding” message. In fact, a chart with so many data points may not be as useful as we expect.

In order to get an idea of the workload of our database, we can generate a chart, but instead of using every 10-second sample we can generate one data point for every 360 samples (one data point per hour).

How do we summarize 360 data points into just one? We could take the average (mean), the median, or the maximum value.

Let’s use the maximum value of each hour to represent that hour in our chart. The following query truncates the date column to the hour and gets the maximum value of each hour:

SELECT TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data)
FROM cpu_standard
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

...
...
2016-09-14T12             8
2016-09-14T13            27
2016-09-14T14            15
2016-09-14T15            12
2016-09-14T16            10
2016-09-14T17             7
2016-09-14T18             1

192 rows selected.

Now we have a more manageable data set for our chart. We can chart 192 data points easily in Excel.

First we create a table with the hourly data:

CREATE TABLE cpu_standard_hourly
AS SELECT TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM cpu_standard
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

Once we have our table with only 192 data points, we export the data to Excel format in SQL Developer. If you haven’t exported table data in SQL Developer, you will be surprised how easy it is; review the documentation Database Express Edition 2 Day DBA 11.2, chapter 10, Exporting and Importing Metadata and Data.

When we open the exported data in Excel we have two columns of data: THE_HOUR and CPU_DATA. This is the data we are going to use for our chart. One very simple example of how to create a line chart is shown on this page: MS Excel 2016: How to Create a Line Chart.

From the following chart we can observe the minimum utilization at 1 and the maximum utilization at 27.

[Chart: hourly — hourly maximum CPU data for the CPU_STANDARD workload]

We can also get those values with queries

SELECT MIN(cpu_data) MIN_CPU, MAX(cpu_data) MAX_CPU
FROM cpu_standard_hourly;

    MIN_CPU     MAX_CPU
----------- -----------
          1          27

SELECT * FROM 
(
SELECT the_hour, cpu_data
FROM cpu_standard_hourly
ORDER BY 2
)
WHERE rownum <=1;

THE_HOUR         CPU_DATA
------------- -----------
2016-09-14T18           1

SELECT * FROM 
(
SELECT the_hour, cpu_data
FROM cpu_standard_hourly
ORDER BY 2 DESC
)
WHERE rownum <=1;

THE_HOUR         CPU_DATA
------------- -----------
2016-09-14T13          27

Remember that 27 is the maximum utilization within that hour; it does not necessarily mean the CPU was at 27 active sessions for the whole hour, just that the maximum sampled cpu data reached 27. Because the hourly data hides the detail of what happens within the hour, we can chart the 10-second samples in the CPU_STANDARD table, not for the whole 8-day retention period but for smaller time periods of interest, for example one hour.

Let’s chart the 360 points of the one hour “2016-09-14T13” that reached the maximum of 27 active sessions. The following query gets the data points:

SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') sample_time, cpu_data
FROM cpu_standard
WHERE TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24') = '2016-09-14T13'
ORDER BY 1;

...
...
2016-09-14T13           1
2016-09-14T13           1
2016-09-14T13           1
2016-09-14T13           1
2016-09-14T13           1
2016-09-14T13           1
2016-09-14T13           1

360 rows selected.

Let’s create a table to hold exactly that data:

CREATE TABLE cpu_standard_14T13
AS SELECT sample_time, cpu_data
FROM cpu_standard
WHERE TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24') = '2016-09-14T13'
ORDER BY 1;

Exporting to Excel and charting, we get:

[Chart: cpu14t13-1 — 10-second samples for the hour 2016-09-14T13]

The spike of 27 active sessions happened for only one sample; the rest are 1s and 2s.

Scrolling to the right we can see other periods of time:

[Chart: cpu14t13-2 — additional 10-second samples for the same hour]

There are some 3s, 4s, 6s, and 8s, and the majority of the samples are 1s.

So far, we have created charts for data points representing the maximum sample of every hour and we also created a chart with 360 data points for one single hour to observe the details of what happens regarding cpu utilization within the hour.

Let’s get some other statistics.

The average (mean) and the median utilization of the database:

SELECT AVG(cpu_data)
FROM cpu_standard;

AVG(CPU_DATA)
-------------
  2.452748843

SELECT MEDIAN(cpu_data)
FROM cpu_standard;

MEDIAN(CPU_DATA)
----------------
               2

The question is: how can we use this information to know the number of CPUs we really need to support the load of the database, if sometimes it uses 27 and sometimes it uses 1, with an average of 2.45 and a median of 2?

Using the average of 2.45 or the median of 2 CPUs as the requirement for this database is not realistic, because the database will go above 2 CPUs and generate a lot of waits at those specific times. Using the maximum of 27 is not realistic either, because we would be wasting resources: those 27 were reached only once, meaning it was only for a very short period of time, in a logon-storm fashion, and the utilization went back down really quickly.

Well, the answer is: we don’t know yet. The not so good mean and the bad median with the ugly minimum and maximum don’t give us the answer yet, but at least we have some knowledge about how our database behaves and how many sessions are active at specific times. We can say we got a profile of our workload. This profile can vary from database to database in the organization depending on what applications are connecting to the database; in fact, the entity that defines the workload is the application or set of applications making use of the database.

Stay tuned for the next blog post on sizing matters.

Author: Jorge Barba

 

Sizing post #1: How to standardize ASH data for sizing

This is a first blog entry for a series of posts related to the topic of “sizing.”

The standardization of ASH data for sizing consists of aggregating the number of sessions on CPU, adding the 0s (zeroes) to this aggregated data, and filling in sample times so that we end up with equidistant 10-second samples. The original idea for the standardization was not mine; it was my friend Mauro Pagano who allowed me to use his idea for my blog.

When we want to know the CPU footprint of our database we can do some research using the data from AWR. The sample data stored in AWR that comes from ASH can be used to measure the CPU footprint of a database among several other measures.

The CPU data from ASH is sampled only when there is activity, meaning that periods of time without activity do not get recorded in ASH. This causes a problem in the case we want a time series with equidistant points to visualize or analyze the CPU usage of a database.

The following article shows one of several different ways to standardize the aggregation of ASH data into 10 seconds equidistant sample times. This includes adding the times where there is no activity or in other words the 0 (zero) values for cpu data.

Since AWR is our source of data, we need to make sure it is configured appropriately for our purposes.

We want to make sure the snapshot interval is 60 minutes and the retention period is 45 days in addition to verifying that the STATISTICS_LEVEL parameter is set to TYPICAL or ALL.

The following command shows the value for the parameter statistics_level

SQL> show parameter statistics_level

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
statistics_level                     string      TYPICAL

If the statistics_level parameter is not set to TYPICAL or ALL we can change it using the following command:

ALTER SYSTEM SET statistics_level = TYPICAL scope=both;

The following query shows the value for snapshot_interval and retention period

SELECT EXTRACT(DAY    FROM snap_interval)*24*60+
       EXTRACT(HOUR   FROM snap_interval)   *60+
       EXTRACT(MINUTE FROM snap_interval)           INTERVAL_IN_MINUTES,
       (EXTRACT(DAY    FROM retention)*24*60+
       EXTRACT(HOUR   FROM retention)*60+
       EXTRACT(MINUTE FROM retention))/(24*60)      RETENTION_IN_DAYS
FROM dba_hist_wr_control;

INTERVAL_IN_MINUTES RETENTION_IN_DAYS
------------------- -----------------
                 60                 8

If the snapshot_interval is not 60 minutes or the retention is not 45 days, we can change them using the following command:

BEGIN
  DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(
    retention => 64800,       -- Minutes (= 45 Days). Current value retained if NULL.
    interval  => 60);         -- Minutes. Current value retained if NULL.
END;
/

SELECT EXTRACT(DAY    FROM snap_interval)*24*60+
       EXTRACT(HOUR   FROM snap_interval)   *60+
       EXTRACT(MINUTE FROM snap_interval)           INTERVAL_IN_MINUTES,
       (EXTRACT(DAY    FROM retention)*24*60+
       EXTRACT(HOUR   FROM retention)*60+
       EXTRACT(MINUTE FROM retention))/(24*60)      RETENTION_IN_DAYS
FROM dba_hist_wr_control;

INTERVAL_IN_MINUTES RETENTION_IN_DAYS
------------------- -----------------
                 60                45

V$ACTIVE_SESSION_HISTORY displays sampled session activity in the database. It contains snapshots of active database sessions taken once a second. A database session is considered active if it was on the CPU or was waiting for an event that didn’t belong to the Idle wait class. This view contains one row for each active session per sample and returns the latest session sample rows first.

DBA_HIST_ACTIVE_SESS_HISTORY view displays the history of the contents of the in-memory active session history of recent system activity. This view contains snapshots of V$ACTIVE_SESSION_HISTORY.
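
AWR retains roughly every tenth ASH sample, so the rows in DBA_HIST_ACTIVE_SESS_HISTORY are about 10 seconds apart while there is activity; this is why we standardize to a 10-second grid later. A sketch to observe the spacing between consecutive retained samples:

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS') sample_time
      ,ROUND( ( CAST(sample_time AS DATE)
              - CAST(LAG(sample_time) OVER (ORDER BY sample_time) AS DATE) ) * 86400 ) gap_seconds
FROM   (SELECT DISTINCT sample_time FROM dba_hist_active_sess_history)
ORDER BY sample_time;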

The following query shows the sample time and the number of samples at that specific sample time. This reads as 3 sessions active at 2016-12-20T16:00:53, 5 sessions active at 2016-12-20T16:11:34, and so on.

SELECT   TO_CHAR(sample_time, 'YYYY-MM-DD"T"HH24:MI:SS') SAMPLE_TIME
        ,COUNT(*)  NUMBER_OF_SAMPLES
FROM     dba_hist_active_sess_history
WHERE    session_state='ON CPU'
GROUP BY sample_time
ORDER BY 1;

SAMPLE_TIME         NUMBER_OF_SAMPLES
------------------- -----------------
...
...
2016-12-20T16:00:53           3
2016-12-20T16:11:34           5
2016-12-20T16:11:44           1

8411 rows selected.

If we want to extract only the rows for the sessions ‘ON CPU’ and discard the sessions on wait events, we can create a table and insert the data from dba_hist_active_sess_history. Because the sample times are not equidistant, meaning a sample can happen now, then 9 seconds later, then 11 seconds later, we are going to call the table NON_STANDARD_TIME.

DROP TABLE non_standard_time;
CREATE TABLE non_standard_time
(sample_time  DATE,
 cpu_data     NUMBER);

DESC non_standard_time
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 SAMPLE_TIME                                        DATE
 CPU_DATA                                           NUMBER

INSERT INTO non_standard_time
SELECT   sample_time, COUNT(*) cpu_data
FROM     dba_hist_active_sess_history
WHERE    session_state = 'ON CPU'
GROUP BY sample_time
ORDER BY 1;

8414 rows created.

(The count is slightly higher than the 8,411 rows seen earlier, most likely because a few more ASH samples were flushed to AWR between the two runs.)

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS') sample_time
      ,cpu_data 
FROM   non_standard_time
ORDER BY 1;

SAMPLE_TIME            CPU_DATA
------------------- -----------
...
...
2016-12-20T15:33:20           1
2016-12-20T15:43:41           1
2016-12-20T16:00:53           3
2016-12-20T16:11:34           5
2016-12-20T16:11:44           1
2016-12-20T16:51:08           1
2016-12-20T16:56:08           1
2016-12-20T16:56:18           1

8414 rows selected.

We need to standardize the data. Right now it is not equidistant and does not include the times where the sample count is 0; ASH data simply does not get recorded when there is no activity. In order to have equidistant sample times that include the periods with no activity, we standardize the data as follows.

The following query will show the minimum and maximum times in AWR view dba_hist_active_sess_history

COL min_time format a30
COL max_time format a30
SELECT MIN(to_char(sample_time,'YYYY-MM-DD"T"HH24:MI:SS')) min_time
      ,MAX(to_char(sample_time,'YYYY-MM-DD"T"HH24:MI:SS')) max_time
FROM   dba_hist_active_sess_history
WHERE  session_state = 'ON CPU';

MIN_TIME                       MAX_TIME
------------------------------ ------------------------------
2016-11-23T01:16:19            2016-12-20T16:56:18

This is the same information from our non_standard_time table

SELECT MIN(to_char(sample_time,'YYYY-MM-DD"T"HH24:MI:SS')) min_time
      ,MAX(to_char(sample_time,'YYYY-MM-DD"T"HH24:MI:SS')) max_time
FROM   non_standard_time;

MIN_TIME                       MAX_TIME
------------------------------ ------------------------------
2016-11-23T01:16:19            2016-12-20T16:56:18

We round the start time down to the nearest 10-second boundary:

Start Time: 2016-11-23T01:16:19 -> 2016-11-23T01:16:10

The new table should have:

2016-11-23T01:16:10
2016-11-23T01:16:20
2016-11-23T01:16:30
2016-11-23T01:16:40
...
...

Until around
2016-12-20T16:56:10

The plan is:
1. Create a new table with sample times starting at second 00 and a 10-second interval.
2. The sample times in this new table must have no gaps, meaning every 10-second interval is present.
3. Fill in this new table with the ASH data.
4. Fill in with 0s where there is no row for a specific 10-second interval.

DROP TABLE cpu_standard;
CREATE TABLE cpu_standard
(sample_time DATE,
 cpu_data    NUMBER);

The following PL/SQL block fills the table with sample times at 10-second intervals and 0s in the cpu_data column:

DECLARE
 -- first standardized sample: the minimum ASH sample time rounded down to a 10s boundary
 sample_10s DATE := TO_DATE('2016-11-23T01:16:10','YYYY-MM-DD"T"HH24:MI:SS');
BEGIN
LOOP
 INSERT INTO cpu_standard
 VALUES (sample_10s, 0);
 -- 1/(24*60*6) of a day = 10 seconds
 sample_10s := sample_10s + ( 1 / ( 24 * 60 * 6 ) );
 -- stop once we pass the maximum ASH sample time rounded down to a 10s boundary
 IF (sample_10s > TO_DATE('2016-12-20T16:56:10', 'YYYY-MM-DD"T"HH24:MI:SS'))
 THEN EXIT;
 END IF;
END LOOP;
COMMIT;
END;
/
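
As a side note, the same rows can be generated with a single set-based INSERT instead of the row-by-row loop. A sketch, to be run instead of (not in addition to) the PL/SQL block; 1/8640 of a day is 10 seconds:

INSERT INTO cpu_standard
SELECT TO_DATE('2016-11-23T01:16:10','YYYY-MM-DD"T"HH24:MI:SS') + (LEVEL - 1) / 8640
      ,0
FROM   dual
CONNECT BY TO_DATE('2016-11-23T01:16:10','YYYY-MM-DD"T"HH24:MI:SS') + (LEVEL - 1) / 8640
        <= TO_DATE('2016-12-20T16:56:10','YYYY-MM-DD"T"HH24:MI:SS');

COMMIT;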

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS') sample_time
 ,cpu_data
FROM cpu_standard
ORDER BY 1;

SAMPLE_TIME            CPU_DATA
------------------- -----------
...
...
2016-12-20T16:55:40           0
2016-12-20T16:55:50           0
2016-12-20T16:56:00           0
2016-12-20T16:56:10           0

238921 rows selected.

The table at this point has only predefined dates with 10 seconds distance between each row and all the values are 0.
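
As a sanity check, the row count matches the time span. A sketch that computes the expected number of 10-second slots between the two boundary times (8640 ten-second slots per day, plus 1 for the starting sample):

SELECT ROUND( ( TO_DATE('2016-12-20T16:56:10','YYYY-MM-DD"T"HH24:MI:SS')
              - TO_DATE('2016-11-23T01:16:10','YYYY-MM-DD"T"HH24:MI:SS') ) * 8640 ) + 1 expected_rows
FROM   dual;

EXPECTED_ROWS
-------------
       238921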

We need to bring in the data from AWR into this table

The rule we are going to follow is to truncate the seconds to their 10-second bucket; for example, a sample taken at second 09 is assigned to second 00:

seconds 
00 -> 00
01 -> 00
02 -> 00
...
08 -> 00
09 -> 00
--------
10 -> 10
11 -> 10
12 -> 10
...
18 -> 10
19 -> 10
--------
50 -> 50
51 -> 50
...
58 -> 50
59 -> 50
--------

We are simply replacing the last digit with a 0.
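
The same rule can also be written as subtracting the remainder of the seconds modulo 10 from the sample time. A sketch of this equivalent expression (86400 is the number of seconds in a day):

SELECT TO_CHAR( sample_time
              - MOD( TO_NUMBER( TO_CHAR( sample_time, 'SS' ) ), 10 ) / 86400
             ,'YYYY-MM-DD"T"HH24:MI:SS') sample_time_std
FROM   non_standard_time
WHERE  ROWNUM <= 4;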

Before we update the table cpu_standard, let's verify that the data treatment we are doing is correct.

SELECT *
FROM 
(
SELECT TO_CHAR( TRUNC( sample_time, 'mi' ) + 
      ( TRUNC( ( TO_NUMBER( TO_CHAR( sample_time, 'ss' ) ) / 10 ) ) * ( 1 / ( 24 * 60 * 6 ) ) ),
      'YYYY-MM-DD"T"HH24:MI:SS') sample_time 
FROM  non_standard_time
ORDER BY sample_time DESC
)
WHERE rownum<=4
ORDER BY 1;

SAMPLE_TIME
-------------------
2016-12-20T16:11:40
2016-12-20T16:51:00
2016-12-20T16:56:00
2016-12-20T16:56:10

Compared to the data in non_standard_time we can observe that the last digit has been converted to 0.

SAMPLE_TIME            CPU_DATA
------------------- -----------
...
...
2016-12-20T16:11:44           1
2016-12-20T16:51:08           1
2016-12-20T16:56:08           1
2016-12-20T16:56:18           1

8414 rows selected.

Now let's update the cpu_standard table to bring the non-zero values into their respective 10-second samples.

Up to this point we have two tables:

CPU_STANDARD

SQL> desc cpu_standard
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 SAMPLE_TIME                                        DATE
 CPU_DATA                                           NUMBER

NON_STANDARD_TIME has values in sample time that are not equidistant

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS') sample_time
      ,cpu_data 
FROM   non_standard_time
ORDER BY 1;

SAMPLE_TIME            CPU_DATA
------------------- -----------
...
...
2016-12-20T15:33:20           1
2016-12-20T15:43:41           1
2016-12-20T16:00:53           3
2016-12-20T16:11:34           5
2016-12-20T16:11:44           1
2016-12-20T16:51:08           1
2016-12-20T16:56:08           1
2016-12-20T16:56:18           1

8414 rows selected.

The CPU_STANDARD table has equidistant sample times, 10 seconds apart on each row, and all values still at 0:

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS') sample_time
      ,cpu_data
FROM   cpu_standard
ORDER BY 1;

SAMPLE_TIME            CPU_DATA
------------------- -----------
...
...
2016-12-20T16:55:40           0
2016-12-20T16:55:50           0
2016-12-20T16:56:00           0
2016-12-20T16:56:10           0

238921 rows selected.

The next step would be to merge the data in a way that CPU_STANDARD has the CPU_DATA from the NON_STANDARD_TIME table.

There may be several ways to do it; let's just do it explicitly, step by step, to better understand what is happening.

We are going to add a new column to the NON_STANDARD_TIME table to hold the standard time

ALTER TABLE non_standard_time
ADD (sample_time_std  DATE);

We update the newly added column with the corresponding standardized date value:

UPDATE non_standard_time
SET    sample_time_std =
 TRUNC( sample_time, 'mi' ) + 
      ( TRUNC( ( TO_NUMBER( TO_CHAR( sample_time, 'ss' ) ) / 10 ) ) * ( 1 / ( 24 * 60 * 6 ) ) );

8414 rows updated.

COMMIT;

The table NON_STANDARD_TIME now has a column with the standardized sample time.

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS') sample_time
      ,cpu_data
      ,TO_CHAR(sample_time_std,'YYYY-MM-DD"T"HH24:MI:SS') sample_time_std
FROM   non_standard_time
ORDER BY 1;

SAMPLE_TIME            CPU_DATA SAMPLE_TIME_STD
------------------- ----------- -------------------
...
...
2016-12-20T16:11:44           1 2016-12-20T16:11:40
2016-12-20T16:51:08           1 2016-12-20T16:51:00
2016-12-20T16:56:08           1 2016-12-20T16:56:00
2016-12-20T16:56:18           1 2016-12-20T16:56:10

8414 rows selected.

We now have the same standardized sample times in both tables, NON_STANDARD_TIME and CPU_STANDARD.
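
Before running the update, a sanity check, as a sketch, that counts standardized times missing from cpu_standard (the result should be 0):

SELECT COUNT(*) missing_buckets
FROM   non_standard_time ns
WHERE  NOT EXISTS ( SELECT 1
                    FROM   cpu_standard cs
                    WHERE  cs.sample_time = ns.sample_time_std );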

We can join the two tables to bring the cpu_data values into CPU_STANDARD

UPDATE cpu_standard
SET    cpu_data  = ( SELECT MAX(cpu_data)  -- MAX in case more than one raw sample maps to the same 10s bucket
                     FROM   non_standard_time ns
                     WHERE  ns.sample_time_std = cpu_standard.sample_time )
WHERE EXISTS (
  SELECT 'X'
  FROM   non_standard_time ns
  WHERE  cpu_standard.sample_time = ns.sample_time_std );

8414 rows updated.

COMMIT;
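
As one of the “several ways to do it” mentioned above, the whole correlated update can be collapsed into a single MERGE. A sketch, assuming the sample_time_std column has already been populated; run it instead of, not in addition to, the UPDATE above:

MERGE INTO cpu_standard cs
USING ( SELECT   sample_time_std, MAX(cpu_data) cpu_data
        FROM     non_standard_time
        GROUP BY sample_time_std ) ns
ON    ( cs.sample_time = ns.sample_time_std )
WHEN MATCHED THEN
  UPDATE SET cs.cpu_data = ns.cpu_data;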

SELECT TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24:MI:SS') sample_time
      ,cpu_data
FROM   cpu_standard
ORDER BY 1;

SAMPLE_TIME            CPU_DATA
------------------- -----------
...
...
2016-12-20T16:55:40           0
2016-12-20T16:55:50           0
2016-12-20T16:56:00           1
2016-12-20T16:56:10           1

238921 rows selected.

The table cpu_standard can be used to graph the cpu activity of the database and to further analyze the data for sizing purposes.
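
For instance, the hourly roll-up of maxima used in the later posts of this series can be derived from it. A sketch of how a table like CPU_STANDARD_HOURLY could be built; the name and columns follow the queries shown earlier on this page, so treat the exact definition as an assumption:

CREATE TABLE cpu_standard_hourly AS
SELECT   TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24') the_hour
        ,MAX(cpu_data)                            cpu_data
FROM     cpu_standard
GROUP BY TO_CHAR(sample_time,'YYYY-MM-DD"T"HH24');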

Author: Jorge Barba