Web Tortoise

2013-Jan-31

Arithmetic Mean Versus Geometric Mean Versus Median

Response:

Hello! This #WebTortoise post was written 2013-JAN-31 at 09:06 PM ET (about #WebTortoise).

Main Points

#- An Arithmetic Mean will, for all intents and purposes in WebTortoise World, result in a higher value than its Geometric Mean counterpart. Relative to “faster is better” in web performance, might say an Arithmetic Mean is a pessimistic calculation.

#- A Geometric Mean will, for all intents and purposes in WebTortoise World, result in a lower value than its Arithmetic Mean counterpart. Relative to “faster is better” in web performance, might say a Geometric Mean is an optimistic calculation.

#- Define: What is a Percentile?

#- See, “How do I calculate the Geometric Mean in Excel”?
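To make the first two Main Points concrete, here is a minimal Python sketch (the posts themselves use Excel). The response times are hypothetical, not measurements from this post, and the function names are made up for illustration. For strictly positive values the Arithmetic Mean is never lower than the Geometric Mean, and the two are equal only when every value is the same.

```python
import math
import statistics

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def geometric_mean(values):
    """n-th root of the product, computed through logarithms so a long
    list of large response times does not overflow."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical response times in milliseconds (assumption, for illustration).
response_times_ms = [1500, 1800, 2100, 2400, 14000]

am = arithmetic_mean(response_times_ms)
gm = geometric_mean(response_times_ms)
med = statistics.median(response_times_ms)

print(f"Arithmetic Mean: {am:.0f} ms")
print(f"Geometric Mean:  {gm:.0f} ms")
print(f"Median:          {med:.0f} ms")
```

The single slow measurement pulls the Arithmetic Mean well above the other two values, which is the "pessimistic" behavior described above.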

Story

Had an opportunity to discuss which statistical calculation should be used when looking at Performance charts. The discussion summary goes something like this.

First, assume consideration for a central-tendency calculation. Then:

If, in fact, looking for spurious outliers, consider plotting the Arithmetic Mean average.

Otherwise, consider plotting either the Geometric Mean or the Median, as they are very good central-tendency calculations.

To start, see this XY scatter plot taken from a day’s worth of synthetic test runs. In this Story, are using data from Catchpoint’s US node network (Thank you, Catchpoint), measuring approximately 3,500 times a day (about 170 per hour). Intentionally chose this webpage as it contained a third-party ad network having particular host issues (the waterfall data was invaluable for troubleshooting, but that’s a Story for another day).

Eyeballing the chart, notice the thick band of majority data is less than 5,000 ms (right around 1,500 – 3,000 ms) with thinner pockets and bands throughout. Also notice that, between roughly 10:00 AM and 02:00 PM, there were no measurements higher than around 14,000 ms.

XY Scatter Plot

Second, will take the above XY scatter plot and draw a bar graph representing the middle 25th-75th percentile range (See, “What is a Percentile”). The idea here is to show a middle range (which might better represent overall Performance) versus just a single line (which can sometimes ‘lie’ or misrepresent).
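For anyone wanting to rebuild the middle range outside of Excel, a rough Python sketch follows. The sample measurements and the function name are assumptions for illustration. Percentile definitions vary between tools, so the "inclusive" interpolation used here may differ slightly from what a given vendor or spreadsheet reports.

```python
import statistics
from collections import defaultdict

def middle_range_by_hour(measurements):
    """Return {hour: (p25, p75)} from (hour, response_time_ms) pairs."""
    by_hour = defaultdict(list)
    for hour, ms in measurements:
        by_hour[hour].append(ms)
    ranges = {}
    for hour, values in sorted(by_hour.items()):
        # quantiles(n=4) returns the three quartile cut points: p25, p50, p75.
        p25, _p50, p75 = statistics.quantiles(values, n=4, method="inclusive")
        ranges[hour] = (p25, p75)
    return ranges

# Hypothetical measurements: (hour of day, response time in ms).
sample = [
    (0, 1500), (0, 1700), (0, 2000), (0, 2600), (0, 9000),
    (1, 1400), (1, 1600), (1, 1800), (1, 2000), (1, 2200),
]

for hour, (low, high) in middle_range_by_hour(sample).items():
    print(f"{hour:02d}:00  middle range {low:.0f} - {high:.0f} ms")
```

Each hour becomes one bar running from its 25th to its 75th percentile. Note the 9,000 ms measurement in the first hour sits outside the bar and does not stretch it.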

Middle Range

Third, using the same data from the XY scatter plot, overlay line charts showing respective Arithmetic Mean, Geometric Mean and Median calculations.

Arithmetic Mean VS Geometric Mean VS Median

Critical thing to notice is the height of the Arithmetic Mean (Y axis) versus either the Geometric Mean or the Median. Notice how the Arithmetic Mean is, at times, either very near the upper limit of the middle range or, in some cases, even above the upper limit of the middle range! Now notice the Geometric Mean and Median are always comfortably within the middle range.
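The same check can be sketched in Python for a single hour. The data below is hypothetical: a tight band of measurements plus two slow outliers, loosely imitating the ad-network hiccups. The function name is made up for illustration. With more or larger outliers the Geometric Mean can also leave the middle range, so treat this as one example and not a rule.

```python
import statistics

def position_vs_middle_range(values):
    """Report where each calculation sits relative to the p25-p75 range."""
    p25, _p50, p75 = statistics.quantiles(values, n=4, method="inclusive")
    calculations = {
        "Arithmetic Mean": statistics.fmean(values),
        "Geometric Mean": statistics.geometric_mean(values),
        "Median": statistics.median(values),
    }
    report = {}
    for name, value in calculations.items():
        if value > p75:
            report[name] = "above"
        elif value < p25:
            report[name] = "below"
        else:
            report[name] = "inside"
    return report

# Hypothetical hour: 40 measurements from 1,500 to 2,475 ms, plus two outliers.
hour_ms = [1500 + 25 * i for i in range(40)] + [12000, 20000]

print(position_vs_middle_range(hour_ms))
```

Here the two outliers are enough to lift the Arithmetic Mean above the 75th percentile, while the Geometric Mean and the Median stay inside the range.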

Other:

Notice the Arithmetic Mean for the 12:00 AM and 07:00 AM hours is above the Middle Range. Now, quickly glance back at the XY scatter plot to see the measurement data.

Notice the middle ranges for the 02:00 PM and 03:00 PM hours are narrower than for other hours. Glancing back at the XY scatter plot, can see the thick band of measurement data is more tightly packed.

Last, want to give a fair warning when looking at these types of charts: The amount of the data will generally affect the height and patterns of the lines and bars. Do not be caught off guard if, for example, the Arithmetic Mean average is always above your middle range. This is a function of the amount of data.

Document Complete / OnLoad:

_The following is optional reading material._

Download Excel document: https://docs.google.com/file/d/0B9n5Sarv4oonaDZSZXNURzZrd00/edit?usp=sharing

LinkedIn: http://www.linkedin.com/in/leovasiliou

Twitter: @LvasiLiou

#CatchpointUser #KeynoteUser #GomezUser #Webtortoise #Performance #WebPerformance

#ExcelStatistics #ExcelXYScatter #ArithmeticMean #GeometricMean #Median

2012-Nov-28

Configuring “Site is Slow” Performance Alerts

Response:

Hello!  This #WebTortoise post was written 2012-NOV-28 at 10:55 AM ET (about #WebTortoise).

Question & Answer

Question: I have various measurements continually recording the Response Time of my website. Now, though, I’d like to configure some Performance alerts to know if there is a Performance degradation, but I don’t know the exact settings to choose. So, how should I configure them?

Answer: First, notice this question is about Performance versus Availability. This is an important distinction because the alert settings would be configured differently for one versus the other.

Second, this question is looking for a “good enough” place to start. For example, if there is already a Response Time threshold set by Management, then the below Webtortoise Story may or may not apply.

Now, regarding the question, the suggested answer is, “Consider using a Bayesian approach and study prior Rates of Change (explained in the below Story)”. Then consider how sensitive to configure the settings.

Fair warning, each measurement vendor would implement their alert modules in different ways and this below Story is only one specific example. The principal answer still applies, though:

Study prior Rates of Change.

Story

In Webtortoise World, how to measure website Performance, and how to alert if it degrades, is continually discussed. Have these conversations a lot, particularly with the various Operations and Production folks who’d be receiving the alert emails (even in the middle of the night!).

Have all types of Availability alerts in place, but what if the site just slows (while still being technically available)? Maybe just need to tighten the settings a bit as the holiday season approaches? Maybe just getting a bit too many alert emails and people are starting to ignore them? Maybe just this? Maybe just that? Well, without further ado…

Step 01. Have a test measurement in place and let it run for a few days or a few weeks (the larger the sample size the better). The idea here is we’ll be looking “back” at the data to help determine the “forward” setting of the Performance alert.

Step 02. Decide the alert attributes. In this Story, we’ll be alerting on the Full Webpage Response Time metric, comparing the delta between the latter hour and the former hour. If the Rate of Change from one hour to the next is above a certain threshold, then send an alert email.

As mentioned, each measurement vendor would implement their alert modules in different ways. Please remember the attributes in this Story are only one specific example.

Step 03. Calculate the Rates of Change from one hour to the next. For example, if Response Time for the Midnight hour is 1,517 ms and if Response Time for the 01:00 AM hour is 1,503 ms, then the Rate of Change is 0.92% (1,517 minus 0.92% of 1,517 equals 1,503). This Excel sheet contains the formulas for calculating this Rate of Change. If Response Time for the 01:00 AM hour is 1,503 ms and if Response Time for the 02:00 AM hour is 1,532 ms, then the Rate of Change is 1.93% (1,503 plus 1.93% of 1,503 equals 1,532).
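As a cross-check on the arithmetic in Step 03, here is a small Python sketch. The Excel formula itself is not reproduced in this post, so the formula below (absolute difference divided by the former hour) is an assumption that happens to reproduce the two worked examples. The function names are made up for illustration.

```python
def rate_of_change_pct(former_ms, latter_ms):
    """Unsigned percent change from the former hour to the latter hour."""
    if former_ms <= 0:
        raise ValueError("former value must be positive")
    return abs(latter_ms - former_ms) / former_ms * 100

def hourly_rates_of_change(hourly_ms):
    """Rate of Change for each consecutive pair of hourly values."""
    return [
        rate_of_change_pct(former, latter)
        for former, latter in zip(hourly_ms, hourly_ms[1:])
    ]

# The three hourly values from the examples in Step 03.
hourly_ms = [1517, 1503, 1532]
rates = hourly_rates_of_change(hourly_ms)

print([round(r, 2) for r in rates])  # [0.92, 1.93]
```

A list of N hourly values yields N minus 1 Rates of Change, since each rate needs a former hour to compare against.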

May have noticed the sign of the Rate of Change (positive or negative) is being discarded. For the purpose of this Story, that is okay.

However:

Note to all Performance measurement providers: Most have capabilities to alert on only Response Time INCREASES. Consider adding capability to alert also on Response Time DECREASES as they can be just as indicative of a problem.
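To illustrate what alerting in both directions could look like, here is a hedged Python sketch. It is not any vendor's alert module; the function names and the 50% default threshold are assumptions for illustration only.

```python
def signed_rate_of_change_pct(former_ms, latter_ms):
    """Percent change that keeps its sign: positive means slower."""
    if former_ms <= 0:
        raise ValueError("former value must be positive")
    return (latter_ms - former_ms) / former_ms * 100

def classify_change(former_ms, latter_ms, threshold_pct=50.0):
    """Flag large moves in either direction (threshold is hypothetical)."""
    change = signed_rate_of_change_pct(former_ms, latter_ms)
    if change >= threshold_pct:
        return "ALERT: Response Time increase"
    if change <= -threshold_pct:
        return "ALERT: Response Time decrease"
    return "ok"

print(classify_change(1500, 2400))  # ALERT: Response Time increase
print(classify_change(1500, 600))   # ALERT: Response Time decrease
print(classify_change(1517, 1503))  # ok
```

Keeping the sign costs nothing at calculation time and leaves the choice of which direction to alert on for later.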

Finish calculating the Rates of Change. In this Excel sheet, the Rates of Change are calculated for six weeks of test measurement data, by the hour (a total of 1,008 hours). The formula in column D will always give a positive number (except when the Rate of Change is zero), and column D has been formatted to display a percentage.

Step 04. Now use a Frequency Distribution on the Rates of Change (for a refresher on Frequency Distributions, consider reading Webtortoise: What the Frequency?) to answer the question(s), “How many Rates of Change were less than 1%? How many Rates of Change were between 1-2%? How many Rates of Change were between 2-3%?” And so on.
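The binning in Step 04 can be sketched in Python as well. The rates below are hypothetical, not the six weeks of data from the Excel sheet, and the bin edges are an assumption. Each bin counts values less than or equal to its upper bound and greater than the previous bound, with one extra bin for anything above the last bound; this is meant to mirror how Excel's FREQUENCY function bins, though that mapping is worth checking against the sheet.

```python
from bisect import bisect_left

def frequency(values, upper_bounds):
    """Count values per bin. Bin i holds values <= upper_bounds[i] and
    greater than the previous bound; a final extra bin catches the rest.
    upper_bounds must be sorted in ascending order."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        counts[bisect_left(upper_bounds, value)] += 1
    return counts

# Hypothetical Rates of Change, in percent.
rates_pct = [0.92, 1.93, 0.4, 2.5, 7.0, 1.1, 55.0]
# Hypothetical bin upper bounds, in percent.
bounds_pct = [1, 2, 3, 20, 50]

counts = frequency(rates_pct, bounds_pct)
labels = ["<= 1%", "1-2%", "2-3%", "3-20%", "20-50%", "> 50%"]
for label, count in zip(labels, counts):
    print(f"{label:>7}: {count}")
```

The counts answer the Step 04 questions directly: how many Rates of Change fell into each band.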

The Frequency Distribution will answer these questions and, in the same Excel sheet, can see most Rates of Change are between zero and twenty-ish percent. Now, given most Rates of Change, from one hour to the next, are less than 20%, should the alert threshold be set to less than 20%? …

Probably not.  Unless many alert emails are desired.

If the threshold setting is meant to alert on only the most egregious Performance degradations, then maybe set the alert threshold to 50% or greater. Looking again at the Frequency Distribution, can see a Rate of Change greater than 50% occurred eight times in the last six weeks. If the threshold setting is meant to alert on some other condition, then can look at the Frequency Distribution to get an idea of how sensitive the setting should be. At this point, consider other relevant items to determine how sensitive the threshold setting should be. Otherwise, the threshold setting will come down to making a choice and iterating.
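One way to "look back" before choosing a threshold is to count how many alerts each candidate threshold would have sent. A minimal Python sketch follows. The rates are hypothetical, not the data from the Excel sheet, and the function name is made up for illustration.

```python
def exceedances(rates_pct, thresholds_pct):
    """For each candidate threshold, count how many past Rates of Change
    exceeded it, i.e. how many alert emails it would have produced."""
    return {
        threshold: sum(1 for rate in rates_pct if rate > threshold)
        for threshold in thresholds_pct
    }

# Hypothetical historical Rates of Change, in percent.
history_pct = [0.9, 1.9, 4.0, 12.5, 18.0, 22.0, 35.0, 51.0, 64.0]

would_have_alerted = exceedances(history_pct, [10, 20, 50])
for threshold, count in would_have_alerted.items():
    print(f"threshold {threshold}%: {count} alerts")
```

Dividing each count by the number of weeks of history gives a rough alerts-per-week figure, which is a useful number to bring to the folks receiving the emails.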

Document Complete / OnLoad:

_The following is optional reading material._

Here’s the traditional, time-based line chart for the test measurement used in this post.  It is for a 6-week period, by the hour, totaling 1,008 data points.

Download the excel sheet here:  https://docs.google.com/open?id=0B9n5Sarv4oonZWJVMU9QTTlzSGM

Webtortoise Author on LinkedIn:  http://www.linkedin.com/in/leovasiliou

Webtortoise Author on Twitter:  https://twitter.com/Lvasiliou

#CatchpointUser #KeynoteUser #GomezUser #Webtortoise #Performance #WebPerformance

#ExcelStatistics #FrequencyDistribution

2012-Jun-28

What the Frequency?

Filed under: Performance | leovasiliou @ 01:43 PM EDT

Response:

Hello! This was written 2012-JUN-28 at 01:26 PM ET.

Question: What is a frequency distribution?

Answer: A frequency distribution is a powerful, non-linear way to analyze your website performance data. It is a way to show the number of times (frequency) a value appears in a given range or interval (distribution). Consider these two basic examples:

Example One:

Take the following numbers 1, 2, 4, 7 and 9. How many are between 1-5? How many are between 6-10? (Three numbers (numbers 1, 2 and 4) are between 1-5. Two numbers (numbers 7 and 9) are between 6-10.)

Example Two:

Take the following hypothetical test letter grades A, B, A, B, A, C, B and D. How many of each letter grades are there? (There are three “A”. There are three “B”. There is one “C”. There is one “D”.)
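Both examples are small enough to verify by hand, and the same counting can be sketched in a few lines of Python. The variable names are made up for illustration.

```python
from collections import Counter

# Example One: count how many numbers fall in each range (inclusive).
numbers = [1, 2, 4, 7, 9]
ranges = [(1, 5), (6, 10)]
range_counts = {
    f"{low}-{high}": sum(1 for n in numbers if low <= n <= high)
    for low, high in ranges
}
print(range_counts)  # {'1-5': 3, '6-10': 2}

# Example Two: count how many times each letter grade appears.
grades = ["A", "B", "A", "B", "A", "C", "B", "D"]
grade_counts = Counter(grades)
print(dict(grade_counts))  # {'A': 3, 'B': 3, 'C': 1, 'D': 1}
```

Example One bins numeric values into ranges, while Example Two counts distinct categories. Response Time data is the first kind.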

Now we will use Excel’s FREQUENCY function to distribute thousands of Webpage Response times. Fear not! Just see the attached Excel sheet showing how we came up with the following frequency distribution chart:

Document Complete / OnLoad:

_The following is optional reading material._

Download the excel file here: https://docs.google.com/open?id=0B9n5Sarv4oonRVZGTFlmaWk4S3c

#ExcelArrayFormula #ExcelStatistics #CatchpointUser #KeynoteUser #GomezUser #FrequencyDistribution

2012-May-03

Half Full or Half Empty? Choosing a Statistical Calculation

Response:

Hello! This was written 2012-MAY-03 at 12:27 PM ET.

Question: In your life as a Keynote or Catchpoint user, suppose you have a day’s worth of website response time performance data. Should you average using the Arithmetic Mean or the Geometric Mean?

Answer: Assume you want to use a central-tendency calculation like a mean (we’ll talk about percentiles in another post). Since the Geometric Mean will result in a lower value than the Arithmetic Mean, you might say the Geometric Mean is an “optimistic calculation” whereas the Arithmetic Mean is a “pessimistic calculation”. See below, going from raw data to line chart, then decide for yourself when to use either of the two calculations.

First, the below scatter plot shows a day’s worth of website response time data. In this case, there are about 2,880 total data, or about 120 for each of the 24 hours in the day. Notice the single spurious outlier in the 03:00 AM hour.

Second, the above scatter plot is then transformed into the below line graph. The blue line calculates the data using the Arithmetic Mean and the red line calculates the same set of data using the Geometric Mean. The single spurious outlier in the 03:00 AM hour caused a “spike” to appear in the line graph.

Fair warning, the perceived impact (a.k.a. “The Spike”) will depend on how many data are being averaged. The point remains the same, though: The Arithmetic Mean is more subject to skew from spurious outliers than the Geometric Mean. So, depending on your situation, you may want to choose one or the other (or both to see the delta, which is another powerful way to analyze).
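The size of “The Spike” can be estimated with a short Python sketch. The numbers are hypothetical: a perfectly steady hour at 1,500 ms joined by one 60,000 ms outlier. The function name is made up for illustration. It also shows the fair warning above, since the same outlier matters far more when fewer data are being averaged.

```python
import statistics

def spike_pct(values, outlier):
    """Percent increase in each mean when one outlier joins the hour.
    Returns (arithmetic_mean_spike, geometric_mean_spike)."""
    with_outlier = values + [outlier]
    am_before = statistics.fmean(values)
    am_after = statistics.fmean(with_outlier)
    gm_before = statistics.geometric_mean(values)
    gm_after = statistics.geometric_mean(with_outlier)
    return (
        (am_after / am_before - 1) * 100,
        (gm_after / gm_before - 1) * 100,
    )

# Hypothetical hour of 119 steady measurements plus one spurious outlier.
am_spike, gm_spike = spike_pct([1500] * 119, 60000)
print(f"120 data: Arithmetic Mean +{am_spike:.1f}%, Geometric Mean +{gm_spike:.1f}%")

# Same outlier, but only 10 data in the hour.
am_small, gm_small = spike_pct([1500] * 9, 60000)
print(f" 10 data: Arithmetic Mean +{am_small:.1f}%, Geometric Mean +{gm_small:.1f}%")
```

In both cases the Arithmetic Mean moves much further than the Geometric Mean, and plotting both lines makes the delta between them visible.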

Document Complete / OnLoad:

_The following is optional reading material._

For a refresher on calculating the geometric mean in Excel, see this post: https://webtortoise.com/2012/03/16/geometric-mean-in-excel/

Download the excel file, which includes the raw data and the line chart calculations, here:  https://docs.google.com/open?id=0B9n5Sarv4oonRGc5UmNpNHhjOGs

#Keynote ; #Catchpoint ; #Gomez ; #KeynoteUser ; #Catchpointuser ; #Gomezuser ; #ArithmeticMeanvsGeometricMean ; #ExcelStatistics
