Kevin Hillstrom: MineThatData

Exploring How Customers Interact With Advertising, Products, Brands, and Channels, using Multichannel Forensics.

January 03, 2008

Testing Issues

Recall that my focus in 2008 is on multichannel profitability.

Experimental design (aka 'tests') is one of the most useful tools available to help us understand multichannel profitability.

We run into a ton of problems when designing and analyzing 'tests'. Let's review some of them.


Problem #1: Statistical Significance

Anytime we want to execute a test, a statistician will want to analyze the test (remember, I have a statistics degree --- I want to analyze tests!).

In order to make sense of the conclusions, the statistician will introduce the concept of "statistical significance". In other words, the statistician will tell you if the difference between a 3.0% and 2.9% click-through rate is "meaningful". If, according to statistical equations, the difference is not deemed to be "meaningful", the statistician will tell you to ignore the difference, because the difference is not "statistically significant".
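To make that concrete, here is a minimal sketch (mine, not part of the original argument) of the two-proportion test a statistician would run on the 3.0% vs. 2.9% example. The send volumes are assumptions, chosen only to show that the verdict depends almost entirely on how many e-mails were sent:

import math

def two_proportion_p_value(clicks_a, sends_a, clicks_b, sends_b):
    """Two-sided p-value for the difference between two click-through rates."""
    p_a = clicks_a / sends_a
    p_b = clicks_b / sends_b
    pooled = (clicks_a + clicks_b) / (sends_a + sends_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 3.0% vs. 2.9% on 50,000 sends per side: p is about 0.35, "not significant".
print(two_proportion_p_value(1500, 50000, 1450, 50000))
# The same rates on 2,000,000 sends per side: p is tiny, "highly significant".
print(two_proportion_p_value(60000, 2000000, 58000, 2000000))

Same 0.1-point difference, two opposite verdicts: the "meaningful" label comes from the sample size as much as from the business impact.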

Statisticians want you to be right 90% of the time, or 95% of the time, or 99% of the time.

We all agree that this is critical when measuring the effectiveness of a cure for AIDS. We should all agree that this isn't so important when measuring the effectiveness of putting the shopping cart in the upper right-hand corner of an e-mail campaign.

Business leaders are seldom given opportunities to capitalize on something that will work 95% of the time. Every day, business leaders make decisions on instinct and gut feel, without any data at all. Knowing that something will work 72% of the time is a blessing!

Even worse, statistical significance only holds if the conditions that existed at the time of the test are identical to the conditions that exist today. Business leaders know that this assumption can never be met.

Test often, and don't limit yourself to making decisions only when you're likely to be right 99% of the time. You'll find yourself never making meaningful decisions if you have to be right all the time.


Problem #2: Small Businesses

Large brands have testing advantages. A billion-dollar business can afford to hold out 100,000 customers from a marketing activity. The billion-dollar business gets to slice and dice this audience fifty different ways, feeling comfortable that the results will be consistent and reliable.

Small businesses are disadvantaged. If you have a housefile of 50,000 twelve-month customers, you cannot afford to hold out 10,000 from a catalog or e-mail campaign.

However, a small business can afford to hold out 1,500 of those 50,000 twelve-month customers. The small business will not be able to slice and dice the data the way a large brand can. The small business will have to make compromises.

For instance, look at the variability associated with ten customers, four of whom spend money:
  • $0, $0, $0, $0, $0, $0, $50, $75, $150, $300.
    • Mean = $57.50.
    • Standard Deviation = $98.63.
    • Coefficient of Variation = $98.63 / $57.50 = 1.72.
Now look at the variability associated with measuring response (purchase = 1, no purchase = 0):
  • 0, 0, 0, 0, 0, 0, 1, 1, 1, 1
    • Mean = 0.40.
    • Standard Deviation = 0.516.
    • Coefficient of Variation = 0.516 / 0.40 = 1.29.
The small company can look at response, realizing that response is about twenty-five percent "less variable" than the amount of money a customer spent (a coefficient of variation of 1.29 versus 1.72).
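For readers who want to check those figures, here is a quick sketch using Python's statistics module (its stdev function uses the n-1 formula, which matches the numbers quoted above):

import statistics

spend = [0, 0, 0, 0, 0, 0, 50, 75, 150, 300]
response = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

for name, data in [("spend", spend), ("response", response)]:
    mean = statistics.mean(data)
    sd = statistics.stdev(data)   # n-1 in the denominator
    cv = sd / mean                # coefficient of variation
    print(f"{name}: mean = {mean:.2f}, stdev = {sd:.3f}, cv = {cv:.2f}")

# Prints roughly: spend: mean = 57.50, stdev = 98.637, cv = 1.72
#                 response: mean = 0.40, stdev = 0.516, cv = 1.29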

Small companies need to analyze tests, sampling 2-4% of the housefile in a holdout group, focusing on response instead of spend. The small company realizes that statistical significance may not be achievable. The small company looks for "consistent" results across tests. The small company replicates the rapid test analysis document, using response instead of spend.


Problem #3: Timeliness

The internet changed our expectations for test results. Online, folks are testing strategies in real-time, adjusting landing page designs on Tuesday morning based on results from a test designed Monday morning, executed Monday afternoon.

In 1994, I executed a year-long test at Lands' End. I didn't share results with anybody for at least nine months. What a mistake. We had spirited discussions from month ten to month twelve that could have been avoided if communication had started sooner.

Start analyzing the test right away. Share results with everybody who matters. Adjust your results as you obtain more information. It is okay if the results change from month two to month three to month twelve, as long as you tell leadership that results may change. Given that online marketers are making changes in real time, you have to be more flexible.


Problem #4: Belief

You're going to obtain results that run contrary to popular belief.

You might find that your catalog drives less online business than matchback results suggest. You might find that advertising women's merchandise in an e-mail campaign causes customers to purchase cosmetics.

You might find that your leadership team dismisses your test results, because the results do not hold up to what leadership "knows" to be true.

Just remember that people once thought the world was flat, that the universe orbited Earth, and that subprime mortgages could be packaged with more stable financial instruments for the benefit of all. If unusual results can be replicated in subsequent tests, the results are not unusual.

Leadership folks aren't a bunch of rubes. They have been trained to think a certain way, based on the experiences they've accumulated over a lifetime. It will take time for those willing to learn to change their point of view. It does no good to beat them over the head with "facts".


4 Comments:

At 2:21 PM, Anonymous said...

I think the standard deviation you show for the response data is the n-1 sample estimate, sqrt(sum((x_i - x_mean)^2) / (n - 1)), which works out to 0.516, as opposed to the population formula sqrt(sum((x_i - x_mean)^2) / n) = sqrt(p*(1-p)), where p is the response rate, which evaluates to 0.49.

I think sizing the test sample as a percentage of the housefile is just a heuristic. The sample size for the test should be based on how much variation there is in the data (the underlying universe for the test), measured in standard deviations, and on how small a change in effect we want to be able to detect (e.g. the change in response rate for test vs. control).
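To illustrate that point with a rough sketch (the 2% response rate and 10% lift below are my assumptions, not figures from the post), the required sample per group falls out of the baseline rate and the smallest lift worth detecting, not out of the housefile size:

import math
from statistics import NormalDist

def n_per_group(p_base, lift, alpha=0.10, power=0.80):
    """Approximate n per group to detect a relative lift in response rate."""
    p_test = p_base * (1 + lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired power
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_test - p_base) ** 2)

# A 10% lift on a 2% base rate, at 90% confidence and 80% power, needs
# roughly 63,500 customers per group -- more than the entire 50,000-person
# housefile discussed in the post.
print(n_per_group(0.02, 0.10))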

Thank you.

 
At 4:05 PM, Kevin Hillstrom said...

Hello Anonymous.

Imagine all the folks who are being misled by SPSS' Explore Procedure and Microsoft Excel's STDEV function, both of which report the n-1 estimate you questioned.

Theoretically, you're 100% correct, and my readers will appreciate the clarification you offered.

Now, what I need from you is a solution to a problem.

Take the catalog CEO of a small business. This CEO has a housefile of 20,000 customers she can mail catalogs to.

If she wants to follow your prescription, and use standard deviations to measure the change in response for test vs. control, she may find that she needs to hold out 8,000 customers from an upcoming catalog mailing in order to learn something at a level of accuracy that pleases a statistician.

In other words, she has to give up the revenue from 40% of her housefile for that mailing in order to learn what she needs to learn in a way that satisfies a statistician.

The CEO will not go for this. Nor will she go for holding out 4,000 people, giving up the revenue from 20% of her housefile.

I bring this up because I run into this problem every day --- I calculate the sample size needed to measure $/book, and the math doesn't work out favorably.
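To make the trade-off concrete (a sketch only; the 2% base response rate is an assumption, not a figure from this exchange), here is the smallest relative lift in response that each holdout size could even detect at 90% confidence and 80% power:

import math
from statistics import NormalDist

def min_detectable_lift(n_mailed, n_holdout, p_base, alpha=0.10, power=0.80):
    """Smallest relative lift in response detectable with the given group sizes."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    se = math.sqrt(p_base * (1 - p_base) * (1 / n_mailed + 1 / n_holdout))
    return z * se / p_base

housefile = 20000
for holdout in (8000, 4000, 1500):
    lift = min_detectable_lift(housefile - holdout, holdout, 0.02)
    print(f"holdout {holdout:>5}: smallest detectable lift ~ {lift:.0%}")

# Prints roughly: 8,000 -> 25%, 4,000 -> 31%, 1,500 -> 47%. Small holdouts
# can only pick up very large differences, which is exactly the bind.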

So, I need a solution from you: tell this audience of catalog executives how you would solve this problem for this CEO.

 
At 5:17 PM, Anonymous said...

Thank you.

In such cases, I usually play around with some of the parameters: the probability of observing x% of the change in effect I want to see, or the significance level. For example, suppose the test was designed around a 2% response rate (or around X1 dollars per book (DPB), with Y1 standard error of DPB), with a 10% change in effect at a 90% confidence level, and the required sample came out to be N1. I tinker with N1, the confidence level, and the observable % change in effect: I reduce N1 and recompute the probability of observing 50-60-70-80-99% of the 10% change in effect at numerous confidence levels. By playing with the difference (test vs. control) distribution, the sample size, and the confidence level, one can reduce the original sample size considerably.
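A rough sketch of that tinkering, under assumed figures (a 2% base rate and a 10% intended lift; these are illustrations, not the commenter's actual worksheet): as the per-group sample shrinks, recompute the chance of detecting the full effect at a couple of confidence levels.

import math
from statistics import NormalDist

def power(n_per_group, p_base, lift, alpha):
    """Approximate power of a two-sided, two-proportion test."""
    p_test = p_base * (1 + lift)
    se = math.sqrt((p_base * (1 - p_base) + p_test * (1 - p_test)) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_alpha - abs(p_test - p_base) / se)

for n in (60000, 30000, 15000, 5000):
    p90 = power(n, 0.02, 0.10, alpha=0.10)   # 90% confidence
    p80 = power(n, 0.02, 0.10, alpha=0.20)   # 80% confidence
    print(f"n per group {n:>6}: power {p90:.0%} at 90% conf., {p80:.0%} at 80% conf.")

# Power falls from roughly 78% at 60,000 per group to 17% at 5,000 (90% conf.):
# a smaller test is still readable, but only for a correspondingly larger
# or luckier effect.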

For a small housefile (in fact, for any size of housefile), the economics should decide whether to run a test or not. Another approach is to split the housefile into segments and see how much "there" is there to get some measurable observations. Maybe the standard error in average order value is what is pushing toward big sample sizes. Generally, the smaller the change we want to detect and/or the lower our response rate, the larger the sample size we need. The sample size required for the segment of best customers will be much smaller than N1 above. If there is flexibility in how the test is run, the economics of stratified samples can be worked out to see how much revenue would be lost or gained due to the testing.

It is very important to keep in mind how much change in effect we can observe at a given confidence level. It is not an all-or-nothing problem. Theoretically, one can interpret test results at any sample size for some confidence level and some change in effect of response. The real question is whether we can move along different dimensions (confidence level, % change in effect) and still reach the same conclusions from the observations.

 
At 6:27 PM, Kevin Hillstrom said...

Now you're doing better, thanks.

 
