Digital is easy. Try A/B testing in the real world.

If I were going to recommend one book for an executive to read about the power of controlled experiments in business, it would be Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society by Jim Manzi. It’s impossible to read this book and not walk away enthused about the potential of testing and experimentation.

Jim is the founder and chairman of Applied Predictive Technologies (APT), which bills itself as the world’s largest, purely cloud-based, predictive analytics software company.

But one of their key specialities has been enabling companies to better run controlled experiments — what they call test-and-learn — not just in digital, but in the physical world and complex multi-channel environments.

They’ve worked with Walmart, McDonald’s, Victoria’s Secret, Walgreens, Starbucks, Shell Oil, Food Lion, Hilton, AutoZone, Payless, Red Lobster, Staples, Denny’s, PetSmart, Barnes & Noble, RadioShack, Guitar Center, Lowe’s, and the list goes on and on. Their credentials in controlled experiments in business, particularly retail, are unparalleled.

Jim has been one of my heroes ever since I read his book. And so you can imagine my delight when I was able to interview him for the following Q&A:

Can you start by telling us a bit about your background and the work that APT does?

I studied math at MIT and then after graduate school, I worked briefly with AT&T in the labs. I then worked for about 10 years for a corporate strategy consulting firm that was spun out of the Boston Consulting Group with the concept of applying information technology and deep data analysis to corporate strategy.

While there, I came to the view, by the late 1990s, that decision making at large companies would be greatly improved if they could run tests and experiments to test their theories. And this would be better done via software, for a number of reasons, rather than trying to do it through services.

So a couple years after leaving consulting, I started APT with the concept of applying rapid, iterative experimentation — which we call test and learn — to help large, typically consumer-focused companies know the cause-and-effect relationship between proposed business programs and financial and other kinds of outcomes.

So at a very practical level and a simplified example: if I run a large retailer, and somebody comes along and says, “I’ve got a training program, that if you do this for the folks who work in your stores, for your sales associates, your stores will sell more, and you will make more money.” You can kind of run focus groups and theorize about it and so on. Or you can just go into a small number of stores, train some of the sales associates, and see what happens.

You might think, if you know your business well, that if you go in and train the associates at ten stores, you ought to be able to look at sales trends and so on and determine if it helped or not. But it turns out it’s usually a lot trickier than you might think to be able to do that. Basically, because typically a million other things are changing at the same time. The weather got really bad in three of those ten stores during that test period, and one of the managers changed out, and that list is almost infinitely long.

So what our technology is designed to do is support our clients in the design and interpretation of those in-market tests, so that they can reliably predict the effects of the program, understand if they can be targeted — meaning, is it going to work better for some customers, some stores, some markets, and so on, than others — and, third, can it be improved? Can we figure out through testing that, in fact, this part of the program is effective, for example, and that part is not?

There’s increasing talk about experimentation in digital marketing, such as A/B testing on websites. Your company engages in these real-world experiments that we don’t hear as much about. What’s your perspective on similarities between testing in the digital space versus in the physical world?

The big difference is a math problem. They are structurally identical in terms of the analysis at a certain level of abstraction. Every test is really, ultimately an A/B test. And our technology is used in a digital environment and in a multi-channel environment. For example, I am running a test: what is the causal effect of, say, a search heavy-up on the total performance of my business, not just the online channel, but the total business. And so, there’s a lot of application of our technology directly to testing in the digital world.

If you think about businesses or business channels of distribution, they are divided into two types. One, I’ll call direct channels of distribution: web sites, email, outbound snail mail, and so on. And the other is any kind of distribution that happens through a distribution system, say, a set of stores, a salesforce, branches, ATMs if you’re a bank, etc.

If you are a pure online web retailer, to keep it simple, running an A/B test is analytically very straightforward. I randomly assign 50,000 view sessions to see the website with a red background, randomly assign 50,000 view sessions to see the website with a yellow background, and I measure click-through and behavior differences and so on, and I just determine from the test which works better.
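To make the "analytically very straightforward" case concrete, here is a minimal Python sketch of how such a test might be read with a standard two-proportion z-test. The conversion counts are hypothetical, the function name is ours, and this is a textbook method, not a description of APT's technology.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (difference, z, two-sided p-value) for rate_b - rate_a."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: 50,000 sessions per background color.
diff, z, p = two_proportion_z_test(conv_a=2500, n_a=50_000,
                                   conv_b=2650, n_b=50_000)
print(f"lift={diff:.4f} z={z:.2f} p={p:.4f}")
```

With tens of thousands of sessions per arm, the normal approximation behind this test is generally reasonable, which is part of why the digital case is the easy one.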

The problem when I am a multi-channel marketer, and I want to apply that concept to a physical channel of distribution, is that I can’t randomly paint 50,000 stores one color, and 50,000 stores another color. I can paint 18 of them or 12 or 63.

When the sample size gets small, what happens is, to be nerdy for a minute, the Law of Large Numbers doesn’t really apply, and therefore a lot of classical statistics breaks down because the sample sizes are too small. That is, believe it or not, the root reason in our observation, why it is so hard to apply testing in the 94% of the economy that is not online commerce. And really, that’s the math problem again, for which we built a huge amount of algorithmic technology.
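One way to see the small-sample problem is to simulate an "A/A test": two groups of stores that receive no treatment at all, and ask how far apart their average sales land by chance. The sketch below assumes store sales are normally distributed with a made-up mean of 100 and standard deviation of 20; real store data is messier, and this illustrates only the noise problem, not APT's algorithms.

```python
import random
import statistics

def aa_difference_spread(n_per_group, trials=300, mean=100.0, sd=20.0, seed=0):
    """Std. dev. of (mean of group B - mean of group A) with NO treatment applied."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        # Both groups are drawn from the same hypothetical sales distribution.
        a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        diffs.append(statistics.fmean(b) - statistics.fmean(a))
    return statistics.stdev(diffs)

# Chance differences between identical groups shrink as group size grows.
for n in (12, 63, 2000):
    print(n, round(aa_difference_spread(n), 2))
```

Under these assumptions the chance gap between two 12-store groups is typically several percent of average sales, which can easily be larger than the effect a program is expected to produce.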

Part of this is the technical challenge. But there was a survey by the Corporate Executive Board, where they asked Fortune 1,000 marketers, how good is your organization at test-and-learn experiments. Something like 74% ranked themselves less than effective. How much of that is technical? How much of that is culture? What’s your perspective?

Well, there’s some of both.

We have about 20% of the data in the consumer economy in the United States going through our system. I think that the majority of large companies who are trying to do testing — again, in anything other than a pure, direct environment, where it’s a lot more straightforward — really do have a technical challenge.

And the way it manifests itself is, I run a test, and I can have lots of reasonable ways of interpreting the test, which lead to different conclusions. And therefore, it’s a very frustrating experience for most organizations. That is actually a big problem for a majority of the large marketers in the U.S. We now have very good market share, but it’s less than half.

Second, in any organization that attempts to use any kind of data to make decisions, as you well know, there are all kinds of organizational realities that make it non-trivial to do that. You need executive commitment. You need reasonably well-designed business processes to generate and use that information. You need the right organizational structure.

Both are problems, but in effect, there’s a chicken-and-egg issue. Because, unless you do the kind of things we’re doing, it’s very difficult to get a clear read from a test. It is therefore the case that it’s very difficult to build all the other stuff that has to go around it, in terms of organization design, business process design, training, executive commitment. Because all of that is built on a foundation of sand if you can’t basically run an A/B test and read it correctly, which is the root of all of this. Which again, if you try to deploy the standard methods, you can’t really do that correctly outside of a digital environment or outside of a direct channel environment.

One thing in particular: deciding what to test and how bold the alternatives that are being tested are. What’s your sense of the appetite of companies to try more daring experiments versus less adventurous alternatives?

We call the distinction you’re drawing: testing the right things versus testing things right.

A company with a well-run testing function will have a portfolio of tests being run in a year that range from quite incremental, operational tests to more visionary, bold tests. An optimal portfolio does have a mix of one to the other and shades of gray in between.

There are two kinds of risks that tests can create. One is brand risk. Many companies, no matter their level of testing capability, understandably and correctly, are not going to run tests that they believe will put their consumer brand at risk. And I don’t think that will ever change as a result of championing testing.

A different kind of risk you can create through testing is economic risk. Let’s say I’m going to test an extremely deep discount. When I do the math on the percentage of my business I need to put at test in order to measure that, it can be non-trivial. And therefore, if I believe my current discount structure is correct, I am pro-forma putting a significant total number of dollars at risk.
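The "math on the percentage of my business" can be sketched with a textbook sample-size formula for comparing two group means. Everything here is an assumption for illustration: the chain size, the per-store sales variability, the effect size, and the use of a simple two-sample normal approximation rather than whatever method a given company actually uses.

```python
import math
from statistics import NormalDist

def stores_per_group(effect, sd, alpha=0.05, power=0.8):
    """Stores needed in EACH of test and control to detect `effect` (same units as sd)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)

# Hypothetical chain: 2,000 stores, per-store sales sd of 20 on a mean of 100.
chain_size = 2000
for effect in (10, 5, 2):
    n = stores_per_group(effect, sd=20)
    share = 2 * n / chain_size
    print(f"effect={effect}: {n} stores per group, {share:.0%} of the chain")
```

The required number of stores grows with the inverse square of the effect size, so halving the effect you want to detect roughly quadruples the share of the business that has to be put into the test.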

As a testing function becomes more effective, typically the willingness to do that goes up to some extent. Especially because the organization learns that through testing they can create break-out insights. Most of those more extreme tests will not prove out an alternative; that is, they won’t show that there’s a new champion, an alternative in that hypothetical discount-structure example that makes more money. But in this drilling-oil-wells sense, I become more willing to try those because, across the series of tests it takes to find one, I will have a low percent hit rate but an enormous total dollar value of success.

So I think that organizations will tend to up the mix of more outside-the-box tests as their testing capability improves. The boundary that doesn’t ever change is putting the brand at risk.

The recent explosion of testing software just in the digital space — it feels like there’s been a considerable expansion of the number of people who are willing to just jump in and start experimenting with tests. How do you look at that? Is this generally a good thing that we’re getting more people testing? Or are there risks people are taking by jumping in without considering issues such as brand risk or economic risk?

You know, I think we’re on the right side of history.

An increasing proportion of business decisions will be made based on test-and-learn experiments over time. I think that it’s rational. I think technology allows us to design and interpret correctly many more kinds of experiments at low cost than we could have even 10 years ago.

Obviously, when any new function is introduced, which is not marginal but is at the economic heart of the business, the risk of doing it incorrectly, meaning the economic cost of doing it incorrectly, has already risen and will continue to rise. I think in that way, it’s like any other activity which has gone on for a long time, but is now becoming a more systematized process and sitting more at the heart of the business.

One of the things I find fascinating about the book you wrote is your exploration of test-and-learn beyond business settings, to help with social challenges and government. How is that idea developing? Is there reception for it in some quarters? Do you think we will see more of that in the years ahead?

Yes, I do.

The reception on that front has been very positive.

In fact, if you go to the current-year President’s budget for the United States, you’ll find a paragraph in there on the desire of the federal government to institute more rapid, iterative testing of the kind that’s been done in businesses, which is a direct outgrowth of these kinds of conversations.

All innovation faces resistance. I don’t expect the world to change tomorrow morning. But in general, I’ve been very pleased with the reaction I’ve seen.

Thank you, Jim!
