Semantic advertising contextual corpus

A reader recently commented on my semantic advertising post with a great question: exactly how well does semantic advertising do compared with plain old contextual advertising? Can the difference be quantified in a way that’s independently verifiable?

It should be.

I understand why it isn't today. There are hundreds of advertising networks, each with its own techniques for targeting and placement, locked in frenzied competition in a young market with few barriers to entry and a constantly evolving technical landscape. Each also wants to protect its trade secrets and to maintain (how to say it?) creative license in how it markets its advantages, based on its own in-house tests and statistical evidence.

But as semantic advertising and contextual advertising mature and converge, the need for an industry-standard objective test of contextual/semantic accuracy becomes more pressing.

One solution would be the creation of a Semantic Advertising Contextual Corpus (SACC). This text corpus would be a collection of pages that provides lots and lots of cases of ambiguous keyword contexts. It should probably be broken down into different categories, such as travel or high-tech or consumer packaged goods (CPG).


The key idea of the corpus would be that a number of the sample pages in each category would be “traps” that purposely try to fool contextual/semantic advertising algorithms into accidentally placing the wrong ads in that location. For instance, in a travel category, you might have a number of harmless stories about great travel experiences and travelogues, but every so often, you’d have a page that talks about an airplane crash, or food poisoning on a cruise ship, or some other such disaster where advertisers would not want their ads to be included. These are negative branding scenarios to be avoided at all cost.

Aside from the worst cases of negative branding, there would also be opportunities for mistaken identity. For example, an anti-virus software company might want to advertise for “computer security”, but wouldn’t want to waste ad dollars appearing in contexts talking about national security, or home security systems, or loan security, etc., even if the word “computer” appears in proximity to “security” in those pages.
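To make the idea a bit more concrete, here is a minimal sketch of how a single labeled page in such a corpus might be represented. Everything here is my own assumption for illustration: the names `CorpusPage` and `TrapKind`, the fields, and the sample pages are hypothetical, not part of any existing standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TrapKind(Enum):
    """The two kinds of trap described above (hypothetical labels)."""
    NEGATIVE_BRANDING = "negative_branding"   # e.g. a cruise ad next to a food poisoning story
    MISTAKEN_IDENTITY = "mistaken_identity"   # e.g. "computer security" vs. national security


@dataclass(frozen=True)
class CorpusPage:
    page_id: str
    category: str                    # e.g. "travel", "high-tech", "CPG"
    keyword: str                     # the ad keyword this page is testing
    text: str
    trap: Optional[TrapKind] = None  # None means an ordinary page where the ad is welcome

    @property
    def ad_should_appear(self) -> bool:
        # An ad belongs on the page only when the page is not a trap.
        return self.trap is None


# Illustrative entries only; a real corpus would hold full pages.
sample_pages = [
    CorpusPage("travel-001", "travel", "cruise vacations",
               "A travelogue about a great week at sea."),
    CorpusPage("travel-002", "travel", "cruise vacations",
               "A report on food poisoning aboard a cruise ship.",
               TrapKind.NEGATIVE_BRANDING),
    CorpusPage("tech-001", "high-tech", "computer security",
               "A review of anti-virus software."),
    CorpusPage("tech-002", "high-tech", "computer security",
               "National security officials used a computer model.",
               TrapKind.MISTAKEN_IDENTITY),
]
```

The point of labeling each page against a specific keyword is that mistaken identity only makes sense relative to what the advertiser was trying to buy.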

You could then test several different advertising solutions in this corpus — apples-to-apples — and see how well they do.

Granted, since nothing is perfect when it comes to language, there would always be some margin of error for how well different algorithms perform. But if constructed properly, this test should give some genuinely useful and objective statistics about different contextual/semantic algorithms. Ideally, the results would be broken down across each category and separately for avoiding negative branding vs. mistaken identity.
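As a rough sketch of what that breakdown could look like, the snippet below computes accuracy per category and per case type. The function name `score_breakdown`, the tuple layout, and the sample results are all assumptions I made up for illustration; they are not real test data from any vendor.

```python
from collections import defaultdict


def score_breakdown(results):
    """Return accuracy for each (category, case_kind) pair.

    Each result is a tuple:
        (category, case_kind, ad_should_appear, ad_was_placed)
    where case_kind is "normal", "negative_branding", or "mistaken_identity".
    A decision counts as correct when the algorithm's placement matches the label.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, case_kind, ad_should_appear, ad_was_placed in results:
        key = (category, case_kind)
        totals[key] += 1
        if ad_should_appear == ad_was_placed:
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical decisions from one algorithm, for illustration only.
example_results = [
    ("travel", "normal", True, True),
    ("travel", "negative_branding", False, True),    # fell into the trap
    ("travel", "negative_branding", False, False),   # avoided the trap
    ("high-tech", "mistaken_identity", False, False),
]

breakdown = score_breakdown(example_results)
for (category, case_kind), accuracy in sorted(breakdown.items()):
    print(f"{category:10s} {case_kind:18s} {accuracy:.0%}")
```

Keeping the trap types separate matters because a single overall accuracy number would hide an algorithm that places ads well on ordinary pages but walks straight into the disasters.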

With granularity in the performance results, you would then be able to recognize algorithms that were particularly good at dealing with context for, say, travel-related advertising, while others might be tuned towards B2B advertising in high-tech. It’s quite likely that vertical ad networks would excel in their particular category — at least that’s what you’d expect.

This would be a way to quantify the accuracy of the different players.

Now, to do this right, the corpus would need to have updated versions — maybe once per quarter? — so that tests could be run “fresh”. This would have several benefits. First, it would make up for any statistical luck-of-the-draw that a given algorithm might have in its performance in one version of the test. Second, it would prevent vendors from “building to the test” to artificially boost their score. Third, it would show performance trends over time — both for individual vendors and the space as a whole.
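Here is a small sketch of how scores from successive corpus versions might be summarized for one vendor: the average smooths out luck-of-the-draw in any single version, and the first-to-last change gives a crude trend. The function name `summarize_versions`, the version labels, and the numbers are hypothetical placeholders, not measurements.

```python
from statistics import mean


def summarize_versions(scores_by_version):
    """Summarize one vendor's accuracy across corpus versions.

    scores_by_version: list of (version_label, accuracy) in chronological order.
    """
    accuracies = [accuracy for _, accuracy in scores_by_version]
    return {
        "mean_accuracy": mean(accuracies),                      # evens out a lucky or unlucky version
        "change_first_to_last": accuracies[-1] - accuracies[0],  # crude trend over time
    }


# Made-up scores for three quarterly versions of the corpus.
hypothetical_scores = [("v1", 0.80), ("v2", 0.85), ("v3", 0.90)]
summary = summarize_versions(hypothetical_scores)
```

A real analysis would want more than three data points and some measure of spread, but even this much would show whether a vendor is improving or just got lucky once.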

Such a corpus would also be useful to the developers working on these algorithms as a way to experiment with new ideas.

Admittedly, creating and maintaining this corpus would be a lot of work. Administering tests would be even more. But this could be a great opportunity for a university research group or an independent research firm to stake its claim as the authority in contextual and semantic advertising accuracy.
