What are we counting?

People keep asking me what I think about quality metrics, the audience research system that Arts Council England (ACE) will shortly require its largest National Portfolio Organisations (NPOs) to use.

When I try to answer this complex question, many immediately tell me they were asking confidentially and don’t want their own views known. I hear a lot of reservations and many worries, but everyone seems reluctant to say anything during the current NPO application process.

Whilst understandable, this reluctance is not helpful. It is surely essential to embark on a proper discussion of whether quality metrics will deliver reliable results for NPOs and ACE, and to address people’s concerns.

Uneasy questions

I have been a champion of audience data for a long time. I conducted my first year-long audience survey at the Vic in Stoke-on-Trent in 1969, supervised by Keele University. I have been commissioning research surveys for over 40 years and the Arts Council published my book ‘Boxing Clever’ on turning data into audiences in 1993. And I have collaborated with the Arts Council on many audience initiatives, including the drive to place socio-economic profiling tools at its NPOs’ fingertips.

So, I ought to be welcoming the concept of quality metrics and what Culture Counts proposes to deliver for Arts Council England. I can see why Marcus Romer (read his blog from 27 September) would welcome the voice of the audience, as end-recipient of the art, into ACE thinking. But I am left with a lot of uneasy questions, mostly methodological.

Unreliable research

Most people with any knowledge of research methodology are asking the same questions, because this type of research is inherently unreliable, yet a lot of reliance is being placed on the findings.

The Arts Council’s own former Senior Marketing Officer, Peter Verwey, constantly reminded arts marketers of the inherent unreliability of audience surveys, unless there were controls to manage the sample. Even then, reliability depends on respondents understanding the questions. If you ask a question and the respondent can’t ask for clarification on what the question means, then the answers can’t be relied upon. But if explanations are given, then bias creeps in, depending on what is said to them.

At the Arts Council of Wales, we used Beaufort Research to check respondents’ understanding of some simple questions about the arts, including: “When did you last attend an opera?” Sadly for Welsh National Opera, the majority of those who said when and where they had seen an opera turned out not to have attended an opera at all. The public have a very different understanding of the words we use to discuss the arts, and this can have a significant impact on how, and whether, survey questions are answered.

This is an inevitable drawback of quantitative research. Researchers have to decide in advance what precise questions to ask and have to constrain answers to a fixed choice. Qualitative write-in answers can’t produce reliable, comparable results, even though narrative answers can provide the richest source of our understanding of what a specific audience member thought.

Biased responses

Audience surveys have other equally large flaws. Peter Verwey’s joke was that the survey samples usually comprised “anyone who had a working pen/pencil when the survey was handed out”, though that has presumably changed to whether people have an email address and bother to open survey emails.

Surveys conducted in foyers after performances are inherently biased in that they capture only those with time to answer. And even “there is an app for that” only suits the tech savvy.

Analysis over the years shows that completion is biased in favour of the most supportive members of the audience and those keen to make their views known, sometimes complainants. You can overcome some of this by ruthless random sampling – only looking at the feet of the people to be selected to answer the questionnaire, for example – and similar techniques can be applied online. But the bias in who actually responds when invited remains.

These days, when we are capable of creating a socio-economic profile of attenders who book tickets, we ought to, as a minimum, be expecting the quality metrics methodology to include a check for the representativeness of the sample.
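As a rough illustration of what such a check could look like, here is a minimal sketch of a chi-square goodness-of-fit test comparing survey respondents with the profile of ticket bookers. The segment names, shares and counts are invented for the example and are not drawn from any quality metrics data; nothing here describes how Culture Counts actually processes responses.

```python
# Hypothetical check: do survey respondents match the booker profile?
# All segment names and numbers below are made up for illustration.

# Chi-square critical values at the 5% level, keyed by degrees of freedom.
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_stat(respondent_counts, booker_shares):
    """Chi-square statistic of observed respondent counts against the
    counts expected if respondents mirrored the booker profile."""
    total = sum(respondent_counts.values())
    stat = 0.0
    for segment, share in booker_shares.items():
        expected = share * total
        observed = respondent_counts.get(segment, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Share of each (hypothetical) segment among all ticket bookers.
booker_shares = {"segment_a": 0.5, "segment_b": 0.3, "segment_c": 0.2}
# Number of survey respondents falling into each segment.
respondent_counts = {"segment_a": 90, "segment_b": 40, "segment_c": 20}

stat = chi_square_stat(respondent_counts, booker_shares)
df = len(booker_shares) - 1
if stat > CRITICAL_5PCT[df]:
    print(f"chi-square {stat:.2f}: respondents differ from the booker profile")
else:
    print(f"chi-square {stat:.2f}: no evidence of a mismatch at the 5% level")
```

A test like this only covers the characteristics that are profiled. It says nothing about whether the people who respond hold different opinions from those who do not, which is the bias that remains.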

Incomparable performances

Perhaps the biggest challenge is that audience surveys are inherently situational. They can only reflect what happened in a particular venue on a specific date and time, and gather the opinions of the people who both attended and chose to respond. If you have ever been a house manager and experienced the difference that a large group booking can make in an audience, you will understand the potential variability.

This makes comparability from one performance – let alone event – to another very difficult. Researchers have known about these issues for decades and therefore attempts to measure or assess impact based on audience surveys are always approached with huge caution, even if conducted for a single venue or a single performance.

Unexplored impact

Arts Council England’s own 2014 literature review by WolfBrown was clear about this: “the literature raises questions as to the plausibility of aggregating survey data across organisations and artforms, due to the highly personal and situational nature of impact, and because of differences across the forms themselves.”

It can be argued that valuing art based primarily on the experiences it produces in fact devalues the work itself. Can you really tick a box to encompass your opinion? Indeed, post-event surveys primarily measure the ‘experienced impacts’, perhaps within a day or so, and ignore the ‘extended impacts’, probably weeks or even years later (typically assessed through retrospective interviewing and longitudinal tracking studies).

And while we try to understand these impacts on each individual, what role did pre-attendance marketing, the venue, pre-show talks, the people who attended with them, and the rest of the audience play in the experience? Some researchers have expressed serious concerns about comparing self-reported audience experiences across different artforms and contexts because of the huge range of impossible-to-control variables being measured in what are, in effect, crowd-sourced reviews.

Flawed evaluation

I had expected recent reports, commissioned by ACE, to provide the answers. I was surprised to find the report on the quality metrics national test was assessed and written by two staff from the company that ran the pilot scheme, John Knell and Alison Whitaker. So the researchers were being asked to mark their own homework. Highly unusual, regardless of their integrity.

The Arts Council did commission an independent evaluation, though this only examined the experience of the organisations participating in the National Test, and not the methodology used in the pilot or the internal processing of the resultant data. Nonetheless, that evaluation reported some serious concerns raised by participant organisations about the methodology, saying “the majority of consultees questioned the reliability of the resulting data because of the sample frame, in terms of its representation and size” and commenting that “this aspect evidently impacted the organisations’ use of the data, with organisations unconfident to draw any firm conclusions, unable to ‘convince’ programmers of its value, and unsure of what ‘robust’ would look like in practice.” It went on to say “consultees suggested a number of areas where unintended bias or skewed data had the potential to be introduced. It is evident that these elements contributed to consultees’ overall opinion that the resulting data did not accurately reflect the quality of their work.”

Knell and Whitaker’s report makes no reference to statistical significance or reliability, or the representativeness of audiences; and despite references to “highly sensitive aggregation” there is no explanation of the basis for that data aggregation, except for some crude data-merging by geo-location, artform and gender. It’s impossible to discern how they have overcome the huge problems of situational audience surveys and event comparability.

There is also no explanation of how audience responses are related to the other elements of the triangulated quality metrics research process, namely peer responses and internal assessments. Neither is there an indication of how respondents were selected. Indeed, the report is more about the findings of the surveys than testing the reliability of the methodology or its underlying fitness-for-purpose and statistical reliability.

Sampling problems

Obviously it is easiest to select survey respondents from ticket bookers with email addresses, and some of the organisations that participated in the pilot research chose people with particular characteristics, or a certain frequency of attendance. Some indicated that they wanted to input the data findings into their CRM systems. Did they select target samples accordingly?

Some added extra questions of their own to the survey, which in themselves might have affected understanding, response rates and completion. There is no explanation of how these additional questions were tested for respondent understanding. Also, only ‘30 responses’ is cited as an acceptable minimum for an event to be evaluated. How does this relate to the total attendance? There is no rationale given for this low number and no indication of how an event with 30 survey respondents will be compared with an event with 300.
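To give a sense of scale for the 30-versus-300 question, the sketch below applies the standard textbook margin-of-error formula for a proportion. It is an illustration only: it assumes a simple random sample, which is exactly what cannot be taken for granted here, and it treats a response as a yes/no proportion rather than the rating scales the metrics use. The function name and the audience size of 100 are my own assumptions.

```python
import math


def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    n          -- number of survey responses
    population -- total attendance, if known (applies the finite
                  population correction)
    p          -- assumed proportion; 0.5 gives the widest margin
    z          -- 1.96 corresponds to 95% confidence
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None and population > n:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error


for responses in (30, 300):
    moe = 100 * margin_of_error(responses)
    print(f"{responses} responses: about +/- {moe:.1f} percentage points")

# With a hypothetical audience of 100, 30 responses cover a large share
# of attenders, so the correction narrows the margin somewhat.
small_house = 100 * margin_of_error(30, population=100)
print(f"30 of 100 attenders: about +/- {small_house:.1f} percentage points")
```

Even under these generous assumptions, 30 responses give a margin of roughly 18 percentage points against roughly 6 for 300, so results from two such events are hardly like for like. Any self-selection among respondents adds bias on top, which no margin-of-error figure captures.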

What’s more, there is no indication of how any of this will be possible under the new General Data Protection Regulation and its specific granular consent regime, which could further reduce the number of attenders available for survey and the use and processing of their responses.

The better news is that over 19,000 surveys were completed in the national test. This is clearly a large sample in UK terms, but size is not enough, especially when the integrity of the sample is unclear. We can’t rely on the national sample size if we need local reliability. We need to understand the reliability of the findings for each individual organisation in their unique catchment area. And we need to know the profile of the survey respondents in the context of both the universe of NPO attenders, and the actual attenders at each individual organisation.

Finally, there are of course other providers of post-attendance survey tools, and arts organisations already carrying out frequent surveys of attenders are worried about wear-out from over-surveying core attenders. All their other surveys are intended to understand audiences better and guide marketing, operational and audience development issues, not inform critical ACE grant award decisions.

I write this because I find ACE has a lot of questions to answer if it is to reassure arts organisations about the methodology and the quality of its proposed metrics. Just what is it counting, and exactly how?

Roger Tomlinson is an internationally recognised arts marketing and management consultant, now “mostly retired”.

This article is Roger’s personal opinion and does not reflect the views of his colleagues or any other organisations.

This article was updated on 1 June 2017 with the agreement of Roger Tomlinson, in response to Arts Council England’s assertion that, for clarification, his article should mention the independent evaluation report they commissioned.
