Importance of data preparation when analysing written responses to open-ended questions: An empirical assessment and comparison with manual coding

Research output: Contribution to journal › Journal article › Research › peer-review

Sara R. Jaeger
Morten Arendt Rasmussen

As the volume of consumer-generated text grows each day, automated text analysis can deliver valuable insights into consumer attitudes and behaviours. The present research was methodological in nature and focused on the pre-processing of text data, which is generally the most time-consuming stage of analysis. Using responses to an open-ended question from 4341 consumers, document-term matrices (DTMs) were created from varying combinations of n-grams (unigrams, bigrams, trigrams and combinations thereof), stemming (yes or no) and low-frequency term thresholding (retaining all terms or excluding those used <0.1%, <1% or <5%). The relative impact of the three pre-processing steps was assessed by comparison with a fixed standard: manually derived content coding of respondents’ answers. Partial least squares discriminant analysis (PLS-DA) was used for this purpose, and classifier performance was evaluated using area under the receiver operating characteristic curve (AUC-ROC) scores. Including bigrams and trigrams in the DTMs did not influence classification performance, and stemming had only a minor impact. Retaining all features, including very rare ones (<0.1%), improved classification performance. The results were invariant to sample size and replicated in subsets of 2000, 1000 and 500 participants. The results may be specific to the short length of the answers (median words = 4), although they held in a sub-sample of the 500 longest answers (median words = 41). Future research should directly test the influence of these pre-processing steps, for example, through topic modelling.
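The three pre-processing choices described in the abstract (n-gram range, stemming, low-frequency thresholding) can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the abstract does not state which software was used. The names `build_dtm`, `ngrams` and `crude_stem` are hypothetical, tokenization by whitespace is an assumption, `crude_stem` is a toy suffix stripper standing in for a real stemmer, and the threshold is interpreted here as the share of responses in which a term appears. The PLS-DA classification and AUC-ROC evaluation steps are not shown.

```python
from collections import Counter


def crude_stem(token):
    # Toy suffix stripper (assumption): a stand-in for a real stemmer,
    # used only to show where stemming sits in the pipeline.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n_values):
    # Return every n-gram (tokens joined by a space) for each n in n_values.
    out = []
    for n in n_values:
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def build_dtm(responses, n_values=(1,), stem=False, min_doc_share=0.0):
    # Build a document-term matrix: one row per response, one column per term.
    docs = []
    for text in responses:
        tokens = text.lower().split()  # whitespace tokenization (assumption)
        if stem:
            tokens = [crude_stem(t) for t in tokens]
        docs.append(Counter(ngrams(tokens, n_values)))

    # Document frequency: number of responses using each term at least once.
    doc_freq = Counter(term for doc in docs for term in doc)

    # Drop terms used in fewer than min_doc_share of all responses.
    min_docs = min_doc_share * len(docs)
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_docs)

    matrix = [[doc[t] for t in vocab] for doc in docs]
    return vocab, matrix


# Made-up example responses, not data from the study.
responses = ["tastes fresh", "fresh and healthy", "healthy snacks", "too sweet"]

vocab_all, dtm_all = build_dtm(responses)                       # unigrams, all terms
vocab_bi, dtm_bi = build_dtm(responses, n_values=(1, 2))        # unigrams + bigrams
vocab_thr, dtm_thr = build_dtm(responses, min_doc_share=0.5)    # terms in >= 50% of responses
vocab_stem, dtm_stem = build_dtm(responses, stem=True)          # with toy stemming
```

Each resulting matrix could then be passed to a classifier and scored against the manual codes, which is the comparison the abstract reports; varying `n_values`, `stem` and `min_doc_share` reproduces the kind of grid of pre-processing settings the study compared.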

Original language	English
Article number	104270
Journal	Food Quality and Preference
Volume	93
Number of pages	14
ISSN	0950-3293
DOIs	https://doi.org/10.1016/j.foodqual.2021.104270
Publication status	Published - 2021

Research areas

Low frequency threshold, Manual content analysis, N-grams, Open-ended questions, Pre-processing, Sample size, Stemming, Text mining

ID: 272576808