Importance of data preparation when analysing written responses to open-ended questions: An empirical assessment and comparison with manual coding

Research output: Contribution to journal, Journal article, Research, peer-review

In a world where consumer texts grow more numerous each day, automated text analysis can deliver valuable insights about consumer attitudes and behaviours. The present research was methodological in nature and focused on pre-processing of text data, which is generally the most time-consuming stage of analysis. Using responses to an open-ended question from 4341 consumers, document-term matrices (DTMs) were created from varying combinations of n-grams (unigrams, bigrams, trigrams and combinations thereof), stemming (yes or no) and low-frequency term thresholding (retaining all terms or excluding those used < 0.1%, < 1% or < 5%). By comparison to a fixed standard (manually derived content coding of respondents’ answers), the relative impact of the three pre-processing steps was assessed. Partial least squares discriminant analysis (PLS-DA) was used to do so, and classifier performance was evaluated using AUC-ROC scores. Inclusion of bigrams and trigrams in DTMs did not influence classification performance, and stemming had only a minor impact. Inclusion of all and very rare features (< 0.1%) improved classification performance. The results were invariant to sample size and replicated in subsets of 2000, 1000 and 500 participants. The results may be specific to the short length of the answers (median words = 4), although they held in a sub-sample of the 500 longest answers (median words = 41). Future research should directly test the influence of these pre-processing steps, for example, through topic modelling.
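The three pre-processing choices named in the abstract (n-gram range, stemming, and low-frequency thresholding) can be illustrated with a minimal sketch in plain Python. This is not the authors' code, and the abstract does not say which software was used. The toy answers, the function names (`crude_stem`, `ngrams`, `build_dtm`) and the parameter `min_doc_share` are hypothetical. The sketch assumes the threshold is a share of documents containing the term, and it uses a very rough suffix stripper as a stand-in for a real stemmer. The PLS-DA and AUC-ROC steps are not covered.

```python
from collections import Counter

# Hypothetical toy answers for illustration; not data from the study.
ANSWERS = [
    "tastes good and healthy",
    "healthy and cheap",
    "good taste",
    "cheap",
]


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_dtm(docs, n_min=1, n_max=1, stem=False, min_doc_share=0.0):
    """Build a document-term matrix of term counts.

    Terms that occur in fewer than `min_doc_share` of the documents are
    dropped (assumption: the threshold is a share of documents).
    Returns the sorted vocabulary and one row of counts per document.
    """
    doc_counts = []
    for doc in docs:
        tokens = doc.lower().split()
        if stem:
            tokens = [crude_stem(t) for t in tokens]
        doc_counts.append(Counter(ngrams(tokens, n_min, n_max)))

    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for counts in doc_counts for term in counts)
    vocab = sorted(
        term for term, df in doc_freq.items() if df / len(docs) >= min_doc_share
    )
    matrix = [[counts[term] for term in vocab] for counts in doc_counts]
    return vocab, matrix


if __name__ == "__main__":
    # Compare vocabulary sizes across a few pre-processing combinations.
    for kwargs in (
        {},
        {"stem": True},
        {"n_max": 2},
        {"min_doc_share": 0.5},
    ):
        vocab, _ = build_dtm(ANSWERS, **kwargs)
        print(kwargs, len(vocab))
```

Each combination of settings yields a different matrix from the same answers, which is the kind of variation the study compares against the manual coding. In practice a library vectoriser and an established stemmer would likely replace the hand-written helpers.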

Original language: English
Article number: 104270
Journal: Food Quality and Preference
Volume: 93
Number of pages: 14
ISSN: 0950-3293
Publication status: Published - 2021

    Research areas

  • Low frequency threshold, Manual content analysis, N-grams, Open-ended questions, Pre-processing, Sample size, Stemming, Text mining

ID: 272576808