
In most qualitative research studies, such as focus groups or in-depth interviews, one of the challenges is transcribing the audio recordings into written text for analysis. The most common method so far is manual transcription, which is both costly and time-consuming. In this text, we explore the possibilities of using Artificial Intelligence (AI)…


The article discusses the challenges of transcribing non-English focus group and in-depth interview recordings and explores the possibilities of using Artificial Intelligence (AI) for speech-to-text transcription, since manual transcription is time-consuming and expensive. It recommends the Whisper model from OpenAI as the cheapest option and Microsoft Teams for online interviews. For recordings, personal interviews, and focus groups, it recommends MS Word 365 and Whisper AI. The article also emphasizes the importance of high-quality audio and provides tips for improving it.

MS Word for web (365)

MS Teams

  • a) Live transcription with the option to save the transcript (transcribing an existing recording is not supported)
  • b) Saving the audio/video recording; the file is deleted after some time (details:
    • EN:
    • Required license: Office 365 Enterprise E1, E3, E5, F3, A1, A3, A5, M365 Business, Business Premium, or Business Essentials.

Google Speech-to-Text API

  • Speech Recognition (without Data Logging – default): 0-60 Minutes – Free; Over 60 Minutes – $0.024 / minute
  • Not tested
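
To make the pricing above concrete, here is a minimal Python sketch of the cost calculation. It assumes the free 60 minutes apply once per billing period and that every further minute is billed at the quoted rate; the function and constant names are illustrative, not part of any Google library.

```python
FREE_MINUTES = 60          # free allowance quoted above (assumed per billing period)
PRICE_PER_MINUTE = 0.024   # USD per minute beyond the free allowance


def google_stt_cost(audio_minutes: float) -> float:
    """Estimate the cost in USD of transcribing the given minutes of audio."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    # Only the minutes above the free allowance are billable.
    billable = max(0.0, audio_minutes - FREE_MINUTES)
    return round(billable * PRICE_PER_MINUTE, 2)


if __name__ == "__main__":
    for minutes in (45, 90, 600):
        print(f"{minutes} min -> {google_stt_cost(minutes):.2f} USD")
```

Under these assumptions, a single 90-minute focus group would cost well under one dollar, and ten hours of recordings would cost roughly 13 USD.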

Web services

  • None of the services produced a convincing transcription in Czech

  • 180 minutes/10 USD, 990 minutes/49 USD
  • credit (pay as you go, not monthly payment)

  • 0.02 USD/min
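
The prices above are easier to compare when converted to a price per minute. The following Python sketch does that conversion; the offer labels are placeholders chosen for illustration, since they only describe the pricing and are not service names.

```python
def price_per_minute(price_usd: float, minutes: float) -> float:
    """Convert a credit package (price for a number of minutes) to USD per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return price_usd / minutes


# Placeholder labels; the figures are the ones listed above.
offers = {
    "180-minute credit": price_per_minute(10, 180),
    "990-minute credit": price_per_minute(49, 990),
    "flat per-minute rate": 0.02,
}

if __name__ == "__main__":
    # Print the offers from cheapest to most expensive per minute.
    for label, rate in sorted(offers.items(), key=lambda item: item[1]):
        print(f"{label}: {rate:.4f} USD/min")
```

Note that a credit package is paid in full up front, so its effective price per minute is higher if the credit is not used up.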

Google Recorder (not tested)

  • available on Pixel phones; newer phones can save transcriptions to the cloud

Whisper model – custom installation

  • advantages: fast transcription, free
  • disadvantages: not fully accurate, so the output requires corrections; no speaker identification (diarization)
  • the missing speaker identification (diarization) can be worked around with additional modifications
  • for reference: transcribing a 10-minute conversation takes 6 minutes of computational time (on Google hardware), but this should fit within the free tier
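
As an illustration of what a custom installation can look like, here is a minimal Python sketch that uses the open-source `openai-whisper` package (which also needs `ffmpeg` installed). The model size `"small"`, the language code `"cs"`, and the helper names are assumptions chosen for this example, not the setup used for the tests described above. The time estimate simply scales the 10-minutes-to-6-minutes figure quoted above and will differ on other hardware.

```python
def format_timestamp(seconds: float) -> str:
    """Format a position in seconds as HH:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments) -> str:
    """Turn Whisper-style segments into one timestamped line per segment."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    )


def estimate_compute_minutes(audio_minutes: float) -> float:
    """Scale the reference figure above: 10 minutes of audio took 6 minutes."""
    return audio_minutes * 6 / 10


def transcribe(audio_path: str, model_name: str = "small", language: str = "cs") -> str:
    """Transcribe an audio file and return timestamped text (no speaker labels)."""
    import whisper  # imported here so the helpers above work without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language=language)
    return format_segments(result["segments"])
```

Calling `transcribe("interview.mp3")` (a hypothetical file name) would return text with one timestamped line per segment, which still needs manual correction and speaker labels.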

Quality audio is required

Pay attention to:

  • voices not overlapping,
  • letting respondents finish their sentences,
  • refraining from loudly expressing understanding to the respondent – sticking only to non-verbal expressions (nodding to indicate understanding), even though it can be difficult.


For online interviews:

  • MS Teams

For recordings/personal interviews/focus groups/speech-to-text transcription:

  • MS Word 365
  • Whisper AI – custom installation
  • try to have the best possible audio quality.

Credit: Concept written by human, text written by human, translated to English by AI/ChatGPT, summary by AI/ChatGPT