Overview of options for transcribing non-English language focus group and in-depth interview recordings using AI (Artificial Intelligence)

In most qualitative research studies, such as focus groups or in-depth interviews, one of the challenges is transcribing the audio recordings into written text for analysis. The most common method so far is manual transcription, which is both costly and time-consuming. In this text, we explore the possibilities of using Artificial Intelligence (AI)…

Summary

The article discusses the challenges of transcribing non-English language focus group and in-depth interview recordings and explores the possibilities of using Artificial Intelligence (AI) for transcription. Manual transcription is time-consuming and expensive, and AI speech-to-text transcription is an alternative. The article recommends the Whisper model from OpenAI as the cheapest option and Microsoft Teams for online interviews. For transcribing recordings of personal interviews and focus groups, it recommends MS Word 365 and Whisper. The article also emphasizes the importance of high-quality audio and provides tips for improving audio quality.

MS Word for web (365)

Transcription of recordings (speech to text) / live transcription
300 minutes per month
Required license: Microsoft 365 (verification needed, subject to change)
Info: https://support.microsoft.com/en-us/office/transcribe-your-recordings-7fc2efec-245e-45f0-b053-2a97531ecf57
Supported languages: see link above

MS Teams

a) Live transcription with the transcript saved afterwards (transcribing an existing recording is not supported)
- For customers with the following licenses: Office 365 E1, Office 365 A1, Office 365/Microsoft 365 A3, Office 365/Microsoft 365 A5, Microsoft 365 E3, Microsoft 365 E5, Microsoft 365 F1, Office 365/Microsoft 365 F3, Microsoft 365 Business Basic, Microsoft 365 Business Standard, Microsoft 365 Business Premium SKU.
- Info: EN: https://support.microsoft.com/en-us/office/record-a-meeting-in-teams-34dfbe7f-b07d-4a27-b4c6-de62f1348c24
b) Saving the audio/video recording; the file is deleted after some time. Details:
- EN: https://support.microsoft.com/en-us/office/record-a-meeting-in-teams-34dfbe7f-b07d-4a27-b4c6-de62f1348c24
- Required license: Office 365 Enterprise E1, E3, E5, F3, A1, A3, A5, M365 Business, Business Premium, or Business Essentials.

Google Speech-to-Text API

https://cloud.google.com/speech-to-text#section-12
Speech Recognition (without Data Logging – default): 0-60 Minutes – Free; Over 60 Minutes – $0.024 / minute
Not tested
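
As a rough illustration of the pricing quoted above, the following Python sketch estimates the bill for a given amount of audio. It assumes that the free 60 minutes apply once per billing period and that every further minute costs the same flat rate; the function name is made up for this example, and current terms should be checked on the pricing page.

```python
FREE_MINUTES = 60          # free allowance quoted above (assumed to apply once per billing period)
PRICE_PER_MINUTE = 0.024   # USD per minute beyond the free allowance, as quoted above


def google_stt_cost(minutes: float) -> float:
    """Estimated cost in USD for `minutes` of audio in one billing period."""
    billable = max(0.0, minutes - FREE_MINUTES)
    return round(billable * PRICE_PER_MINUTE, 2)


if __name__ == "__main__":
    # Example: eight one-hour interviews transcribed in the same billing period.
    print(google_stt_cost(8 * 60))
```

Under these assumptions, recordings totalling an hour or less cost nothing, and only the minutes beyond that are charged.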

Web services

None of the services below produced a convincing Czech transcription.

https://speechtext.ai

180 minutes/10 USD, 990 minutes/49 USD
credit (pay as you go, not monthly payment)

https://www.rev.ai/pricing

0.02 USD/min
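
To compare the prepaid speechtext.ai packages with the per-minute rev.ai rate, the prices can be converted to a common unit. The sketch below does only that arithmetic, using the prices quoted in this article (which may have changed); the variable and function names are illustrative, and price says nothing about transcription quality.

```python
# Prices as quoted in this article: (price in USD, minutes included).
SPEECHTEXT_PACKAGES = {
    "speechtext.ai 180 min": (10.0, 180),
    "speechtext.ai 990 min": (49.0, 990),
}
REV_AI_PER_MINUTE = 0.02  # USD per minute, as quoted above


def per_minute(price_usd: float, minutes: int) -> float:
    """Effective price per minute of a prepaid package, in USD."""
    return price_usd / minutes


def print_comparison() -> None:
    """Print the effective per-minute price of each option."""
    for name, (price, minutes) in SPEECHTEXT_PACKAGES.items():
        print(f"{name}: {per_minute(price, minutes):.4f} USD/min")
    print(f"rev.ai: {REV_AI_PER_MINUTE:.4f} USD/min")


if __name__ == "__main__":
    print_comparison()
```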

Google Recorder (not tested)

Available on Pixel phones; newer phones can save the transcription to the cloud.

OpenAI Whisper model – custom installation

Advantages: fast transcription, free
Disadvantages: not fully accurate, so the transcript requires corrections; no speaker identification (diarization)
Speaker identification (diarization): this limitation can be worked around through additional modifications
For reference: transcribing a 10-minute conversation takes about 6 minutes of computational time (on Google hardware), which should fit into the free tier
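
A minimal sketch of what such a custom installation might look like in Python is shown below. It is not the exact setup used for this article. It assumes the open-source openai-whisper package (pip install openai-whisper) with ffmpeg available on the system. The model size "small", the language code "cs" (Czech), and all helper names are illustrative choices. The 0.6 ratio is derived only from the single measurement quoted above and will vary with hardware and model size.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a number of seconds to HH:MM:SS (fractions are dropped)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_text(segments) -> str:
    """Render Whisper segments as one timestamped line per segment.

    There are no speaker labels here: Whisper itself does not do diarization.
    """
    return "\n".join(
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    )


def estimate_compute_minutes(audio_minutes: float, ratio: float = 0.6) -> float:
    """Rough processing time, assuming 6 minutes of compute per 10 minutes of audio."""
    return audio_minutes * ratio


def transcribe(path: str, language: str = "cs", model_name: str = "small") -> str:
    """Transcribe an audio file and return a timestamped transcript."""
    import whisper  # imported here so the helpers above work without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language)
    return segments_to_text(result["segments"])
```

With the package installed, calling transcribe("interview.mp3") (a hypothetical file name) would return the transcript as plain text, which still needs manual correction and speaker labelling.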

Quality audio is required

For online interviews, I strongly recommend headphones and a microphone; any are better than none.
For in-person interviews, if a professional studio for group interviews is not available, very good recommendations can be found at this link: https://www.indianscribes.com/4-ways-to-improve-focus-group-recordings/

Pay attention to:

voices not overlapping,
letting respondents finish their sentences,
refraining from loudly expressing understanding to the respondent – sticking only to non-verbal expressions (nodding to indicate understanding), even though it can be difficult.

Conclusions:

For online interviews:

MS Teams

For recordings/personal interviews/focus groups/speech-to-text transcription:

MS Word 365
Whisper AI – custom installation
In all cases, aim for the best possible audio quality.

Credit: Concept written by human, text written by human, translated to English by AI/ChatGPT, summary by AI/ChatGPT