WhisperX Transcription + Diarization Audio Processing for Researchers

This repository contains a Jupyter notebook for qualitative researchers to transcribe, diarize speakers, and convert audio or video files into several text formats (csv, txt, json, and vtt). The notebook uses the transcription and diarization capabilities provided by Whisper and WhisperX, together with the pyannote speaker-diarization-3.1 and segmentation-3.0 models hosted on Hugging Face*.

*A free Hugging Face token is required specifically for the diarization aspects. The code will not work without it.

The code is derived and built from the following Medium article

To be clear, I am a VERY novice programmer, and much of this work has been done in collaboration with ChatGPT. I am a PhD student with a focus on equity in STEM education and educational technology, and I am always in need of better ways to transcribe the large amounts of audio and video data that we collect. I have recently been doing a lot of work with natural language processing tools for text analytics and pattern detection (Laura K. Nelson, Computational Grounded Theory), so I have fallen down the rabbit hole of producing Jupyter notebooks and decided to make one for WhisperX transcription. I am also a gamer, so I happen to have an Nvidia 3090 GPU in my home PC as well as a 4090 GPU in our research lab.

The tools I used before were difficult to work with and didn't produce output in the format I wanted. So one of my side projects has been redesigning the WhisperX code from the article above so that it is much more useful for generating transcriptions for researchers like me and the colleagues I work with. I'm certain there are still improvements to be made, but it has worked for us thus far.

This means I wanted the ability to:

Batch-transcribe audio files found in multiple subdirectories.
Take advantage of WhisperX's word-level timestamps.
Use pyannote's speaker diarization capabilities.
Generate csv, txt, json, and vtt files for each audio file transcribed.
Anonymize specific names and places during transcription.

Example CSV output

What This Code Does

Device and Configuration Setup: Sets up the device (GPU or CPU) and other configuration variables like batch size, compute type, and model type.
Library Imports: Imports necessary libraries including PyTorch, WhisperX, and others for handling audio files, text processing, and file I/O.
Path and File Type Setup: Defines paths to your audio files and output directories and specifies the types of audio files to process.
Pseudonym Loading: Loads a CSV file containing pseudonyms for anonymizing transcripts.
Audio Processing Functions: Includes functions to find audio files, get file modification dates, anonymize text, convert segments to different formats, and process each audio file.
Main Function Execution: Finds all audio files in the specified directory, processes them, and saves the transcripts in multiple formats (CSV, TXT, JSON, VTT).
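As a rough illustration of the device and configuration step in the list above, the sketch below picks settings based on whether a CUDA GPU is available. The function names `choose_config` and `cuda_is_available`, and the specific values (batch sizes, `float16` versus `int8`, the model names), are assumptions chosen for illustration and are not taken from the notebook.

```python
def choose_config(cuda_available: bool) -> dict:
    """Return hypothetical transcription settings for the available hardware."""
    if cuda_available:
        # Assumed GPU defaults: half precision and a larger batch.
        return {"device": "cuda", "batch_size": 16,
                "compute_type": "float16", "model_type": "large-v2"}
    # Assumed CPU fallback: 8-bit weights, smaller batch, smaller model.
    return {"device": "cpu", "batch_size": 4,
            "compute_type": "int8", "model_type": "base"}


def cuda_is_available() -> bool:
    """Check for a CUDA GPU; a missing PyTorch install counts as 'no GPU'."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


config = choose_config(cuda_is_available())
print(config)
```

Keeping the choice in one small function makes it easy to see, and to change, which settings depend on the hardware.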

How to Use This Code

1. Set Up Your Environment

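Install Required Libraries: Make sure you have all the required libraries installed. The os, gc, datetime, and json modules ship with Python and do not need to be installed. The third-party packages can be installed using pip (the webvtt module is published on PyPI as webvtt-py):
```
pip install pandas torch whisperx webvtt-py srt
```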
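One way to confirm the environment is ready before opening the notebook is to check which modules Python can locate. This is only a sketch: the helper name `missing_modules` is hypothetical, and the list uses import names, which may differ from the names used with pip.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed for the third-party libraries this README mentions.
third_party = ["pandas", "torch", "whisperx", "webvtt", "srt"]
print("Missing:", missing_modules(third_party))
```

An empty list means every module was found; anything listed still needs to be installed.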

2. Configure Paths and Settings

Paths: Replace the placeholder paths with your actual directories:

base_dir = 'Data/RawAudioFiles_Inputs'       # Path to your main folder containing subfolders with audio files
output_base_dir = 'Data/Trancripts_Outputs'  # Path to the folder where you want to save the transcripts

3. Audio File Types

Update the file types if your audio files are in different formats:

file_type1 = '.wav'
file_type2 = '.mp3'
file_type3 = '.ogg'
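To show how these file types can drive a search through subfolders, here is a minimal sketch using only the standard library. The function name `find_audio_files` and the case-insensitive matching are assumptions; the notebook's own function may differ.

```python
import os


def find_audio_files(base_dir, extensions=(".wav", ".mp3", ".ogg")):
    """Walk base_dir recursively and return sorted paths with a matching extension."""
    wanted = tuple(ext.lower() for ext in extensions)
    matches = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            # Compare in lowercase so 'INTERVIEW.WAV' is also picked up.
            if name.lower().endswith(wanted):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Prints nothing if the folder does not exist yet.
for path in find_audio_files("Data/RawAudioFiles_Inputs"):
    print(path)
```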

4. Prepare Pseudonyms CSV

Pseudonyms CSV: Ensure you have a CSV file named pseudonyms.csv in the data directory. This file should contain columns name and pseudonym for anonymizing the transcripts.
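As an illustration of how such a file could be applied, the sketch below reads the name and pseudonym columns and replaces whole-word matches, ignoring case. The function names `load_pseudonyms` and `anonymize_text`, the sample rows, and the matching rules are all assumptions; the notebook may load the file with pandas and match names differently.

```python
import csv
import io
import re


def load_pseudonyms(csv_file):
    """Read rows with 'name' and 'pseudonym' columns into a dict."""
    reader = csv.DictReader(csv_file)
    return {row["name"].strip(): row["pseudonym"].strip() for row in reader}


def anonymize_text(text, pseudonyms):
    """Replace each whole-word name with its pseudonym, ignoring case."""
    if not pseudonyms:
        return text
    # Longest names first so a multi-word name wins over a shorter overlap.
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): alias for name, alias in pseudonyms.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


# Made-up sample data standing in for pseudonyms.csv.
sample_csv = io.StringIO(
    "name,pseudonym\nAlice,Participant A\nSpringfield High,School 1\n"
)
mapping = load_pseudonyms(sample_csv)
print(anonymize_text("alice teaches at Springfield High.", mapping))
```

To read a real file instead of the in-memory sample, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `load_pseudonyms`.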

5. Execute the set-up code

The main function finds all audio files in the specified directory, processes them, and saves the transcripts. To run the code, execute the notebook cells in order.

6. Execute the transcription and diarization functions

7. Check the Outputs

Output Files: The transcripts will be saved in the specified output directory in multiple formats: CSV, TXT, JSON, and VTT.
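To give a feel for what the VTT and CSV outputs contain, here is a minimal sketch that converts a list of segments into both formats with the standard library. It assumes each segment is a dict with `start`, `end`, `text`, and an optional `speaker` key; the function names, the column order, and the `UNKNOWN` fallback label are hypothetical and may not match the notebook's actual output.

```python
import csv
import io


def format_vtt_time(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments):
    """Build a WebVTT document with one cue per segment, tagged by speaker."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"{format_vtt_time(seg['start'])} --> {format_vtt_time(seg['end'])}")
        lines.append(f"<v {speaker}>{seg['text'].strip()}")
        lines.append("")
    return "\n".join(lines)


def segments_to_csv(segments):
    """Build CSV text with one row per segment."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start", "end", "speaker", "text"])
    for seg in segments:
        writer.writerow([seg["start"], seg["end"],
                         seg.get("speaker", "UNKNOWN"), seg["text"].strip()])
    return buffer.getvalue()


# Made-up example segment for illustration.
example = [{"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00", "text": " Hello there."}]
print(segments_to_vtt(example))
print(segments_to_csv(example))
```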

Conclusion

This code is designed to make it easy to process and transcribe large batches of audio or video files while ensuring anonymity through pseudonymization. Happy transcribing!
