I have had increasing opportunities in recent years to work with video interview transcripts created by video conferencing software. These come out with a lot of inaccuracy, which I think is best managed by close checking. What I find more annoying is the clutter: names, time stamps, and sentences split across lines, because a new time stamp is added every time someone pauses. The sample screenshot above has been altered to substitute a selection of professional cyclists' names for the real meeting attendees; only in my imagination would I be able to convene this group on a video call to chat about the weather! I cleaned a few of these by hand, which took several hours, before starting to explore other options. I estimate that starting from a moderately inaccurate text file saves roughly 1/3 to 2/3 of the time compared to typing a transcript from scratch, but the keystrokes needed to delete metadata - times and names - and move text keep this from being optimally efficient.

Both Zoom and Teams save transcript or caption files as .vtt by default. If you save captions locally and are not the meeting manager, however, you might end up with a .txt file. If you do have the file as a .vtt, a really fast option is this online .vtt cleaner. The downside of the cleaner is that you end up with no spacing, just a continuous paragraph of text. If your aim is to check this against a recording, it will give you a pretty good start, although you cannot see where one speaker ends and the next begins, so adding that spacing and formatting back in may add to the cleanup time.

I was pretty sure Python was the way to go, and I reached out to statistical consulting at my university. A super helpful consultant provided me with code in Google Colaboratory. This was pretty nice because it allowed me to run Python without really understanding what I was doing or having to run the program directly myself.
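For readers who want to try scripting this step themselves, a minimal sketch of the cue-stripping is below. It assumes Zoom/Teams-style cues where each caption line is prefixed "Speaker Name: ..." (some .vtt files use `<v Speaker>` voice tags instead, which this does not handle), and unlike the online cleaner it keeps a blank line and the name at each change of speaker. The function name and sample layout are my own illustration, not the consultant's code.

```python
import re

def clean_vtt(vtt_text):
    """Strip the WEBVTT header, cue numbers, and timestamp lines,
    keeping a blank line and the speaker's name at each change of turn."""
    lines = []
    last_speaker = None
    for line in vtt_text.splitlines():
        line = line.strip()
        # Skip the header, blank lines, cue numbers, and timestamp lines
        if not line or line == "WEBVTT" or line.isdigit() or "-->" in line:
            continue
        # Zoom/Teams cues typically look like "Speaker Name: spoken text"
        match = re.match(r"^([^:]+):\s*(.*)", line)
        if match:
            speaker, text = match.group(1), match.group(2)
        else:
            # Continuation line with no name prefix: same speaker as before
            speaker, text = last_speaker, line
        if speaker != last_speaker:
            lines.append("")             # blank line between turns
            lines.append(f"{speaker}:")  # keep the name once per turn
            last_speaker = speaker
        lines.append(text)
    return "\n".join(lines).strip()
```

Note the sketch is naive about colons inside the speech itself; a stray "the time: 3pm" on its own line would be misread as a new speaker, so the output still wants a quick proofread.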
Another advantage of the Colaboratory solution was that, provided I specified how the files were organized, I could clean multiple files at once - as many as I put in the designated folder. The first thing the Google Colab code did was transform the .vtt (which is essentially a plain text file) into a data frame; once I saw that, it all fell into place for me conceptually. I am interested enough that I plan to become at least Python competent, if not proficient, over the summer months. In the end, I identified a far simpler way to accomplish pretty much what I needed.
Author
I am Sheryl L. Chatfield, Ph.D., C.T.R.S. I am a member of the faculty in the College of Public Health at Kent State University. I also co-coordinate the Graduate Certificate in Qualitative Research and I am a member of the Design Innovation Team at Kent State.

Archives
February 2024