How to extract closed caption transcript from YouTube video?

Here's how to get the transcript of a YouTube video (when available):

Go to YouTube and open the video of your choice.
Click on the "More actions" button (3 horizontal dots) located next to the Share button.
Click "Open transcript"

Although the syntax may be a little goofy this is a pretty good solution.

Source: http://ccm.net/faq/40644-youtube-how-to-get-the-transcript-of-a-video

Get timedtext file directly from YouTube

curl -s "$youtube_url_or_id"|grep -o '"baseUrl":"https://www.youtube.com/api/timedtext[^"]*lang=en'|cut -d \" -f4|sed 's/\\u0026/\&/g'|xargs curl -Ls|grep -o '<text[^<]*</text>'|sed -E 's/<text start="([^"]*)".*>(.*)<.*/\1 \2/'|sed 's/\xc2\xa0/ /g;s/&/\&/g'|recode xml|awk '{$1=sprintf("%02d:%02d:%02d",$1/3600,$1%3600/60,$1%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1

yt-dlp

yt-dlp supports saving the automaticaly generated closed captions in a JSON format:

cap()(printf %s\\n "${@-$(cat)}"|parallel -j10 -q yt-dlp -i --skip-download --write-auto-sub --sub-format json3 -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.json3;do jq -r '.events[]|select(.segs and .segs[0].utf8!="\n")|(.tStartMs|tostring)+" "+([.segs[]?.utf8]|join(""))' "$f"|awk '{x=$1/1e3;$1=sprintf("%02d:%02d:%02d",x/3600,x%3600/60,x%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.json3}";rm "$f";done)

You can also use the function above to download the captions for all videos on a channel or playlist if you give the ID or URL of the channel or playlist as an argument. When there is an error downloading a single video, the -i (--ignore-errors) option skips the video instead of exiting with an error.

youtube-dl

As of 2022, the format of the VTT and TTML downloaded by youtube-dl --write-auto-sub is messed up so that all subtitle texts are placed under a few long lines so you can't see the timestamps of the subtitles. If you don't need the timestamps, then it shouldn't matter, but otherwise you can fix it by substituting yt-dlp for youtube-dl in the following commands. But with yt-dlp, you can also use a more convenient JSON format, so you don't need the following approach to deal with the VTT subtitle format.

This downloads the subtitles as VTT:

youtube-dl --skip-download --write-auto-sub $youtube_url

The other available formats are ttml, srv3, srv2, and srv1 (shown by --list-subs):

--write-sub
       Write subtitle file

--write-auto-sub
       Write automatically generated subtitle file (YouTube only)

--all-subs
       Download all the available subtitles of the video

--list-subs
       List all available subtitles for the video

--sub-format FORMAT
       Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"

--sub-lang LANGS
       Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags

You can use ffmpeg to convert the subtitle file to another format:

ffmpeg -i input.vtt output.srt

In the VTT subtitles, each subtitle text is repeated three times, and there is typically a new subtitle text every eighth line (but under some mysterious circumstances it's every 12th line instead):

WEBVTT
Kind: captions
Language: en

00:00:01.429 --> 00:00:04.249 align:start position:0%

ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>

00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
 </c>

00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>

00:00:05.930 --> 00:00:05.940 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
 </c>

00:00:05.940 --> 00:00:07.730 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>

00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice


00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>

00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
 </c>

00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>

00:00:11.030 --> 00:00:11.040 align:start position:0%
<c.colorE5E5E5>the circumstances that's brought us
 </c>

This converts the VTT subtitles to a simpler format:

sed '1,/^$/d' *.vtt| # remove the lines at the top of the file
sed 's/<[^>]*>//g'| # remove tags
awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3' | # print each new subtitle text and its start time without milliseconds
awk NF\>1 # remove lines with only one field

Output:

00:00:01 ladies and gentlemen I'd like to thank
00:00:04 you for coming tonight especially at
00:00:05 such short notice
00:00:07 I'm sure mr. Irving will fill you in on
00:00:09 the circumstances that's brought us

In maybe around 10% of videos that I tested with (like for example p9M3shEU-QM and aE05_REXnBc), there were one or more subtitle texts which came 12 and not 8 lines after the previous subtitle text. My workaround is to print every fourth line but to then remove empty lines that have only one field.

Function form:

cap()(printf %s\\n "${@-$(cat)}"|parallel -j10 -q youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.vtt;do sed '1,/^$/d' -- "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3'|awk 'NF>1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.vtt}";rm "$f";done)

How to extract closed caption transcript from YouTube video?

Get timedtext file directly from YouTube

yt-dlp

youtube-dl

Tags:

Youtube

Caption

Related

Recent Posts