Here's how to use mencoder to rip the vobsub subtitles from a DVD, disc image or video_ts folder, and then convert the vobsubs to text .srt if required.
You need the midentify script. You can get it from mplayer's source package TOOLS directory. I've attached it to this post gzipped if that's easier for you. It should be in /usr/bin not /usr/local/bin and must be executable.
First thing to note is that this method is for English subs. These are simplest because, at least on region 1 or 2 English language movies, the English subs are always the first and have subtitle id 0. For other languages you'll have to check each movie/disk individually for your preferred language's sid (subtitle ID), or modify the script to use midentify and grep the language and hence the sid you want.
So to dump the vobsubs I use a script:
Code: Select all
#!/bin/bash
# dumpenglishvobsub.sh
# input must be <devicename> <dvd://#>
# device can be physical device, disk image, video_ts folder
# lsdvd only needs the device (first argument); quote the name in case it contains spaces
DISKNAME="$(lsdvd "$1" | grep 'Disc Title' | awk -F ': ' '{print $2}')_VOBSUBS"
mkdir -p "$DISKNAME"
mencoder -dvd-device "$@" -ovc copy -o /dev/null \
-nosound -sid 0 -vobsuboutindex 0 -vobsubout "$DISKNAME/0_en"
So for example to dump the sub for title 2 of my disc image MYDISC.iso I'd run
Code: Select all
$ dumpenglishvobsub.sh MYDISC.iso dvd://02
and the vobsubs will be dumped to a new folder, named according to the disc title used by the manufacturer. Usually that's the name of the movie but some studios are lazy and it may be a serial number or the unimaginative DVD_VIDEO.
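If you need a language other than English, the sid lookup mentioned earlier could be sketched roughly like this. It's only a sketch: sid_for_lang is a made-up helper name, and it assumes you've already saved midentify's output to a file whose subtitle lines look like ID_SID_0_LANG=en.

```shell
# sid_for_lang LANG FILE
# Hypothetical helper: prints the first subtitle id whose language code
# is LANG, reading saved midentify output from FILE.
# Prints nothing if the language isn't on the disc.
sid_for_lang() {
    sed -n "s/^ID_SID_\([0-9][0-9]*\)_LANG=$1\$/\1/p" "$2" | head -n 1
}
```

You'd then pass the result to mencoder in place of the hard-coded 0, for example -sid "$(sid_for_lang fr dvdinfo)", after checking that it isn't empty.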
If you want the vobsubs for every language:
Code: Select all
#!/bin/bash
# ripvobsubs.sh
# input must be <devicename> <dvd://#>
# device can be physical device, disk image, video_ts folder etc
# get info about dvd
midentify -dvd-device "$@" >dvdinfo
# write subtitle lang identifiers (lines like ID_SID_0_LANG=en) to file
grep 'ID_SID' dvdinfo >subs.txt
# set variable: highest subtitle id (number of subtitle tracks minus 1)
NUMSUB=$(( $(wc -l <subs.txt) - 1 ))
# set variable of dvd diskname_VOBSUBS
DISKNAME="$(lsdvd "$1" | grep 'Disc Title' | awk -F ': ' '{print $2}')_VOBSUBS"
mkdir -p "$DISKNAME"
echo "$NUMSUB"
for SUBNUM in $(seq 0 "$NUMSUB") ; do
# use mencoder to dump vobsubs. one mencoder pass per subtitle.
# runs at about 2000 fps
mencoder -dvd-device "$@" -ovc copy -o /dev/null \
-nosound -sid "$SUBNUM" -vobsuboutindex 0 -vobsubout \
"$DISKNAME/$(grep "_${SUBNUM}_" subs.txt | awk -F '=' '{print $2}')_$SUBNUM"
done
Again this creates a folder named after the disc title and dumps all the vobsubs into it, naming each set by its language and sid. This is surprisingly fast. I've seen some people recommend '-ovc raw' but that's not right: a straight copy is faster than any kind of manipulation or extraction. When I compared them myself, vobsub extraction with '-ovc raw' ran about 5 or 6 times slower than with '-ovc copy'.
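To preview the names before running any mencoder passes, something like the following might help. It's a sketch only: sub_names is a hypothetical helper, and it assumes the saved midentify output (dvdinfo in the script) has one line per subtitle track of the form ID_SID_0_LANG=en.

```shell
# sub_names FILE
# Hypothetical helper: reads saved midentify output from FILE and prints
# one "<lang>_<sid>" name per subtitle track, matching the naming the
# ripvobsubs.sh loop uses for its output files.
sub_names() {
    sed -n 's/^ID_SID_\([0-9][0-9]*\)_LANG=\(.*\)$/\2_\1/p' "$1"
}
```

Running sub_names dvdinfo prints one name per line, so you can see what will land in the folder.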
If you want to convert your vobsubs to srt it's easy and quite fast, although you might not think so if you follow the subtitleripper/transcode docs or online tutorials, which omit some steps because they haven't been updated in line with the applications. You need the package subtitleripper (get it from debian-multimedia) and a spellchecker such as ispell. If you're using a graphical environment then Gaupol makes for a nice spellchecker for subtitles.
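Copy the ifo file for the title from your DVD/video_ts/disc image. The ifo file for title 1 would be VTS_01_0.IFO, for title 2 would be VTS_02_0.IFO and so on. (Strictly, the number is that of the title set holding the title. If the two differ on your disc, lsdvd -v shows the VTS number for each title.)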
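If you're scripting this step, the ifo file name can be built from the number. A tiny sketch, where ifo_name is a hypothetical helper and the number is assumed to be between 1 and 99:

```shell
# ifo_name NUMBER
# Hypothetical helper: prints the IFO file name for NUMBER,
# zero-padded to two digits as in VTS_01_0.IFO.
ifo_name() {
    # drop one leading zero so input like "08" isn't read as octal by printf
    printf 'VTS_%02d_0.IFO\n' "${1#0}"
}
# e.g. cp /mnt/dvd/VIDEO_TS/"$(ifo_name 2)" .   (/mnt/dvd is an assumed mount point)
```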
So if your English vobsubs are en_0 (you have a file en_0.idx and a file en_0.sub) you can run
Code: Select all
vobsub2pgm -c 255,255,0,255 -i VTS_01_0.IFO -g 2 en_0 english
The '-g 2' is essential because it makes the output gzipped, which the next tool, pgm2txt, now requires. Earlier versions didn't expect gzipped input, and as far as I can tell the new requirement is absent from the docs and from every example I've ever seen. I cursed a lot of people.
Now run pgm2txt on the same output base name (english in this example) and you'll see gocr (optical character recognition) offer you unrecognised characters to identify. The characters composed of ### signs are the unrecognised ones. If you can't recognise any of the characters either then go back a step: delete all the output of vobsub2pgm and try again with a slightly different -c option, so if -c 255,255,0,255 didn't produce useful results then this time try -c 255,0,255,255. Check the man page for more detail, but one or other of these two settings will probably work in most cases. Don't worry, this is all easy and quick to do.
So run pgm2txt again and it should take just a couple of minutes, without too many prompts for input. Now you have a file called english.srtx and you can make an srt file:
Code: Select all
srttool -s -w < english.srtx > moviename.srt
You should now spellcheck the moviename.srt because it will have some errors. On a console system ispell is decent enough but if you have a graphical desktop then a subtitle tool with built in spellchecker such as Gaupol makes a lot of sense. Probably the spellchecking will be the longest part of this process. When it's done you have an srt subtitle file ready to be merged into mkv or mp4 container or be used alongside a movie in avi container. The timings and spelling and grammar should be correct.
Most tutorials you'll see for this use transcode's tccat and tcextract for the initial task of dumping the subs. I prefer mencoder because I'm more familiar with it, and when I've compared them the initial extraction speed is identical. Overall mencoder is definitely faster because it doesn't require a further step (the transcode method requires you to run subtitle2vobsub next).