Convert doc to txt via the command line

You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format.

For .doc use catdoc:

catdoc foo.doc > foo.txt

For .docx use docx2txt:

docx2txt foo.docx

The latter will produce a file called foo.txt in the same directory as the original.
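If you have a mix of both formats, a small wrapper can pick the right tool by file extension. This is only a rough Python sketch, assuming catdoc and docx2txt are installed and on your PATH; the function names `conversion_command` and `convert` are made up for this example and are not part of either tool.

```python
import subprocess
from pathlib import Path


def conversion_command(path):
    """Return (argv, writes_to_stdout) for a .doc or .docx file."""
    src = Path(path)
    ext = src.suffix.lower()
    if ext == ".doc":
        # catdoc prints the text to stdout, so the caller must redirect it.
        return ["catdoc", str(src)], True
    if ext == ".docx":
        # docx2txt writes foo.txt next to foo.docx by itself.
        return ["docx2txt", str(src)], False
    raise ValueError(f"unsupported extension: {src.suffix!r}")


def convert(path):
    """Convert one file and return the path of the resulting .txt file."""
    argv, writes_to_stdout = conversion_command(path)
    out = Path(path).with_suffix(".txt")
    if writes_to_stdout:
        # Mirror `catdoc foo.doc > foo.txt` by sending stdout to the file.
        with open(out, "wb") as fh:
            subprocess.run(argv, stdout=fh, check=True)
    else:
        subprocess.run(argv, check=True)
    return out
```

Calling `convert("foo.doc")` or `convert("foo.docx")` should leave a foo.txt next to the original in both cases. Any other extension raises a ValueError.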

I'm not sure which Linux distribution you are using, but both catdoc and docx2txt are available from the Ubuntu repositories, for example:

apt-get install docx2txt

Or with Homebrew on Mac:

brew install docx2txt

Here is a Perl project which claims to do it. I have also done a lot of this by hand, using XSLT on document.xml. The .docx file itself is just a zip file; you can unzip it and inspect the elements. I will say that this is not hard to do for specific files, but it is very hard to do in the general case, because of the lack of documentation for how Word stores things internally and the variance in the internal representation.
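To illustrate the "it's just a zip file" point, here is a rough Python sketch that reads word/document.xml straight out of the archive and joins the text runs of each paragraph. It uses the standard zipfile and ElementTree modules rather than XSLT, and `docx_to_text` is just a name picked for this example. It assumes a simple document: headers, footnotes, tabs and line breaks are ignored, and paragraphs nested inside text boxes may show up twice, which is the kind of general-case trouble mentioned above.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each <w:t> element holds one run of text inside the paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Something like `print(docx_to_text("foo.docx"))` is then enough for plain documents. For anything more complicated, docx2txt is probably the safer choice.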
