Tools to extract text from PowerPoint pptx files in Linux?

If you can process the files in bash, this one-liner will unpack all the text:

unzip -qc "$1" ppt/slides/slide*.xml | grep -oP '(?<=\<a:t\>).*?(?=\</a:t\>)'

Just pass it the pptx file as $1. As written, the text goes to standard output, so append > "$2" if you want it written to a file $2. The slides will not necessarily come out in presentation order, and there are no labels or anything, so you'll need a few more lines of script and a temp directory to get a more readable listing.
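
For a labelled listing, something along these lines might do. It is only a sketch: the function name pptx_text is made up, it assumes the slides are stored as ppt/slides/slideN.xml, and it assumes GNU grep with -P support. It sorts by the number in the file name, which usually matches presentation order but is not guaranteed to (the real order lives in ppt/presentation.xml). It reads each slide straight from the archive, so no temp directory is needed.

```shell
#!/usr/bin/env bash
# Sketch: print the text of each slide in a .pptx with a label per slide.
# Assumption: slides are stored as ppt/slides/slideN.xml inside the archive.
pptx_text() {
    local pptx=$1 n
    # List the slide entries, keep only the slide number, sort numerically
    # so that slide10 comes after slide2.
    unzip -Z1 "$pptx" 'ppt/slides/slide*.xml' |
        sed -n 's|^ppt/slides/slide\([0-9][0-9]*\)\.xml$|\1|p' |
        sort -n |
    while read -r n; do
        printf '=== Slide %s ===\n' "$n"
        # Same extraction as the one-liner, one slide at a time.
        # "|| true": a slide with no text runs is not an error.
        unzip -qc "$pptx" "ppt/slides/slide$n.xml" |
            grep -oP '(?<=<a:t>).*?(?=</a:t>)' || true
    done
}
```

Call it as pptx_text deck.pptx > deck.txt. XML entities such as &amp; are left undecoded, and each text run comes out on its own line.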


Since you have AbiWord installed, you can just make a PDF first with LibreOffice:

libreoffice --headless --convert-to pdf filename.pptx

And then use AbiWord to convert the PDF to plain text:

abiword --to=txt filename.pdf