Get the actual email message that the person just wrote, excluding any quoted text

Unfortunately, you're in for a world of hurt if you want to clean up emails meticulously (removing everything that's not part of the actual reply itself). The ideal way would be to, as you suggest, write a regex for each popular email client/service, but that's a pretty ridiculous amount of work, and I recommend being lazy and dumb about it.

Interestingly enough, even Facebook engineers have trouble with this problem, and Google has a patent on a method for "Detecting quoted text".

There are three solutions you might find acceptable:

Leave It Alone

The first solution is to just leave everything in the message. Most email clients do this, and nobody seems to complain. Of course, online message systems (like Facebook's 'Messages') look pretty odd if they have inception-style replies. One sneaky way to make this work okay is to render the message with any quoted lines collapsed, and include a little link to 'expand quoted text'.
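As a rough sketch of the collapsing idea (in Python here, though the logic ports to anything): group consecutive lines into quoted and unquoted runs, then render each quoted run as a placeholder. The function names and the placeholder text are made up for illustration, and "quoted" is assumed to mean "starts with >".

```python
from itertools import groupby


def split_quoted(body):
    """Group consecutive lines into (is_quoted, text) segments."""
    segments = []
    lines = body.splitlines()
    # Assumption: a line is "quoted" if it starts with '>' after optional indentation.
    for is_quoted, group in groupby(lines, key=lambda l: l.lstrip().startswith(">")):
        segments.append((is_quoted, "\n".join(group)))
    return segments


def render_collapsed(body, placeholder="[expand quoted text]"):
    """Replace each run of quoted lines with a single placeholder line."""
    parts = []
    for is_quoted, text in split_quoted(body):
        parts.append(placeholder if is_quoted else text)
    return "\n".join(parts)
```

In a real UI the placeholder would be a link or toggle that reveals the original quoted segment, which you still have from split_quoted.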

Separate the Reply from the Older Message

The second solution, as you mention, is to put a delineating message at the top of your messages, like --------- please reply above this line ----------, and then strip that line and anything below when processing the replies. Many systems do this, and it's not the worst thing in the world... but it does make your email look more 'automated' and less personal (in my opinion).
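Stripping at the marker is only a few lines. Here's a minimal sketch in Python; the marker text and the function name are placeholders, so use whatever your outgoing emails actually contain.

```python
# Hypothetical marker text; match it to what your outgoing emails really include.
MARKER = "please reply above this line"


def reply_above_marker(body, marker=MARKER):
    """Return the text above the first line that contains the marker."""
    kept = []
    for line in body.splitlines():
        # Substring match, so it still works if the client prefixed the line with "> ".
        if marker.lower() in line.lower():
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

Note that any "On [date], [person] wrote:" line the client inserts above the marker will survive this, so you may still want to trim that separately.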

Strip Out Quoted Text

The last solution is to simply strip out any line beginning with a >, which is, presumably, a line quoted from the email being replied to. Most email clients use this method of indicating quoted text. Here's some regex (in PHP) that would do just that:

$clean_text = preg_replace('/(^\w.+:\n)?(^>.*(\n|$))+/mi', '', $message_body);
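For comparison, roughly the same pattern in Python, as a sketch rather than a drop-in replacement. The function name is made up, and I've added line-ending normalization because the optional "header" part of the pattern expects a bare \n after the colon.

```python
import re

# Optional attribution line ending in ':' directly above the quote,
# followed by one or more lines that start with '>'.
QUOTED_BLOCK = re.compile(r"(^\w.+:\n)?(^>.*(\n|$))+", re.MULTILINE)


def strip_quoted(message_body):
    """Remove '>'-quoted blocks and a header line immediately above them."""
    # Emails usually arrive with CRLF line endings; normalize them first.
    text = message_body.replace("\r\n", "\n")
    return QUOTED_BLOCK.sub("", text).strip()
```

Usage: strip_quoted(raw_text) on the plain-text part of the message, before any HTML escaping.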

There are some problems using this simpler method:

  • Many email clients also let people deliberately quote parts of earlier emails, and preface those lines with > as well, so you'll be stripping out quotes the sender meant to include.
  • Usually, there's a line above the quoted email with something like On [date], [person] said. This line is hard to remove, because it's not formatted the same among different email clients, and it may be one or two lines above the quoted text you removed. I've implemented this detection method, with moderate success, in my PHP Imap library.
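To give an idea of what handling that attribution line can look like, here's a hedged Python sketch. It is not the code from my library: the names are invented, and it assumes an English-style "On ... wrote:" or "... said:" line that may wrap onto two lines and may be separated from the quote by blank lines. Other clients and locales will need other patterns.

```python
import re

# Assumption: the attribution ends with "wrote:" or "said:" (English clients only).
ATTRIBUTION_END = re.compile(r"(wrote|said):\s*$", re.IGNORECASE)


def _drop_attribution(kept):
    """Remove a trailing attribution (one or two lines) plus blank lines after it."""
    i = len(kept)
    while i > 0 and not kept[i - 1].strip():
        i -= 1  # skip blank lines between the attribution and the quote
    if i == 0 or not ATTRIBUTION_END.search(kept[i - 1]):
        return
    start = i - 1
    # The attribution may have wrapped: "On <date>, <name>" / "<address> wrote:".
    if (not kept[start].lstrip().startswith("On ")
            and start > 0 and kept[start - 1].lstrip().startswith("On ")):
        start -= 1
    del kept[start:]


def strip_quotes_and_attribution(body):
    """Drop '>' lines and the attribution line(s) just above them."""
    kept = []
    for line in body.replace("\r\n", "\n").split("\n"):
        if line.startswith(">"):
            _drop_attribution(kept)
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

This will happily eat a sentence of the sender's own that ends in "wrote:" right before a quote, which is exactly the kind of tradeoff you have to test against real mail.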

Of course, testing is key, and the tradeoffs might be worth it for your particular system. YMMV.


There are many libraries out there that can help you extract the reply/signature from a message:

  • Ruby: https://github.com/github/email_reply_parser
  • Python: https://github.com/zapier/email-reply-parser or https://github.com/mailgun/talon
  • JavaScript: https://github.com/turt2live/node-email-reply-parser
  • Java: https://github.com/edlio/EmailReplyParser
  • PHP: https://github.com/willdurand/EmailReplyParser

I've also read that Mailgun offers a service to parse inbound email and POST its content to a URL of your choice. It will automatically strip quoted text from your emails: http://blog.mailgun.com/handle-incoming-emails-like-a-pro-mailgun-api-2-0/
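If you go that route, the receiving end can stay tiny. The sketch below (Python, standard library only) assumes the POST is form-encoded and that the stripped reply arrives in a field named stripped-text, with the full body in body-plain. Those field names are from my memory of Mailgun's docs, so check the current documentation before relying on them. The function name is made up.

```python
from urllib.parse import parse_qs


def reply_from_form(raw_body):
    """Pick the reply text out of a form-encoded inbound-mail POST body."""
    fields = parse_qs(raw_body, keep_blank_values=True)
    # Assumed field names: prefer the pre-stripped text, fall back to the full body.
    for name in ("stripped-text", "body-plain"):
        values = fields.get(name)
        if values and values[0].strip():
            return values[0]
    return ""
```

Messages with attachments are typically posted as multipart form data instead, which needs a proper multipart parser (your web framework most likely has one).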

Hope that helps!


Possibly helpful: quotequail is a Python library that helps identify quoted text in emails