Reliable way to only get the email text, excluding previous emails

I think this should work

import re
# strings holds the full email text
# regex for a header like "On Thu, Mar 24, 2011 at 3:51 PM"
string_list = re.findall(r"\w+\s+\w+[,]\s+\w+\s+\d+[,]\s+\d+\s+\w+\s+\d+[:]\d+\s+\w+.*", strings)
res = strings.split(string_list[0]) # split on the first match
print(res[0]) # text before the match, i.e. the new message

The answer @LAMRIN TAWSRAS gave parses the text before the Gmail date expression only if a match is found; otherwise string_list[0] raises an IndexError. There also isn't a need to search the entire message for multiple date expressions, since only the first one matters. Therefore, I would refine that solution to use re.search():

import re

def get_body_before_gmail_reply_date(msg):
  # default: return the whole message when no reply header is found
  body_before_gmail_reply = msg
  # regex for a date format like "On Thu, Mar 24, 2011 at 3:51 PM"
  matching_string_obj = re.search(r"\w+\s+\w+[,]\s+\w+\s+\d+[,]\s+\d+\s+\w+\s+\d+[:]\d+\s+\w+.*", msg)
  if matching_string_obj:
    # keep only the text before the match, i.e. the body of the new email
    body_before_gmail_reply = msg[:matching_string_obj.start()]
  return body_before_gmail_reply

The formatting of email replies depends on the client. There is no reliable way to extract the newest message without the risk of removing too much or not enough.

However, a common way to mark quotes is by prefixing them with > so lines starting with that character - especially if there are multiple at the very end or beginning of the email - are likely to be quotes.
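As a rough illustration of the > heuristic, here is a minimal sketch that drops a block of quoted lines from the end of a message. The function name strip_trailing_quote_block is made up for this example, and it assumes the quote sits at the very end of the body (bottom-quoted replies); it does nothing about quotes in the middle or at the top.

```python
def strip_trailing_quote_block(body):
    """Remove a trailing run of blank lines and lines starting with '>'.

    Hypothetical helper: only handles quotes at the end of the message.
    """
    lines = body.splitlines()
    end = len(lines)
    # Walk backwards while the line is blank or looks like a quote
    while end > 0:
        line = lines[end - 1]
        if line.strip() and not line.lstrip().startswith(">"):
            break
        end -= 1
    return "\n".join(lines[:end])


if __name__ == "__main__":
    sample = "Sounds good.\n\n> old line 1\n> old line 2\n"
    print(strip_trailing_quote_block(sample))
```

A message without any > lines comes back unchanged, apart from trailing blank lines being trimmed.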

But the line On Thu, Mar 24, 2011 at 3:51 PM, <[email protected]> wrote: from your example is hard to extract. A line ending with a : right before a quote might indicate that it belongs to the quote, but you cannot know that for sure: it could also be part of the new message, with the colon being a mistyped . (on German keyboards, : is SHIFT+.).
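To make the trade-off concrete, here is a hedged sketch of the "line ending with : right before a quote" heuristic. The name cut_at_first_quote and the drop_attribution flag are invented for this example. It cuts the message at the first > line and, optionally, also drops the non-blank line directly above it when that line ends with a colon, which is exactly the guess that can go wrong.

```python
def cut_at_first_quote(body, drop_attribution=True):
    """Return the text before the first '>' line.

    Hypothetical helper. With drop_attribution=True, the line directly
    above the quote is also removed if it ends with ':'. This is a guess
    and can delete part of the new message.
    """
    lines = body.splitlines()
    # Index of the first line that looks like a quote, if any
    first_quote = next(
        (i for i, line in enumerate(lines) if line.lstrip().startswith(">")),
        None,
    )
    if first_quote is None:
        return body
    end = first_quote
    # Skip blank lines sitting between the text and the quote
    while end > 0 and not lines[end - 1].strip():
        end -= 1
    # Treat a preceding "...:" line as the attribution line
    if drop_attribution and end > 0 and lines[end - 1].rstrip().endswith(":"):
        end -= 1
    return "\n".join(lines[:end]).rstrip()


if __name__ == "__main__":
    reply = (
        "Thanks, see you then.\n\n"
        "On Thu, Mar 24, 2011 at 3:51 PM, Someone wrote:\n"
        "> Are we still meeting?\n"
    )
    print(cut_at_first_quote(reply))
```

The same rule removes "Here is what you said:" when the author wrote that line themselves, so it is worth keeping drop_attribution switchable.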