Corpus Analysis for Salesforce?

I won't share the Python script I used here, because it is not a pretty thing, but I can describe my approach. In essence, all I did was count how often sequences of consecutive words occur. I experimented with the regular expressions a fair amount and settled on the following:

expression = r".*".join([r"[\w'@]+"] * word_count)
one word     [\w'@]+
two words    [\w'@]+.*[\w'@]+
three words  [\w'@]+.*[\w'@]+.*[\w'@]+
etc.

I ran this for up to 7 words, but only found useful data for 1 to 5.
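For illustration, here is a minimal sketch of how the expression could be built and applied to one email body. It is not the original script: the helper names build_expression and phrases_in, the use of re.findall, and the lower-casing are all assumptions.

```python
import re

def build_expression(word_count):
    # One "word" is a run of word characters, apostrophes, or @ signs.
    word = r"[\w'@]+"
    # Join copies of the word pattern with ".*", as in the expression above.
    return r".*".join([word] * word_count)

def phrases_in(body, word_count):
    # Lower-casing is an assumption, so "From" and "from" are one phrase.
    pattern = re.compile(build_expression(word_count))
    # A set keeps each distinct match once per email body.
    return set(pattern.findall(body.lower()))

if __name__ == "__main__":
    for n in range(1, 4):
        print(n, build_expression(n))
```

One thing to keep in mind with this sketch: ".*" is greedy and does not cross newlines by default, so on a line with more than two words the two-word expression matches from the first word through the last one on that line.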

Using this expression, I generated the set of all matches for each email body. Then, for each category, I counted how many emails each set element appeared in. This gave me a basic data structure like:

phrase     support    billing
from       53595      16514
message    41649      15372
your       41493      16534
this       37288      13067
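The counting step above could look roughly like the sketch below. The input shape (a list of (category, body) pairs) and the name count_emails_per_phrase are hypothetical, and only the one-word expression is shown to keep it short.

```python
import re
from collections import Counter, defaultdict

# The one-word expression from above.
WORD = re.compile(r"[\w'@]+")

def count_emails_per_phrase(emails):
    # emails: iterable of (category, body) pairs (an assumed input shape).
    counts = defaultdict(Counter)
    for category, body in emails:
        # Using a set means a phrase counts at most once per email,
        # so the totals are "number of emails", not "number of occurrences".
        for phrase in set(WORD.findall(body.lower())):
            counts[phrase][category] += 1
    return counts

if __name__ == "__main__":
    sample = [
        ("support", "Message from your device"),
        ("billing", "From your invoice"),
    ]
    for phrase, per_category in sorted(count_emails_per_phrase(sample).items()):
        print(phrase, per_category["support"], per_category["billing"])
```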

Not super useful. But I know that support has 91.2k records and billing has 31.2k records, so I can make this a little more valuable by expressing each count as a percentage of its category's total.

phrase     support    billing    support %    billing %
from       53595      16514      58.77%       52.93%
message    41649      15372      45.67%       49.27%
your       41493      16534      45.50%       52.99%
this       37288      13067      40.89%       41.88%
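The percentage columns are just each count divided by the number of records in that category. A small sketch, assuming the rounded totals of 91,200 and 31,200 (the exact record counts are not given, so the last digit may differ slightly from the table):

```python
# Rounded category sizes taken from "91.2k" and "31.2k"; exact totals are unknown.
TOTALS = {"support": 91200, "billing": 31200}

def percentages(counts_by_category, totals=TOTALS):
    # Share of each category's emails that contain the phrase, in percent.
    return {
        category: 100.0 * counts_by_category.get(category, 0) / total
        for category, total in totals.items()
    }

if __name__ == "__main__":
    row = percentages({"support": 53595, "billing": 16514})  # the "from" row
    print({category: round(value, 2) for category, value in row.items()})
```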

From there, I can compute the ratio of support % to billing %, and vice versa, and use it as a rough estimate of predictive power.

phrase            support %    billing %    support ratio    billing ratio
origin            9.60%        0.47%        20.52            0.05
persons           8.07%        0.70%        11.55            0.09
entities          8.01%        0.66%        12.08            0.08
retransmission    7.99%        0.58%        13.69            0.07
hesitate          0.54%        6.54%        0.08             12.16
postal            0.03%        5.28%        0.01             155.39
postallog         0.00%        5.17%        0                1000
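A sketch of the ratio calculation follows. The postallog row suggests that a phrase which never appears in the other category gets a fixed value of 1000 instead of a division by zero; treating 1000 as a cap is an assumption here, as is the function name ratios. The table's ratios were presumably computed from unrounded percentages, so feeding in the rounded ones gives slightly different numbers (about 20.43 rather than 20.52 for origin).

```python
def ratios(support_pct, billing_pct, cap=1000.0):
    # cap is an assumed stand-in for "the other side is zero", based on the
    # postallog row; it is not confirmed by the original script.
    def safe_ratio(numerator, denominator):
        if denominator == 0:
            return cap if numerator > 0 else 0.0
        return min(numerator / denominator, cap)

    support_ratio = safe_ratio(support_pct, billing_pct)
    billing_ratio = safe_ratio(billing_pct, support_pct)
    return support_ratio, billing_ratio

if __name__ == "__main__":
    print(ratios(9.60, 0.47))   # the "origin" row, from rounded percentages
    print(ratios(0.00, 5.17))   # the "postallog" row
```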

I first kept every phrase with a ratio over 10 for either category, but that ended up predicting billing quite poorly. So I raised the billing threshold and only used phrases with a billing ratio of at least 100.
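The filtering described above might be sketched like this. The row shape (phrase, support ratio, billing ratio) and the name select_predictors are hypothetical; the thresholds are the ones mentioned in the text.

```python
def select_predictors(rows, support_min=10.0, billing_min=100.0):
    # rows: iterable of (phrase, support_ratio, billing_ratio) tuples.
    support, billing = [], []
    for phrase, support_ratio, billing_ratio in rows:
        if support_ratio > support_min:
            support.append(phrase)       # "over 10" for support
        elif billing_ratio >= billing_min:
            billing.append(phrase)       # "at least 100" for billing
    return support, billing

if __name__ == "__main__":
    table = [
        ("origin", 20.52, 0.05),
        ("persons", 11.55, 0.09),
        ("hesitate", 0.08, 12.16),
        ("postal", 0.01, 155.39),
    ]
    print(select_predictors(table))
```

With the stricter billing threshold, a phrase like hesitate (billing ratio 12.16) no longer counts as a billing predictor, while postal still does.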

I then used these phrases to categorize the existing data. Each category's expression is just an OR-join of all of its predictors, e.g. (?si)(origin|persons|entities). The results:

Category    % Emails Matched    Accuracy
Support     50.3%               89.4%
Billing     7.8%                95.6%
Unmatched   41.9%               0%
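For completeness, a sketch of the categorization step. The OR-join and the (?si) flags come from the text; the function names, the use of re.escape, and checking billing before support when both would match are assumptions, since the post does not say how ties were handled.

```python
import re

def build_classifier(predictors):
    # OR-join of all predictors; re.escape keeps each phrase literal.
    # (?si) makes the match case-insensitive and lets "." cross newlines.
    return re.compile("(?si)(" + "|".join(re.escape(p) for p in predictors) + ")")

def categorize(body, support_re, billing_re):
    # Assumption: billing is checked first because its predictors passed
    # the stricter threshold; the original tie-breaking rule is unknown.
    if billing_re.search(body):
        return "billing"
    if support_re.search(body):
        return "support"
    return "unmatched"

if __name__ == "__main__":
    support_re = build_classifier(["origin", "persons", "entities"])
    billing_re = build_classifier(["postal", "postallog"])
    print(categorize("Intended for the named persons only", support_re, billing_re))
```

Accuracy per category would then be the share of matched emails whose predicted category equals the one they were actually filed under.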
