NLP reverse tokenizing (going from tokens to nicely formatted sentence)

You can use NLTK for detokenization to some extent. You'll need some post-processing, or to tweak the regexes, but here is a starting point:

import re
from nltk.tokenize.treebank import TreebankWordDetokenizer as Detok

detokenizer = Detok()
text = detokenizer.detokenize(tokens)
# Clean up any leftover whitespace around common punctuation
text = re.sub(r'\s*,\s*', ', ', text)
text = re.sub(r'\s*\.\s*', '. ', text)
text = re.sub(r'\s*\?\s*', '? ', text)
text = text.strip()

There are more edge cases around punctuation, but this is pretty simple and already slightly better than ' '.join(tokens).
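To see what the regex cleanup buys you over a plain join, here is a self-contained sketch (no NLTK needed, with a made-up token list) applying the same substitutions to a naive ' '.join:

```python
import re

tokens = ['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']

# A naive join puts a space before every punctuation mark
naive = ' '.join(tokens)

# The same regex cleanup as above, applied to the naive join
text = re.sub(r'\s*,\s*', ', ', naive)
text = re.sub(r'\s*\.\s*', '. ', text)
text = re.sub(r'\s*\?\s*', '? ', text).strip()

print(naive)  # Hello , world . How are you ?
print(text)   # Hello, world. How are you?
```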


Within spaCy you can always reconstruct the original string using ''.join(token.text_with_ws for token in doc). If all you have is a list of strings, there's no good deterministic solution: you could train a reverse model or apply some approximate rules, but I don't know of a good general-purpose implementation of such a detokenize() function.
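The text_with_ws trick works because spaCy stores each token's trailing whitespace alongside its text, so the concatenation is lossless. A minimal sketch with stand-in token objects (not real spaCy Tokens, just illustrating the idea) looks like this:

```python
from collections import namedtuple

# Stand-in for spaCy's Token: its text plus the whitespace that followed it.
# In real spaCy, token.text_with_ws == token.text + token.whitespace_
Token = namedtuple('Token', ['text', 'whitespace_'])

doc = [
    Token('Hello', ''),
    Token(',', ' '),
    Token('world', ''),
    Token('!', ''),
]

original = ''.join(t.text + t.whitespace_ for t in doc)
print(original)  # Hello, world!
```

This is why detokenizing a bare list of strings is hard: the whitespace information was thrown away at tokenization time, and no rule set can recover it perfectly.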

Tags: python, nlp, spacy