How to abbreviate HTML with Java?

If you really want to abbreviate the HTML itself, then just do it: cut the text at the desired length, pass the abbreviated result through JTidy (http://jtidy.sourceforge.net/) to repair the markup, and hope for the best.


I don't know of any library for this, but it should not be that complicated (for 80% of cases). You only need a simple "parser" that understands four types of tokens:

  • opening tags - everything that starts with < but not </ and ends with > but not />
  • closing tags - everything that starts with </ and ends with >
  • self-closing tags (like <br/>) - everything that starts with < but not </ and ends with />
  • normal characters - everything that is none of the other types

Then you walk through your input string and count the "normal characters". As you walk along the string, you copy every token to the output as long as the number of normal characters counted so far is less than or equal to the amount you want to keep.

You also need to maintain a stack of the currently open tags while you walk through the input. Every time you pass an opening tag, you push its name onto the stack; every time you find a closing tag, you pop the topmost tag name from the stack (hopefully the input is correct XHTML).

When you reach the required number of normal characters, you only need to write closing HTML tags for the tag names remaining on the stack, topmost first.

But be careful: this works only if the input is well-formed XML.
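
A minimal sketch of the steps above, assuming well-formed XHTML as input. The class name HtmlAbbreviator and the method abbreviate are made up for illustration. It does not handle comments, CDATA, or a > inside an attribute value, and it counts an entity such as &amp; as five characters:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class HtmlAbbreviator {

    /**
     * Copies tokens to the output until maxChars "normal characters" have been
     * written, then closes every tag that is still open.
     * Assumes the input is well-formed XHTML.
     */
    public static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> openTags = new ArrayDeque<>(); // names of currently open tags
        int count = 0;                               // normal characters copied so far
        int i = 0;

        while (i < html.length() && count < maxChars) {
            char c = html.charAt(i);
            if (c != '<') {                          // normal character
                out.append(c);
                count++;
                i++;
                continue;
            }
            int end = html.indexOf('>', i);
            if (end < 0) {
                break;                               // unterminated tag: stop copying
            }
            String tag = html.substring(i, end + 1);
            if (tag.startsWith("</")) {              // closing tag
                if (!openTags.isEmpty()) {
                    openTags.pop();
                }
            } else if (!tag.endsWith("/>")) {        // opening tag (not self-closing)
                openTags.push(tagName(tag));
            }
            out.append(tag);
            i = end + 1;
        }

        // Close whatever is still open, innermost tag first.
        while (!openTags.isEmpty()) {
            out.append("</").append(openTags.pop()).append('>');
        }
        return out.toString();
    }

    /** Extracts "a" from a tag such as <a href="x">. */
    private static String tagName(String tag) {
        int j = 1;
        while (j < tag.length()
                && !Character.isWhitespace(tag.charAt(j))
                && tag.charAt(j) != '>') {
            j++;
        }
        return tag.substring(1, j);
    }
}
```

For example, HtmlAbbreviator.abbreviate("<p>Hello <b>world</b></p>", 7) keeps seven text characters and closes the two open tags, giving <p>Hello <b>w</b></p>.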

I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.


Since it's not very easy to do this correctly, I usually strip all tags and truncate. This gives great control over the text size and appearance, which matters because the truncated text usually ends up in places where you do need that control.
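
A rough sketch of "strip all tags and truncate", with a made-up class name PlainTextTeaser. The regex is a crude tag remover, not a sanitizer, so the result still has to be escaped before it is written back into a page; entities are left as they are:

```java
public class PlainTextTeaser {

    /**
     * Removes everything that looks like a tag, collapses whitespace and
     * cuts the remaining text to at most maxChars characters.
     */
    public static String teaser(String html, int maxChars) {
        String text = html
                .replaceAll("<[^>]*>", " ")   // tags become a space so adjacent blocks do not merge
                .replaceAll("\\s+", " ")      // collapse runs of whitespace
                .trim();
        if (text.length() <= maxChars) {
            return text;
        }
        return text.substring(0, maxChars).trim() + "...";
    }
}
```

Replacing each tag with a space is a trade-off: it keeps the text of neighbouring paragraphs apart, but it also splits a word that has a tag in the middle of it.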

Note that you may find my proposal very conservative, and it is actually not a proper answer to your question. But most of the time the alternatives are:

  • strip all tags and truncate
  • provide an alternate, content-manageable rich text field that serves as the truncated text. This of course only works with CMSes and the like

The reason truncating HTML is hard is that you don't know how the truncation will affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?

So the problem here is that HTML can contain not only content and styling (bold, italics) but also structure (lists, tables, divs etc.). So a good and safe implementation would be to strip out everything apart from inline "styling" tags (bold, italics etc.) and truncate while keeping track of unclosed tags.
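
A sketch of the first half of that idea, assuming Java 9 or later: remove every tag except a small whitelist of inline tags, and drop the attributes of the ones that are kept. The class name InlineTagFilter and the whitelist are only examples. Anything between < and > is treated as a tag, so text such as "a < b > c" would be damaged. The output would then still need the truncation step that tracks unclosed tags:

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineTagFilter {

    // Example whitelist of inline "styling" tags; adjust as needed.
    private static final Set<String> INLINE = Set.of("b", "i", "em", "strong", "u");

    // First alternative: a named opening or closing tag. Second: anything else in <...>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)(?=[\\s/>])[^>]*>|<[^>]*>");

    /** Keeps whitelisted inline tags (without attributes) and removes all other tags. */
    public static String keepInlineOnly(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(2) == null ? "" : m.group(2).toLowerCase(Locale.ROOT);
            boolean selfClosing = m.group().endsWith("/>");
            // Re-emit allowed tags as bare <name> or </name>; everything else is dropped.
            String replacement = INLINE.contains(name) && !selfClosing
                    ? "<" + m.group(1) + name + ">"
                    : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Dropping attributes on the kept tags also removes things like onclick handlers, which helps with the injection concern, but this is not a replacement for a real sanitizer.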