Trying to KISS

So I’m sure a lot of you have had to parse HTML using all sorts of
String manipulations, most likely in a web app. This seems like a
very typical problem with no easy solution given the nature of
hand-written HTML (rarely a well-formed document to start with). You end
up doing things like:


while ((stringIndex = result.toString().toLowerCase().indexOf(myTags[j])) != -1) {
   ...
}

which is really buggy by nature.

If you consider a use case where you need to trim some HTML content,
one of the biggest issues is that you don’t want to drop
closing tags, otherwise the rest of the document inherits
the style from a tag left unclosed by the trim (a blog
entry in a page comes to mind). At the same time, you don’t want pieces of
HTML tags (such as <br>)
to show up as visible text after the trimming has occurred. You also don’t want to be
looking indefinitely for closing tags that do not exist.
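To make the problem concrete, here is a minimal sketch of what a naive character-based trim does to markup. The class and method names (NaiveTrimDemo, naiveTrim) are made up for illustration and are not part of the original code:

```java
public class NaiveTrimDemo {

    // Hypothetical helper: cuts the raw HTML source after n characters,
    // with no knowledge of tags at all.
    static String naiveTrim(String html, int n) {
        return html.length() <= n ? html : html.substring(0, n);
    }

    public static void main(String[] args) {
        // The cut lands inside the bold section, so <b> is never closed
        // and everything after the trimmed entry would render in bold.
        System.out.println(naiveTrim("<b>bold text</b> and more", 8));
    }
}
```

Cutting at 8 characters leaves "<b>bold " behind: the opening tag survives and its closing tag does not.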

So while you could be using this or that HTML parser, you could also use
the one that comes with Swing, which has been part of every JVM since 2000,
and write something much safer using a single Swing component:


// Constructor
public SilentSwingHTMLTrim() {
   jEditorPane = new javax.swing.JEditorPane();
   jEditorPane.setContentType("text/html");
   for (int i = 0 ; i<entries.length; i++ ) {
     // trim after 200 characters
     String truncatedEntry = doTrim(entries[i], 200);
     System.out.println(truncatedEntry);
   }
}
  
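// A handful of lines to do the job
private String doTrim(String entry, int trimSize) {
   jEditorPane.setText(entry);
   // Select 'trimSize' characters from the rendered HTML content
   jEditorPane.select(0, trimSize);
   String result = jEditorPane.getSelectedText();
   // getSelectedText() returns null when nothing is selected (e.g. empty content)
   if (result == null)
     return "";
   // Only add an ellipsis when the text was actually cut
   if (result.length() == trimSize)
     result += "...";
   // return the selected text
   return result;
}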

Of course, you never call setVisible(true) on
the editor pane, and all you get back is plain text (no links, no formatting,
etc.). This is a very basic use of the parser. More advanced use is
described here.
Another limitation is that Swing’s HTML parser only supports
HTML 3.2.
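For a rough idea of what driving the parser directly can look like, without going through a JEditorPane, here is a minimal sketch. It relies on javax.swing.text.html.parser.ParserDelegator and HTMLEditorKit.ParserCallback, which ship with Swing. The class name HtmlTextTrim and its trim method are my own, and this is not necessarily the approach the linked article takes:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlTextTrim {

    // Hypothetical helper: collects the text content of the HTML and
    // cuts it after trimSize characters.
    static String trim(String html, int trimSize) {
        final StringBuilder text = new StringBuilder();

        // The callback is invoked for every run of text the parser finds;
        // tags go to other callback methods, which we simply don't override.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declaration
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        if (text.length() <= trimSize) {
            return text.toString();
        }
        return text.substring(0, trimSize) + "...";
    }

    public static void main(String[] args) {
        System.out.println(trim("<p>Hello <b>world</b></p>", 5));
    }
}
```

Like the JEditorPane version, this only returns plain text, but the same callback also exposes handleStartTag and handleEndTag if you want to keep track of which tags are still open.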

So this is no silver bullet technique, just trying to keep things
simple stupid.

Maybe this would work pretty nicely for spell-checking HTML source code…

Author: alexismp

Google Developer Relations in Paris.