Java - extract multiple HTML tags (groups) from a multiline String

Problem: In a Java program, you need to match a pattern against a multiline String or, in a more advanced case, extract one or more regular expression groups from a multiline String.

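Solution: Use the Java Pattern and Matcher classes, and define the regular expression (regex) you want to look for when you compile your Pattern. Place the parts of the regex you want to extract inside grouping parentheses so you can retrieve the text they match from the String. The example below also passes the Pattern.MULTILINE flag when creating the Pattern instance. That flag only changes how the ^ and $ anchors behave (they match at the start and end of every line rather than only at the start and end of the whole input), so it matters only when your regex uses those anchors; a pattern without anchors is matched across newline characters with or without it.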

In the following source code example I demonstrate how to extract the text between the opening and closing HTML code tags from a given multi-line String:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * A complete Java program to demonstrate how to extract multiple
 * HTML tags from a String that contains multiple lines. The
 * Pattern.MULTILINE flag is passed here, but it only affects the
 * ^ and $ anchors, which this pattern does not use.
 */
public class PatternMatcherGroupHtmlMultiline
{
  public static void main(String[] args)
  {
    String stringToSearch = "<p>Yada yada yada <code>foo</code> yada yada ...\n"
      + "more here <code>bar</code> etc etc\n"
      + "and still more <code>baz</code> and now the end</p>\n";

    // the pattern we want to search for; the literal spaces mean a
    // <code> element must have a space on each side to match
    Pattern p = Pattern.compile(" <code>(\\w+)</code> ", Pattern.MULTILINE);
    Matcher m = p.matcher(stringToSearch);

    // print all the matches that we find
    while (m.find())
    {
      System.out.println(m.group(1));
    }

  }
}

The output from this program is:

foo
bar
baz
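
The call to m.group(1) in the program above returns only the text captured by the first pair of parentheses. As a minimal sketch of extracting more than one group per match, the following variation captures the tag name as well as the tag contents. The class name, the extract helper, and the choice of the code and em tags are assumptions made for illustration; they are not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternMatcherMultipleGroups
{
  // group 1 captures the tag name, group 2 captures the text inside the tag;
  // the backreference \1 requires the closing tag to repeat whatever group 1 matched
  static final Pattern TAG = Pattern.compile("<(code|em)>(\\w+)</\\1>");

  // hypothetical helper: returns one "tag=text" entry per match
  static List<String> extract(String input)
  {
    List<String> results = new ArrayList<>();
    Matcher m = TAG.matcher(input);
    while (m.find())
    {
      // m.group(0) would be the whole match, such as "<code>foo</code>"
      results.add(m.group(1) + "=" + m.group(2));
    }
    return results;
  }

  public static void main(String[] args)
  {
    String stringToSearch = "<p>Yada <code>foo</code> yada\n"
      + "more here <em>bar</em> etc</p>\n";
    System.out.println(extract(stringToSearch));
  }
}
```

Run as written, this prints [code=foo, em=bar]. Groups are numbered by the position of their opening parenthesis, counting from the left, and group 0 always refers to the entire match.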

Discussion

The stringToSearch is created with several newline characters (\n) to simulate the multiple lines you might get when reading a file (or input stream) that contains HTML.

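The example passes the Pattern.MULTILINE flag when creating the Pattern object but, despite what the name suggests, the flag is not what lets the regex find matches on several lines. The find method scans the entire input String, newline characters included, whether or not the flag is set. What Pattern.MULTILINE changes is the meaning of the ^ and $ anchors: with the flag they match at the start and end of each line, and without it they match only at the start and end of the whole input. Because the pattern here has no anchors, the program prints the same three matches without the flag.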
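
To see what Pattern.MULTILINE does and does not change, here is a small sketch that runs two patterns with and without the flag. The class name, the findAll helper, and the sample input are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineFlagDemo
{
  // hypothetical helper: collects group 1 of every match
  static List<String> findAll(Pattern p, String input)
  {
    List<String> results = new ArrayList<>();
    Matcher m = p.matcher(input);
    while (m.find())
    {
      results.add(m.group(1));
    }
    return results;
  }

  public static void main(String[] args)
  {
    String input = "first <code>foo</code> line\n"
      + "second <code>bar</code> line\n";

    // no anchors in the regex: the flag makes no difference
    System.out.println(findAll(Pattern.compile("<code>(\\w+)</code>"), input));
    System.out.println(findAll(Pattern.compile("<code>(\\w+)</code>", Pattern.MULTILINE), input));

    // with the ^ anchor: the flag decides whether ^ matches at every line start
    System.out.println(findAll(Pattern.compile("^(\\w+)"), input));
    System.out.println(findAll(Pattern.compile("^(\\w+)", Pattern.MULTILINE), input));
  }
}
```

The first two lines of output are both [foo, bar]. The third is [first], because without the flag ^ matches only at the start of the input, and the fourth is [first, second].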

Another important part of the solution is to use a while loop with the find method to make sure you find all occurrences of your regex pattern in the input String. If you only use an if statement with the find method, you will only get the first match.
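
The difference between an if statement and a while loop can be sketched as follows. The class and method names are hypothetical, and the pattern omits the surrounding spaces used in the main example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFirstVersusFindAll
{
  static final Pattern CODE_TAG = Pattern.compile("<code>(\\w+)</code>");

  // a single call to find() stops at the first match
  static String firstMatch(String input)
  {
    Matcher m = CODE_TAG.matcher(input);
    if (m.find())
    {
      return m.group(1);
    }
    return null; // no match at all
  }

  // each call to find() resumes after the previous match
  static List<String> allMatches(String input)
  {
    List<String> results = new ArrayList<>();
    Matcher m = CODE_TAG.matcher(input);
    while (m.find())
    {
      results.add(m.group(1));
    }
    return results;
  }

  public static void main(String[] args)
  {
    String stringToSearch = "<p>Yada yada yada <code>foo</code> yada yada ...\n"
      + "more here <code>bar</code> etc etc\n"
      + "and still more <code>baz</code> and now the end</p>\n";

    System.out.println(firstMatch(stringToSearch));
    System.out.println(allMatches(stringToSearch));
  }
}
```

This prints foo on the first line and [foo, bar, baz] on the second.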
