Java: How to extract an HTML tag from a String using Pattern and Matcher

Problem: In a Java program, you want a way to extract a simple HTML tag from a String, and you don't want to use a more complicated approach, such as a full HTML parser.

Solution: Use the Java Pattern and Matcher classes, and supply a regular expression (regex) to the Pattern class that defines the tag you want to extract. Then use the find method of the Matcher class to see if there is a match, and if so, use the group method to extract the actual group of characters from the String that matches your regular expression.

In the following source code I demonstrate how to extract the contents from a code tag from a longer HTML string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * A complete Java program that demonstrates how to
 * extract a tag from a line of HTML using the Pattern
 * and Matcher classes.
 */
public class PatternMatcherGroupHtml {

  public static void main(String[] args) {

    String stringToSearch = "<p>Yada yada yada <code>StringBuffer</code> yada yada ...</p>";

    // the pattern we want to search for
    Pattern p = Pattern.compile("<code>(\\S+)</code>");
    Matcher m = p.matcher(stringToSearch);

    // if we find a match, get the group 
    if (m.find()) {

      // get the matching group
      String codeGroup = m.group(1);
      
      // print the group
      System.out.format("'%s'\n", codeGroup);

    }

  }
}

By using a group to extract the contents between the HTML opening and closing code tags, the output from this program is:

'StringBuffer'

Discussion

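In this example, the regex "<code>(\\S+)</code>" lets me capture the text between the opening and closing code tags as a group. Note that \S+ matches one or more non-whitespace characters, so this pattern only matches a code tag whose contents contain no spaces. I then access this group using this line of code: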

String codeGroup = m.group(1);
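
If the contents of the tag can contain spaces, the \S+ group won't match them. The following is a minimal sketch of one way around that, assuming a reluctant quantifier, (.+?), is good enough for your input. The class name, method name, and sample String are hypothetical and used only for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeTagWithSpaces {

  // (.+?) is reluctant: it matches as few characters as possible,
  // so the group ends at the first closing code tag
  private static final Pattern CODE_TAG = Pattern.compile("<code>(.+?)</code>");

  // returns the contents of the first code tag, or null if there is no match
  public static String firstCodeTag(String html) {
    Matcher m = CODE_TAG.matcher(html);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    String html = "<p>Use <code>new StringBuffer()</code> or <code>StringBuilder</code>.</p>";

    // prints 'new StringBuffer()'
    System.out.format("'%s'\n", firstCodeTag(html));
  }
}
```

By default the dot does not match line terminators, so this sketch still assumes the opening and closing tags are on the same line. Passing Pattern.DOTALL as a second argument to Pattern.compile is one way to relax that.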

Finding all matching groups

It’s important to note that this example calls the find method only once, so it extracts only the first occurrence of this group. In a more robust example, where you want to find and extract the contents of every code tag, your code would look more like this, using a while loop with the find method:

while (m.find())
{
  String codeGroup = m.group(1);
  System.out.format("'%s'\n", codeGroup);
}

This code repeatedly calls the find method and prints the contents of the matching group until find doesn't locate any more matching patterns in the given String.
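
For reference, here is how that loop might look in a complete program. This is only a sketch: it reuses the same regex as above, and the class name, the extractAll method, and the sample String are hypothetical names introduced for this example. Instead of printing each group right away, it collects them in a List:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternMatcherFindAll {

  // the same pattern used in the example above
  private static final Pattern CODE_TAG = Pattern.compile("<code>(\\S+)</code>");

  // collects the contents of every code tag found in the given String
  public static List<String> extractAll(String html) {
    List<String> groups = new ArrayList<>();
    Matcher m = CODE_TAG.matcher(html);

    // each call to find() resumes searching after the previous match
    while (m.find()) {
      groups.add(m.group(1));
    }
    return groups;
  }

  public static void main(String[] args) {
    String html = "<p>Use <code>StringBuffer</code> or <code>StringBuilder</code> here.</p>";

    // prints 'StringBuffer' and then 'StringBuilder', one per line
    for (String group : extractAll(html)) {
      System.out.format("'%s'\n", group);
    }
  }
}
```

Returning a List rather than printing inside the loop is just a design choice that makes the result easier to reuse. If no code tag is found, the List is simply empty.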

In summary, if you wanted to see a simple way to extract an HTML tag from a String in Java using a regular expression, I hope this is helpful.
