HTML/XML Tag Parsing using Regex in Java

This time, I am going to show how to parse HTML tags such as:
xx <tag a ="b" c=  'd' e=f> yy </tag> zz

(Did you notice the single-quoted, double-quoted and unquoted attribute values, and the spaces around the = sign? It is important to handle all of these variations.)



Here we will use a backreference: the text captured by a group is reused later in the same pattern, so the tag name matched in the start tag must appear again in the end tag, e.g. matching <tag ...> ... </tag>.

First we extract the tag name, the attribute string and the content. For this we use the regex: <(\S+?)(.*?)>(.*?)</\1>

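Here,
  • (\S+?) captures the tag name, and \1 in </\1> is a backreference to that first captured group, so the end tag must repeat the same name.
  • The first (.*?) captures the attributes.
  • The next (.*?) captures the content between the start tag and the end tag.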
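To see the backreference at work on its own, here is a minimal sketch that runs the same tag pattern over a string with two different tags, and over a string whose end tag does not match its start tag. The class name BackreferenceDemo and the helper findTags are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackreferenceDemo {
    // Same tag pattern as above: group 1 = tag name, group 2 = attributes, group 3 = content.
    static final Pattern TAG = Pattern.compile("<(\\S+?)(.*?)>(.*?)</\\1>");

    // Hypothetical helper: returns "name=content" for every tag whose end tag
    // repeats the name captured from the start tag.
    static List<String> findTags(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        // Each end tag must carry the name captured by group 1.
        System.out.println(findTags("<b>bold</b> and <i>italic</i>")); // [b=bold, i=italic]
        // </i> does not repeat "b", so nothing matches.
        System.out.println(findTags("<b>mismatched</i>"));             // []
    }
}
```

Note that . does not match line terminators by default, so this sketch only handles a tag that sits on a single line.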


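Once we have the attribute string, we need to extract the (name, value) pair of each attribute. For simplicity we could use the regex (\w+)="(.*?)", but this only matches attribute="value" with double quotes and no spaces around the = sign. To match representations such as a ="b" c=  'd' e=f, we can use the regex ([\w:\-]+)(\s*=\s*("(.*?)"|'(.*?)'|([^ ]*))|(\s+|\z)). Here group 1 is the attribute name, and groups 4, 5 and 6 hold the double-quoted, single-quoted and unquoted value respectively. The last alternative (\s+|\z) lets an attribute that has no value at all match as well.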
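The group numbers of the attribute regex can be used to pull out the bare value without its quotes. The following is only a sketch under a few assumptions: the class name AttributeMapDemo and the helper parseAttributes are invented for this example, and storing an empty string for an attribute without a value is just one possible choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeMapDemo {
    // Group 1 = name, group 4 = double-quoted value,
    // group 5 = single-quoted value, group 6 = unquoted value.
    static final Pattern ATTR = Pattern.compile(
            "([\\w:\\-]+)(\\s*=\\s*(\"(.*?)\"|'(.*?)'|([^ ]*))|(\\s+|\\z))");

    // Hypothetical helper: collects (name, value) pairs in the order they appear.
    static Map<String, String> parseAttributes(String attributes) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(attributes);
        while (m.find()) {
            String value;
            if (m.group(4) != null) {
                value = m.group(4);   // a ="b"
            } else if (m.group(5) != null) {
                value = m.group(5);   // c=  'd'
            } else if (m.group(6) != null) {
                value = m.group(6);   // e=f
            } else {
                value = "";           // attribute without a value (assumed convention)
            }
            result.put(m.group(1), value);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseAttributes(" a =\"b\" c=  'd' e=f")); // {a=b, c=d, e=f}
    }
}
```

A LinkedHashMap is used so that the attributes come back in the same order as in the tag. A plain HashMap would work too if the order does not matter.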


Here is the complete CODE:

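     import java.util.regex.Matcher;
     import java.util.regex.Pattern;

     public class TagParser {
         public static void main(String[] args) {
             String testHtml = "xx <tag a =\"b\" c=  'd' e=f> contentssss </tag> zz";
             // Group 1 = tag name, group 2 = attribute string, group 3 = content.
             // \1 forces the end tag to repeat the tag name.
             Pattern tagPattern = Pattern.compile("<(\\S+?)(.*?)>(.*?)</\\1>");
             // Simple variant: double-quoted values only, no spaces around '='.
             Pattern attValueDoubleQuoteOnly = Pattern.compile("(\\w+)=\"(.*?)\"");
             // Full variant: double-quoted, single-quoted, unquoted or missing values.
             Pattern attValueAll = Pattern.compile("([\\w:\\-]+)(\\s*=\\s*(\"(.*?)\"|'(.*?)'|([^ ]*))|(\\s+|\\z))");
             Matcher m = tagPattern.matcher(testHtml);
             if (!m.find()) { // group() throws IllegalStateException without a successful match
                 System.out.println("No tag found");
                 return;
             }
             String tagOnly = m.group(0);    // <tag a ="b" c=  'd' e=f> contentssss </tag>
             String tagname = m.group(1);    // tag
             String attributes = m.group(2); // a ="b" c=  'd' e=f (with a leading space)
             String content = m.group(3);    // contentssss (with surrounding spaces)
             System.out.println("Tag Only   : " + tagOnly);
             System.out.println("Tag Name   : " + tagname);
             System.out.println("Attributes : " + attributes);
             System.out.println("Content    : " + content);
             // To try the simpler pattern instead, use:
             // m = attValueDoubleQuoteOnly.matcher(attributes);
             m = attValueAll.matcher(attributes);
             while (m.find()) {
                 System.out.println(" >> " + m.group(0)); // one attribute per match
             }
         }
     }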
Result :
Tag Only   : <tag a ="b" c=  'd' e=f> contentssss </tag>
Tag Name   : tag
Attributes :  a ="b" c=  'd' e=f
Content    :  contentssss
 >> a ="b"
 >> c=  'd'
 >> e=f

See also : Java : Html form parser return map of (name,value) pair of input attribute