Get group names in Java regex

You want to use the small named-regexp library. It is a thin wrapper around java.util.regex that adds named capture group support for Java 5 and 6 users.

Sample usage:

Pattern p = Pattern.compile("(?<user>.*)");
Matcher m = p.matcher("JohnDoe");
System.out.println(m.namedGroups()); // {user=JohnDoe}

Maven:

<dependency>
  <groupId>com.github.tony19</groupId>
  <artifactId>named-regexp</artifactId>
  <version>0.2.3</version>
</dependency>

References:

  • named-regexp 0.2.5
  • Matcher#namedGroups

I ran a second regex over the source of the "real" pattern to extract the names of its groups, like this:

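        List<String> namedGroups = new ArrayList<String>();
        {
            // Scan the source of the original pattern for "(?<name>" openers.
            String source = matcher.pattern().pattern();
            // A group name is a letter followed by letters or digits. Matching only
            // that keeps lookbehinds such as (?<=...) and (?<!...) out of the list
            // and also finds named groups nested inside other groups.
            Matcher mG = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>").matcher(source);

            while (mG.find()) {
                namedGroups.add(mG.group(1));
            }
        }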

And then, I added the names with the values into a HashMap<String, String>:

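        Map<String, String> map = new HashMap<String, String>(matcher.groupCount());

        namedGroups.forEach(name -> {
            // start(name) is -1 when the group did not take part in the match.
            // 0 is a valid start index, so the test must be >= 0.
            if (matcher.start(name) >= 0) {
                map.put(name, matcher.group(name));
            } else {
                map.put(name, "");
            }
        });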
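
The two steps above can be combined into one self-contained sketch. The class and method names (NamedGroupMap, groupNames, toMap) are made up for illustration. It assumes the group-name syntax [a-zA-Z][a-zA-Z0-9]* from the Pattern documentation and treats a null result from matcher.group(name) as "group did not match". Like any scan of the pattern source, it can pick up false candidates inside character classes or \Q...\E quoting, in which case matcher.group(name) throws IllegalArgumentException.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupMap {

    // Matches the opener of a named group, e.g. "(?<user>", and captures the name.
    private static final Pattern GROUP_NAME =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    /** Scans the pattern source for named-group openers, in order of appearance. */
    public static List<String> groupNames(Pattern pattern) {
        List<String> names = new ArrayList<>();
        Matcher m = GROUP_NAME.matcher(pattern.pattern());
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Maps each group name to its matched text, or "" if the group did not participate. */
    public static Map<String, String> toMap(Matcher matcher) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String name : groupNames(matcher.pattern())) {
            String value = matcher.group(name);
            map.put(name, value != null ? value : "");
        }
        return map;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?<user>[a-z]+)@(?<host>[a-z.]+)(?::(?<port>\\d+))?");
        Matcher m = p.matcher("alice@example.com");
        if (m.matches()) {
            // "user" starts at index 0 and "port" does not participate in the match.
            System.out.println(toMap(m));
        }
    }
}
```

Here the user group starts at index 0 of the input, which is exactly the case where a start(name) > 0 test would wrongly report the group as empty.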

This is the second easy approach to the problem: we use the Java Reflection API to call the non-public method namedGroups() in the Pattern class, which returns a Map<String, Integer> that maps group names to group numbers. The advantage of this approach is that we don't need an input string that matches the regex in order to find the exact named groups.

Personally, I think it is not much of an advantage, since it is useless to know the named groups of a regex where a match to the regex does not exist among the input strings.

However, please take note of the drawbacks:

  • This approach may fail on a system whose security restrictions deny access to non-public methods (package-private, protected and private methods).
  • The code only applies to the JRE from Oracle or OpenJDK.
  • The code may break in future releases, since it calls a non-public method.
  • Calling a method via reflection may carry a performance hit. (Here the hit comes mainly from the reflection overhead itself, since the namedGroups() method does very little.) I do not know how much this affects overall performance, so please measure it on your own system.

import java.util.Collections;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Pattern;

import java.lang.reflect.Method;
import java.lang.reflect.InvocationTargetException;

class RegexTester {
  public static void main(String args[]) {
    Scanner scanner = new Scanner(System.in);

    String regex = scanner.nextLine();
    // String regex = "(?<group>[a-z]*)[trick(?<nothing>ha)]\\Q(?<quoted>Q+E+)\\E(.*)(?<Another6group>\\w+)";
    Pattern p = Pattern.compile(regex);

    Map<String, Integer> namedGroups = null;
    try {
      namedGroups = getNamedGroups(p);
    } catch (Exception e) {
      // Just an example here. You need to handle the Exception properly
      e.printStackTrace();
    }

    System.out.println(namedGroups);
  }


  @SuppressWarnings("unchecked")
  private static Map<String, Integer> getNamedGroups(Pattern regex)
      throws NoSuchMethodException, SecurityException,
             IllegalAccessException, IllegalArgumentException,
             InvocationTargetException {

    // namedGroups() is not part of the public API here, so look it up
    // by name and force access to it.
    Method namedGroupsMethod = Pattern.class.getDeclaredMethod("namedGroups");
    namedGroupsMethod.setAccessible(true);

    Map<String, Integer> namedGroups =
        (Map<String, Integer>) namedGroupsMethod.invoke(regex);

    if (namedGroups == null) {
      throw new InternalError();
    }

    return Collections.unmodifiableMap(namedGroups);
  }
}
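
Once the name-to-number map is available, it can be paired with a Matcher to read the values by group number. The following is only a sketch: the class and method names (GroupIndexDemo, valuesByName) are hypothetical, and the map is written out by hand as a stand-in for what getNamedGroups(p) would return, so that the example runs without reflection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {

    /** Returns group name -> matched text, ordered by group number. */
    public static Map<String, String> valuesByName(Matcher matcher,
                                                   Map<String, Integer> namedGroups) {
        // Invert to number -> name so that iteration follows group order.
        Map<Integer, String> byNumber = new TreeMap<>();
        for (Map.Entry<String, Integer> e : namedGroups.entrySet()) {
            byNumber.put(e.getValue(), e.getKey());
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : byNumber.entrySet()) {
            // group(int) returns null when the group did not participate in the match.
            result.put(e.getValue(), matcher.group(e.getKey()));
        }
        return result;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})(-(?<day>\\d{2}))?");
        // Hand-written stand-in for the map that getNamedGroups(p) would return.
        // Group 3 is the unnamed wrapper around "day", so "day" is group 4.
        Map<String, Integer> namedGroups = new LinkedHashMap<>();
        namedGroups.put("month", 2);
        namedGroups.put("year", 1);
        namedGroups.put("day", 4);

        Matcher m = p.matcher("2024-05");
        if (m.matches()) {
            System.out.println(valuesByName(m, namedGroups));
        }
    }
}
```

Reading by number avoids a second name lookup per group, and the unmatched day group shows up as null rather than raising an exception.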

There is no public API in Java to obtain the names of the named capturing groups (at least before Java 20, which added a public Pattern.namedGroups() method). I think this is a missing feature.

The easy way out is to pick out candidate named capturing groups from the pattern, then try to access the named group from the match. In other words, you don't know the exact names of the named capturing groups, until you plug in a string that matches the whole pattern.

The pattern to capture the names of the named capturing groups is \(\?<([a-zA-Z][a-zA-Z0-9]*)> (derived from the Pattern class documentation).

(The hard way is to implement a parser for regex and get the names of the capturing groups).
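
To see why the extracted names are only candidates, here is a small sketch that runs the pattern above against a deliberately tricky regex and then drops every name the Matcher rejects. The class and method names (CandidateFilterDemo, confirmedNames) and the input string are made up for illustration. The check relies on Matcher.group(String) throwing IllegalArgumentException for an unknown name, which it only does after a successful match.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CandidateFilterDemo {

    private static final Pattern CANDIDATE =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    /**
     * Returns the candidate names that the matcher recognises as real groups.
     * The matcher must already hold a successful match.
     */
    public static Set<String> confirmedNames(String regex, Matcher matched) {
        Set<String> names = new TreeSet<>();
        Matcher c = CANDIDATE.matcher(regex);
        while (c.find()) {
            names.add(c.group(1));
        }
        Iterator<String> it = names.iterator();
        while (it.hasNext()) {
            try {
                matched.group(it.next());
            } catch (IllegalArgumentException e) {
                it.remove(); // the candidate is not a real group in this pattern
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // "nothing" sits inside a character class and "quoted" inside \Q...\E,
        // so neither is a real group even though both look like one.
        String regex = "(?<group>[a-z]*)[trick(?<nothing>ha)]"
                + "\\Q(?<quoted>Q+E+)\\E(.*)(?<Another6group>\\w+)";
        Matcher m = Pattern.compile(regex).matcher("abct(?<quoted>Q+E+)xyz");
        if (m.find()) {
            System.out.println(confirmedNames(regex, m));
        }
    }
}
```

The scan finds four candidates, and the check against the Matcher keeps only group and Another6group.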

A sample implementation:

import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;
import java.util.Iterator;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.regex.MatchResult;

class RegexTester {

    public static void main(String args[]) {
        Scanner scanner = new Scanner(System.in);

        String regex = scanner.nextLine();
        StringBuilder input = new StringBuilder();
        while (scanner.hasNextLine()) {
            input.append(scanner.nextLine()).append('\n');
        }

        Set<String> namedGroups = getNamedGroupCandidates(regex);

        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(input);
        int groupCount = m.groupCount();

        int matchCount = 0;

        if (m.find()) {
            // Remove invalid groups
            Iterator<String> i = namedGroups.iterator();
            while (i.hasNext()) {
                try {
                    m.group(i.next());
                } catch (IllegalArgumentException e) {
                    i.remove();
                }
            }

            matchCount += 1;
            System.out.println("Match " + matchCount + ":");
            System.out.println("=" + m.group() + "=");
            System.out.println();
            printMatches(m, namedGroups);

            while (m.find()) {
                matchCount += 1;
                System.out.println("Match " + matchCount + ":");
                System.out.println("=" + m.group() + "=");
                System.out.println();
                printMatches(m, namedGroups);
            }
        }
    }

    private static void printMatches(Matcher matcher, Set<String> namedGroups) {
        for (String name: namedGroups) {
            String matchedString = matcher.group(name);
            if (matchedString != null) {
                System.out.println(name + "=" + matchedString + "=");
            } else {
                System.out.println(name + "_");
            }
        }

        System.out.println();

        // groupCount() excludes group 0, so the last group has index groupCount().
        for (int i = 1; i <= matcher.groupCount(); i++) {
            String matchedString = matcher.group(i);
            if (matchedString != null) {
                System.out.println(i + "=" + matchedString + "=");
            } else {
                System.out.println(i + "_");
            }
        }

        System.out.println();
    }

    private static Set<String> getNamedGroupCandidates(String regex) {
        Set<String> namedGroups = new TreeSet<String>();

        Matcher m = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>").matcher(regex);

        while (m.find()) {
            namedGroups.add(m.group(1));
        }

        return namedGroups;
    }
}

There is a caveat to this implementation, though. It currently doesn't work with regex in Pattern.COMMENTS mode.

Tags:

Java

Regex