gawk or grep: single line and ungreedy

Using any POSIX awk in any shell on every UNIX box:

$ cat tst.awk
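# A line containing the keyword "class" starts a new definition.
/(^|[[:space:]])class([[:space:]]|$)/ {
    inDef = 1
    fname = FILENAME
    sub(".*/","",fname)    # strip the directory part of the file name
    def = out = ""
}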
inDef {
    out = out fname ":" FNR ": " $0 ORS

    # Remove comments (not perfect but should work for 99.9% of cases)
    sub("//.*","")
    gsub("/[*]|[*]/","\n")
    gsub(/\n[^\n]*\n/,"")

    def = def $0 ORS
    if ( /[{]/ ) {
        # gsub() returns the number of commas; "&" leaves def unchanged
        if ( gsub(/,/,"&",def) > 2 ) {
            printf "%s", out
        }
        inDef = 0
    }
}

$ find tmp -type f -name '*.java' -exec awk -f tst.awk {} +
multiple-lines.java:1: class ClazzA<R extends A,
multiple-lines.java:2:     S extends B<T>, T extends C<T>,
multiple-lines.java:3:     U extends D, W extends E,
multiple-lines.java:4:     X extends F, Y extends G, Z extends H>
multiple-lines.java:5:     extends OtherClazz<S> implements I<T> {
single-line.java:1: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

The above was run using this input:

$ head tmp/*
==> tmp/X-no-parameter.java <==
class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {

  public void method(Type<A, B> x) {
    // ... code ...
  }
}

==> tmp/X-one-parameter.java <==
class ClazzD<R extends A>  // only one type parameter
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-line-parameters.java <==
class ClazzF<R extends A,  // only two type parameters
    S extends B<T>>        // on two lines
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-parameters.java <==
class ClazzE<R extends A, S extends B<T>>  // only two type parameters
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/multiple-lines.java <==
class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

==> tmp/single-line.java <==
class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

The above is a best effort: it doesn't parse the language, and only the OP's posted sample input/output was available to show what needs to be handled.
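
Two idioms in tst.awk may be easier to follow in isolation. gsub(/,/,"&",def) replaces every comma with itself, so def is left unchanged and the return value is simply the number of commas. The comment stripping first turns both /* and */ into newline markers and then deletes whatever sits between two markers. Below is a minimal sketch with made-up input lines; count_commas and strip_comments are hypothetical helper names, not part of the script above:

```shell
#!/bin/sh
# count_commas: print the number of commas in each input line.
# gsub() returns the number of substitutions made; "&" re-inserts the
# matched text, so the line itself is not modified.
count_commas() {
    awk '{ print gsub(/,/, "&") }'
}

# strip_comments: the same three substitutions that tst.awk applies.
strip_comments() {
    awk '{
        sub("//.*", "")             # drop a trailing line comment
        gsub("/[*]|[*]/", "\n")     # mark both block-comment delimiters
        gsub(/\n[^\n]*\n/, "")      # delete everything between two markers
        print
    }'
}

printf '%s\n' 'class ClazzB<R extends A, S extends B<T>, T extends C<T>>' |
    count_commas
printf '%s\n' 'class ClazzC /* no, type, parameter */ extends OtherClazz<S> { // x, y' |
    strip_comments
```

Run as a script, this should print 2 followed by the ClazzC line with both comments removed (a double space is left where the block comment was). Stripping comments before counting matters because commas inside a comment would otherwise be counted.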


Note: comments in the input can cause the ripgrep and grep solutions below to fail.

With ripgrep (https://github.com/BurntSushi/ripgrep):

rg -nU --no-heading '(?s)class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java
  • -n enables line numbering (this is the default if output is to the terminal)
  • -U enables multiline matching
  • --no-heading: by default, ripgrep groups matching lines under a filename header; this option makes ripgrep behave like GNU grep, with a filename prefix on each output line
  • [^{]* is used instead of .* to prevent matching , and > elsewhere in the file; otherwise lines like public void method(Type<Q, R> x) { would also be matched
  • the -m option can be used to limit the number of matches per input file, with the added benefit that the rest of the file doesn't have to be searched

If you use the above regexp with GNU grep, note that:

  • grep matches only one line at a time. With the -z option, grep treats ASCII NUL as the record separator, which effectively lets you match across multiple lines, provided the input contains no NUL characters that would prevent such matching. Another effect of -z is that a NUL character is appended to each output record (this can be fixed by piping the results to tr '\0' '\n')
  • the -o option is needed to print only the matching portion, which means you won't be able to get the line number prefix
  • -P isn't needed for this task: grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java | tr '\0' '\n' gives a result similar to the ripgrep command. However, you won't get the line number prefix, the filename prefix appears once per matching portion instead of on each matching line, and you won't get the rest of the line before class and after {
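
To see -z, -o and the [^{]* class working together, here is a small sketch. It assumes GNU grep (for -z, \s and \w) and writes a throwaway Sample.java, a made-up file modeled on the sample input, into a temporary directory. find_decl is a hypothetical helper name:

```shell
#!/bin/sh
# Assumes GNU grep. The file name and its contents are made up for the demo.
dir=$(mktemp -d)
cat > "$dir/Sample.java" <<'EOF'
class ClazzA<R extends A,
    S extends B<T>, T extends C<T>>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
  }
}
EOF

# find_decl FILE: print each class declaration whose type parameter list
# contains at least two commas, up to and including the opening brace.
# -z makes the whole file one record, -o prints only the matching portion,
# and tr turns the NUL that -z appends to each result into a newline.
find_decl() {
    grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' "$1" | tr '\0' '\n'
}

find_decl "$dir/Sample.java"
# Remove the temporary directory with: rm -rf "$dir"
```

With a single file argument grep prints no filename prefix, so the output should be just the three declaration lines. The method line is not part of the match because [^{]* cannot cross the first {.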