Gradle Support for GCP Dataflow Templates?

Command Line to Run a Cloud Dataflow Job With Gradle

Generic Execution

$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MyPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow" -Pdataflow-runner

Specific Example

$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MySpannerPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow --spannerInstanceId=fooInstance --spannerDatabaseId=barDatabase" -Pdataflow-runner

Explanation of the Command Line

  1. gradle clean execute uses the execute task, which lets us easily pass command-line flags to the Dataflow pipeline. The clean task removes cached builds.

  2. -DmainClass= specifies the Java main class, since we have multiple pipelines in a single folder. Without this, Gradle doesn't know which class is the entry point or where to pass the args. Note: your build.gradle file must include task execute per below.

  3. -Dexec.args= specifies the execution arguments, which are passed to the pipeline. Note: your build.gradle file must include task execute per below.

  4. --runner=DataflowRunner selects the Google Cloud Dataflow runner rather than the local DirectRunner. (-Pdataflow-runner sets a Gradle project property; note that nothing in the build.gradle below reads it, so the runner selection is driven entirely by --runner=.)

  5. --spannerInstanceId= and --spannerDatabaseId= are just pipeline-specific flags; your pipeline will have its own options instead. See the sketch after this list for how such flags reach the pipeline.
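
For reference, here is a minimal sketch of how those flags reach the pipeline: Beam parses the args into a PipelineOptions interface via PipelineOptionsFactory. The package, class, and option names (com.foo.bar.myfolder.MySpannerPipeline, SpannerOptions) are illustrative, not part of any SDK; PipelineOptionsFactory, @Description, and Pipeline.create are standard Beam API.

package com.foo.bar.myfolder;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MySpannerPipeline {

    // Each getter/setter pair below becomes a --flag on the command line.
    public interface SpannerOptions extends PipelineOptions {
        @Description("Cloud Spanner instance to read from")
        String getSpannerInstanceId();
        void setSpannerInstanceId(String value);

        @Description("Cloud Spanner database to read from")
        String getSpannerDatabaseId();
        void setSpannerDatabaseId(String value);
    }

    public static void main(String[] args) {
        // args arrive here from -Dexec.args= via the execute task shown below.
        SpannerOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(SpannerOptions.class);

        Pipeline pipeline = Pipeline.create(options);
        // ... add your transforms here ...
        pipeline.run();
    }
}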

build.gradle contents (NOTE: You need to populate your specific dependencies)

apply plugin: 'java'
apply plugin: 'maven'
apply plugin: 'application'

group = 'com.foo.bar'
version = '0.3'

mainClassName = System.getProperty("mainClass")

sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {
    maven { url "https://repository.apache.org/content/repositories/snapshots/" }
    maven { url "https://repo.maven.apache.org/maven2" }
}

dependencies {
    compile group: 'org.apache.beam', name: 'beam-sdks-java-core', version:'2.5.0'
    // Insert your build deps for your Beam Dataflow project here
    runtime group: 'org.apache.beam', name: 'beam-runners-direct-java', version:'2.5.0'
    runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version:'2.5.0'
}

task execute(type: JavaExec) {
    // Entry point comes from -DmainClass=..., matching mainClassName above.
    main = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    // Forward all -D system properties to the forked JVM.
    systemProperties System.getProperties()
    // Split -Dexec.args on whitespace; default to empty so a missing flag doesn't NPE.
    args System.getProperty("exec.args", "").split()
}

Explanation of build.gradle

  1. We use the task execute (type: JavaExec) so we can easily pass runtime flags into the Java Dataflow pipeline program. For example, we can specify the main class (since we have more than one pipeline in the same folder) and we can pass specific Dataflow arguments (i.e., specific PipelineOptions).

  2. The line of build.gradle that reads runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version:'2.5.0' is very important. It provides the Cloud Dataflow runner that allows you to execute pipelines on Google Cloud Platform; without it, --runner=DataflowRunner fails at startup because the runner class isn't on the classpath. The beam-runners-direct-java dependency plays the same role for local runs, as in the example below.
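
Since beam-runners-direct-java is also on the runtime classpath, the same task can run the pipeline locally by swapping the runner flag. A sketch, reusing the class name from the generic example above:

$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MyPipeline -Dexec.args="--runner=DirectRunner"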


There's absolutely nothing stopping you from writing your Dataflow application/pipeline in Java and using Gradle to build it.

Gradle will simply produce an application distribution (e.g. ./gradlew clean distTar), which you then extract and run with the template-creation parameters: --runner=TemplatingDataflowPipelineRunner --dataflowJobFile=gs://... on the old Dataflow 1.x SDK, or --runner=DataflowRunner --templateLocation=gs://... on Beam 2.x.

It's just a runnable Java application.
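
For concreteness, a sketch of that flow using the Beam 2.x flags, assuming the application plugin with a fixed mainClassName (the archive name, script name, project ID, and GCS paths derive from your own project and are placeholders):

$ ./gradlew clean distTar
$ tar -xf build/distributions/my-pipeline-0.3.tar
$ ./my-pipeline-0.3/bin/my-pipeline --runner=DataflowRunner --project=my-gcp-project --templateLocation=gs://my-bucket/templates/MyTemplate --gcpTempLocation=gs://my-bucket/tmpdataflow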

The template and all the binaries will then be uploaded to GCS, and you can execute the pipeline through the console, CLI or even Cloud Functions.
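
Via the CLI, for example, executing an uploaded template looks like this (gcloud dataflow jobs run is the standard command; the job name, bucket, and parameters are placeholders, and any option set at template run time must be declared as a ValueProvider in the pipeline):

$ gcloud dataflow jobs run my-job --gcs-location=gs://my-bucket/templates/MyTemplate --parameters=spannerInstanceId=fooInstance,spannerDatabaseId=barDatabase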

You don't even need to use Gradle. You could just run it locally and the template/binaries will be uploaded. But I'd imagine you are using a build server like Jenkins.

Maybe the Dataflow docs should read "Note: Template creation is currently limited to Java", because this feature is not available in the Python SDK yet.