How Do You Configure Maven to Use an Apache Parquet Plugin?

Problem scenario
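You want to use Apache Parquet in a Maven project alongside Hadoop.  Strictly speaking, Parquet is pulled in as a Maven dependency (the parquet-avro library) rather than as a Maven plugin.  How do you use these Apache technologies together?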

Solution
1.  Install HDFS.  See this link if you are using Ubuntu.  See this link if you are using a RedHat distribution of Linux. If you have more than one server and want a multi-node cluster of Hadoop, see this link for directions on how to deploy and configure it.

2.  Install Maven (e.g., run "sudo apt-get -y install maven" on Ubuntu, or see this post if you are using a RedHat derivative).

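3.  Run this command (feel free to replace "contint" and "cont-int" with names of your choice).  It creates a new directory named after the artifactId (cont-int here) inside your current directory, with a pom.xml file in it: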
mvn archetype:generate -DgroupId=com.contint.app -DartifactId=cont-int -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

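4.  Locate the pom.xml file that step #3 generated (you'll need its location for steps #5 and #6).  It is at cont-int/pom.xml, relative to the directory where you ran the command.  If you have lost track of that directory, you can search the whole filesystem, although this is slow and may also list unrelated pom.xml files: sudo find / -name pom.xml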
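
If you still know roughly where you ran step #3, a narrower search avoids scanning the whole disk and does not need sudo.  The sketch below is one way to do it; find_pom is a made-up helper name, and the depth limit of 2 assumes the project directory sits directly under the directory you pass in.

```shell
# Sketch only: find_pom is a hypothetical helper, not a Maven command.
# Lists pom.xml files at most two levels below the given directory
# (defaults to the current directory).
find_pom() {
  start_dir="${1:-.}"
  find "$start_dir" -maxdepth 2 -type f -name pom.xml
}
```

For example, "find_pom ." run from the directory used in step #3 should list ./cont-int/pom.xml.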

5.  Modify the file found above.  In pom.xml, directly after the last </dependency> closing tag (and before the </dependencies> closing tag), add these five lines to create an additional dependency element:
   <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-avro</artifactId>
      <version>1.9.0</version> <!-- or latest version -->
   </dependency>
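
If you would rather script this edit than open an editor, something like the sketch below may work.  It is only an illustration: add_parquet_dependency is a made-up helper name, and it assumes the first </dependencies> tag in the file closes the project's own dependency list, which should hold for a pom.xml freshly generated in step #3 but may not hold for a hand-written one.

```shell
# Sketch only: add_parquet_dependency is a hypothetical helper, not a Maven command.
# Assumes the first </dependencies> in the file closes the project's own
# <dependencies> section, as in a freshly generated quickstart pom.xml.
add_parquet_dependency() {
  pom_file="$1"
  parquet_version="${2:-1.9.0}"   # same version as step 5; pass another to override

  [ -f "$pom_file" ] || { echo "No such file: $pom_file" >&2; return 1; }

  # Do nothing if the dependency was already added.
  if grep -q '<artifactId>parquet-avro</artifactId>' "$pom_file"; then
    echo "parquet-avro is already listed in $pom_file"
    return 0
  fi

  # Keep a backup, then write the new element just before </dependencies>.
  cp "$pom_file" "$pom_file.bak" || return 1
  awk -v ver="$parquet_version" '
    /<\/dependencies>/ && !inserted {
      print "    <dependency>"
      print "      <groupId>org.apache.parquet</groupId>"
      print "      <artifactId>parquet-avro</artifactId>"
      print "      <version>" ver "</version>"
      print "    </dependency>"
      inserted = 1
    }
    { print }
  ' "$pom_file.bak" > "$pom_file"
}
```

For example: add_parquet_dependency cont-int/pom.xml 1.9.0.  The original file is kept as pom.xml.bak, so you can compare the two before building.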

6.  Change to the directory that contains the pom.xml file modified above, if you are not already there, then run this command: mvn package
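
To confirm that the build actually picked up Parquet, you could wrap the build in a small check like the one below.  Treat it as a sketch: build_and_check is a made-up helper name, and the exit codes 2 and 3 are arbitrary choices for this example.  The dependency:tree goal and its includes filter come from the standard Maven Dependency Plugin.

```shell
# Sketch only: build_and_check is a hypothetical helper, not a Maven command.
# Builds the project, then lists the Parquet artifacts Maven resolved
# and the jar files produced under target/.
build_and_check() {
  project_dir="${1:-.}"

  [ -f "$project_dir/pom.xml" ] || { echo "No pom.xml in $project_dir" >&2; return 2; }
  command -v mvn >/dev/null 2>&1 || { echo "mvn not found on PATH" >&2; return 3; }

  (
    cd "$project_dir" || exit 1
    mvn package &&
      mvn dependency:tree -Dincludes=org.apache.parquet &&
      ls target/*.jar
  )
}
```

For example: build_and_check cont-int.  If the dependency from step #5 was added correctly, the tree output should include an org.apache.parquet:parquet-avro entry.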
