public class SparkImporterUtils extends Object
| Modifier and Type | Method and Description |
|---|---|
| `<T> scala.collection.Seq<T>` | `asSeq(List<T> values)` Helper method implemented as per https://stackoverflow.com/questions/40741459/scala-collection-seq-doesnt-work-on-java |
| `static SparkImporterUtils` | `getInstance()` |
| `String` | `md5CecksumOfObject(Object obj)` |
| `org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `removeDuplicatedColumns(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataset)` |
| `org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `removeEmptyLinesAfterImport(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataset)` Removes lines with no process instance id. |
| `void` | `writeDatasetToCSV(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataSet, String subDirectory)` |
| `void` | `writeDatasetToParquet(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataSet, String subDirectory)` |
public static SparkImporterUtils getInstance()
public String md5CecksumOfObject(Object obj) throws IOException, NoSuchAlgorithmException
Throws:
IOException
NoSuchAlgorithmException

public void writeDatasetToParquet(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataSet,
String subDirectory)
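The documentation gives only the signature of `md5CecksumOfObject(Object)` (identifier spelling as published). A plausible sketch of such a helper, assuming it serializes the object with Java serialization and hashes the bytes with MD5 (an assumption, not the project's actual implementation):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5ChecksumSketch {

    // Hypothetical re-implementation: serialize the object (it must be
    // Serializable), hash the bytes with MD5, and return a lowercase hex string.
    public static String md5ChecksumOfObject(Object obj)
            throws IOException, NoSuchAlgorithmException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(obj);
        }
        byte[] digest = MessageDigest.getInstance("MD5").digest(baos.toByteArray());
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Equal objects serialize identically, so their checksums match.
        String a = md5ChecksumOfObject("process-instance-42");
        String b = md5ChecksumOfObject("process-instance-42");
        System.out.println(a.equals(b)); // true
        System.out.println(a.length());  // 32 (MD5 digest is 16 bytes -> 32 hex chars)
    }
}
```

Such a checksum is useful for detecting whether two objects (e.g. imported rows) carry identical content.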
public void writeDatasetToCSV(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataSet,
String subDirectory)
public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> removeDuplicatedColumns(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataset)
public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> removeEmptyLinesAfterImport(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> dataset)
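Per the summary above, `removeEmptyLinesAfterImport` removes lines with no process instance id. The real method operates on a Spark `Dataset<Row>` and cannot run standalone here; the following plain-Java sketch illustrates the same filtering idea over a hypothetical `ProcessRow` type (not part of this API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RemoveEmptyLinesSketch {

    // Hypothetical stand-in for a Spark Row that has a process instance id column.
    public record ProcessRow(String processInstanceId, String activity) {}

    // Keep only rows that actually carry a process instance id,
    // mirroring the idea behind removeEmptyLinesAfterImport.
    public static List<ProcessRow> removeEmptyLines(List<ProcessRow> rows) {
        return rows.stream()
                .filter(r -> r.processInstanceId() != null
                        && !r.processInstanceId().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ProcessRow> rows = Arrays.asList(
                new ProcessRow("pi-1", "startEvent"),
                new ProcessRow(null, "orphan"),
                new ProcessRow("", "orphan"),
                new ProcessRow("pi-2", "endEvent"));
        System.out.println(removeEmptyLines(rows).size()); // 2
    }
}
```

On an actual Spark `Dataset<Row>` the equivalent would be a `filter` on the id column, e.g. `dataset.filter(functions.col("processInstanceId").isNotNull())` (column name assumed for illustration).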
Parameters:
dataset - dataset to be cleaned

public <T> scala.collection.Seq<T> asSeq(List<T> values)
Type Parameters:
T - type of the objects to be converted
Parameters:
values - list to be converted to a Scala Seq

Copyright © 2018 viadee Unternehmensberatung AG. All rights reserved.