public abstract class HadoopFsRelation extends BaseRelation implements org.apache.spark.sql.execution.FileRelation, Logging
A BaseRelation that provides much of the common code required for relations that store their data to an HDFS-compatible filesystem.

For the read path, similar to PrunedFilteredScan, it can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects. In addition, when reading from Hive-style partitioned tables stored in filesystems, it is able to discover partitioning information from the paths of input directories and perform partition pruning before it starts reading the data. Subclasses of HadoopFsRelation must override one of the buildScan methods to implement the read path.

For the write path, it provides the ability to write to both non-partitioned and partitioned tables. The directory layout of partitioned tables is compatible with Hive.
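As an illustration of the points above, here is a minimal sketch of a subclass, assuming a hypothetical SimpleLineRelation that exposes each line of its input text files as a single string column. Only the HadoopFsRelation and BaseRelation members documented on this page (paths, sqlContext, dataSchema, buildScan, prepareJobForWrite) are real API; the class name, constructor shape, and file handling are illustrative, and real data sources would normally be instantiated through a HadoopFsRelationProvider.

```scala
// A minimal sketch, assuming a hypothetical SimpleLineRelation that reads plain
// text files as a single nullable string column. Everything except the documented
// HadoopFsRelation / BaseRelation members is illustrative.
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{HadoopFsRelation, OutputWriterFactory}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class SimpleLineRelation(
    override val paths: Array[String])(
    @transient val sqlContext: SQLContext)
  extends HadoopFsRelation {

  // Schema of the data files themselves; partition columns are handled separately.
  override def dataSchema: StructType =
    StructType(StructField("value", StringType, nullable = true) :: Nil)

  // Read path: turn the selected files into rows with one string column each.
  override def buildScan(inputFiles: Array[FileStatus]): RDD[Row] = {
    val files = inputFiles.map(_.getPath.toString)
    if (files.isEmpty) {
      sqlContext.sparkContext.emptyRDD[Row]
    } else {
      sqlContext.sparkContext.textFile(files.mkString(",")).map(Row(_))
    }
  }

  // Write path is sketched separately at the end of this page.
  override def prepareJobForWrite(job: Job): OutputWriterFactory =
    throw new UnsupportedOperationException("write path not implemented in this sketch")
}
```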
| Modifier and Type | Class and Description |
|---|---|
| static class | HadoopFsRelation.FakeFileStatus |
| static class | HadoopFsRelation.FakeFileStatus$ |
| Constructor and Description |
|---|
| HadoopFsRelation() |
| HadoopFsRelation(scala.collection.immutable.Map<java.lang.String,java.lang.String> parameters) |
| Modifier and Type | Method and Description |
|---|---|
| RDD<Row> | buildScan(org.apache.hadoop.fs.FileStatus[] inputFiles) - For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. |
| RDD<Row> | buildScan(java.lang.String[] requiredColumns, org.apache.hadoop.fs.FileStatus[] inputFiles) - For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. |
| RDD<Row> | buildScan(java.lang.String[] requiredColumns, Filter[] filters, org.apache.hadoop.fs.FileStatus[] inputFiles) - For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. |
| protected scala.collection.mutable.LinkedHashSet<org.apache.hadoop.fs.FileStatus> | cachedLeafStatuses() |
| abstract StructType | dataSchema() - Specifies the schema of actual data files. |
| java.lang.String[] | inputFiles() |
| static org.apache.hadoop.fs.FileStatus[] | listLeafFiles(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.FileStatus status) |
| static scala.collection.mutable.LinkedHashSet<org.apache.hadoop.fs.FileStatus> | listLeafFilesInParallel(java.lang.String[] paths, org.apache.hadoop.conf.Configuration hadoopConf, SparkContext sparkContext) |
| StructType | partitionColumns() - Partition columns. |
| abstract java.lang.String[] | paths() - Paths of this relation. |
| abstract OutputWriterFactory | prepareJobForWrite(org.apache.hadoop.mapreduce.Job job) - Prepares a write job and returns an OutputWriterFactory. |
| StructType | schema() - Schema of this relation. |
| static boolean | shouldFilterOut(java.lang.String pathName) - Checks if we should filter out this path name. |
| long | sizeInBytes() - Returns an estimated size of this relation in bytes. |
| java.lang.String | toString() |
| scala.Option<StructType> | userDefinedPartitionColumns() - Optional user-defined partition columns. |
Methods inherited from class BaseRelation: needConversion, sqlContext, unhandledFilters

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface org.apache.spark.Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public HadoopFsRelation()
public HadoopFsRelation(scala.collection.immutable.Map<java.lang.String,java.lang.String> parameters)
public static boolean shouldFilterOut(java.lang.String pathName)
public static org.apache.hadoop.fs.FileStatus[] listLeafFiles(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.FileStatus status)
public static scala.collection.mutable.LinkedHashSet<org.apache.hadoop.fs.FileStatus> listLeafFilesInParallel(java.lang.String[] paths, org.apache.hadoop.conf.Configuration hadoopConf, SparkContext sparkContext)
public java.lang.String toString()
Overrides: toString in class java.lang.Object
protected scala.collection.mutable.LinkedHashSet<org.apache.hadoop.fs.FileStatus> cachedLeafStatuses()
public abstract java.lang.String[] paths()
public java.lang.String[] inputFiles()
Specified by: inputFiles in interface org.apache.spark.sql.execution.FileRelation
public long sizeInBytes()
Description copied from class: BaseRelation
Returns an estimated size of this relation in bytes. Note that it is always better to overestimate size than to underestimate it, because underestimation could lead to suboptimal execution plans (e.g. broadcasting a very large table).
Overrides: sizeInBytes in class BaseRelation
public final StructType partitionColumns()
Partition columns. Can be either defined by userDefinedPartitionColumns or automatically discovered. Note that they should always be nullable.
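To make the discovery case concrete, here is an illustrative sketch; the directory layout and the resulting StructType below are assumptions, not the output of any particular relation.

```scala
// Illustrative only: for input paths laid out Hive-style, e.g.
//   hdfs://host/warehouse/events/year=2015/month=10/part-00000
//   hdfs://host/warehouse/events/year=2015/month=11/part-00000
// partition discovery would yield nullable partition columns along these lines:
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val discoveredPartitionColumns = StructType(Seq(
  StructField("year", IntegerType, nullable = true),
  StructField("month", IntegerType, nullable = true)))
// Supplying userDefinedPartitionColumns fixes the partition schema explicitly
// instead of relying on discovery.
```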
public scala.Option<StructType> userDefinedPartitionColumns()
Optional user-defined partition columns.
public StructType schema()
Schema of this relation. It consists of columns appearing in dataSchema and all partition columns not appearing in dataSchema.
Specified by: schema in class BaseRelation
public abstract StructType dataSchema()
Specifies the schema of actual data files. For partitioned relations, if one or more partitioned columns are contained in the data files, they should also appear in dataSchema.
public RDD<Row> buildScan(org.apache.hadoop.fs.FileStatus[] inputFiles)
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.
Parameters:
inputFiles - For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.
public RDD<Row> buildScan(java.lang.String[] requiredColumns, org.apache.hadoop.fs.FileStatus[] inputFiles)
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.
Parameters:
requiredColumns - Required columns.
inputFiles - For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.
public RDD<Row> buildScan(java.lang.String[] requiredColumns, Filter[] filters, org.apache.hadoop.fs.FileStatus[] inputFiles)
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.
Parameters:
requiredColumns - Required columns.
filters - Candidate filters to be pushed down. The actual filter should be the conjunction of all filters. The pushed-down filters are currently purely an optimization, as they will all be evaluated again. This means it is safe to use them with methods that produce false positives, such as filtering partitions based on a Bloom filter.
inputFiles - For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.
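As a hedged sketch of how a subclass might honour these arguments, the hypothetical PrunedSimpleLineRelation below extends the SimpleLineRelation sketch from the class description; the same column-pruning idea applies to the two-argument overload above. Only the buildScan signature and the Filter/EqualTo types come from the documented API, and the filters are treated as a best-effort hint only, which is safe because Spark SQL evaluates the full conjunction again on the returned rows.

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{EqualTo, Filter}

class PrunedSimpleLineRelation(
    paths: Array[String])(
    sqlContext: SQLContext)
  extends SimpleLineRelation(paths)(sqlContext) {

  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputFiles: Array[FileStatus]): RDD[Row] = {
    // Only a single "value" column exists in this toy schema, so column pruning
    // reduces to keeping that column or returning empty rows (e.g. for COUNT(*)).
    val keepValue = requiredColumns.contains("value")
    // Apply an over-inclusive approximation of the pushed equality filters; ignoring
    // the rest is safe because Spark SQL re-evaluates every filter on the result.
    val pushed = filters.collect { case EqualTo("value", v: String) => v }.toSet
    buildScan(inputFiles)
      .filter(row => pushed.isEmpty || pushed.contains(row.getString(0)))
      .map(row => if (keepValue) row else Row.empty)
  }
}
```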
public abstract OutputWriterFactory prepareJobForWrite(org.apache.hadoop.mapreduce.Job job)
Prepares a write job and returns an OutputWriterFactory. Client-side job preparation can be put here. For example, a user-defined output committer can be configured here by setting the output committer class under the spark.sql.sources.outputCommitterClass key of the job configuration.
Note that the only side effect expected here is mutating job via its setters. In particular, because Spark SQL caches BaseRelation instances for performance, mutating relation-internal state may cause unexpected behavior.
Parameters:
job - (undocumented)
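For illustration, the following hedged sketch adds a write path to the hypothetical SimpleLineRelation from the class description. SimpleLineWriterFactory, SimpleLineWriter, and WritableSimpleLineRelation are assumed names, the writer is a stub rather than a working file format, and FileOutputCommitter is used only as an example value for the spark.sql.sources.outputCommitterClass key mentioned above.

```scala
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

// Factory returned from prepareJobForWrite on the driver; OutputWriters are then
// created from it on the executor side.
class SimpleLineWriterFactory extends OutputWriterFactory {
  override def newInstance(
      path: String,
      dataSchema: StructType,
      context: TaskAttemptContext): OutputWriter =
    new SimpleLineWriter(path, context)
}

// Stub writer: a real implementation would wrap a Hadoop RecordWriter for `path`.
class SimpleLineWriter(path: String, context: TaskAttemptContext) extends OutputWriter {
  override def write(row: Row): Unit = ()
  override def close(): Unit = ()
}

class WritableSimpleLineRelation(
    paths: Array[String])(
    sqlContext: SQLContext)
  extends SimpleLineRelation(paths)(sqlContext) {

  override def prepareJobForWrite(job: Job): OutputWriterFactory = {
    // The only expected side effect is mutating `job` (here, its configuration).
    job.getConfiguration.set(
      "spark.sql.sources.outputCommitterClass", classOf[FileOutputCommitter].getName)
    new SimpleLineWriterFactory
  }
}
```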