public class SortMemoryManager extends Object
First, it is necessary to differentiate two concepts:
As a result, we can never be sure of the amount of memory needed for a batch. So, we have to estimate based on a number of factors:
RecordBatchSizer to estimate the data size and
buffer size of each incoming batch.
The sort has two key configured parameters: the spill file size and the size of the output (downstream) batch. The spill file size is chosen to be large enough to ensure efficient I/O, but not so large as to overwhelm any one spill directory. The output batch size is chosen to be large enough to amortize the per-batch overhead over the maximum number of records, but not so large as to overwhelm downstream operators. Setting these parameters is a judgment call.
Under limited memory, the above sizes may be too big for the space available. For example, the default spill file size is 256 MB. But, if the sort is only given 50 MB, then spill files will be smaller. The default output batch size is 16 MB, but if the sort is given only 20 MB, then the output batch must be smaller. The low memory logic starts with the memory available and works backwards to figure out spill batch size, output batch size and spill file size. The sizes will be smaller than optimal, but as large as will fit in the memory provided.
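The "start with the memory available and work backwards" idea can be sketched as follows. This is a minimal illustration, not the logic this class actually uses: the class `LowMemorySizingSketch`, both method names, and the two capping rules (a spill file no larger than the memory limit, an output batch no larger than half of it) are assumptions made for the example. Only the 256 MB and 16 MB defaults and the 50 MB and 20 MB limits come from the text above.

```java
final class LowMemorySizingSketch {
  static final long MB = 1024L * 1024L;

  // Hypothetical rule: a spill file cannot hold more than fits in memory.
  static long spillFileSize(long configuredSize, long memoryLimit) {
    return Math.min(configuredSize, memoryLimit);
  }

  // Hypothetical rule: give the output batch at most half of memory,
  // leaving the remainder for the spill batches being merged.
  static long outputBatchSize(long configuredSize, long memoryLimit) {
    return Math.min(configuredSize, memoryLimit / 2);
  }

  public static void main(String[] args) {
    // Default 256 MB spill file, but only 50 MB of memory.
    System.out.println(spillFileSize(256 * MB, 50 * MB) / MB);   // prints 50
    // Default 16 MB output batch, but only 20 MB of memory.
    System.out.println(outputBatchSize(16 * MB, 20 * MB) / MB);  // prints 10
  }
}
```

With ample memory both methods return the configured size unchanged, which matches the statement that the reduced sizes apply only under limited memory.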
| Modifier and Type | Class and Description |
|---|---|
| static class | SortMemoryManager.BatchSizeEstimate |
| static class | SortMemoryManager.MergeAction |
| static class | SortMemoryManager.MergeTask |
| Modifier and Type | Field and Description |
|---|---|
| static double | BUFFER_FROM_PAYLOAD<br>Given a data size, this is the multiplier to create the buffer size estimate. |
| static double | INTERNAL_FRAGMENTATION_ESTIMATE<br>Estimate for typical internal fragmentation in a buffer due to power-of-two rounding on vectors. |
| static double | LOW_MEMORY_MERGE_BATCH_RATIO |
| static int | MIN_ROWS_PER_SORT_BATCH<br>Desperate attempt to keep spill batches from being too small in low memory. |
| static double | PAYLOAD_FROM_BUFFER<br>Given a buffer, this is the assumed amount of space available for data. |
| static double | WORST_CASE_BUFFER_RATIO<br>On really bad days, we will add one more byte (or value) to a vector than fits in a power-of-two sized buffer, forcing a doubling. |
| Constructor and Description |
|---|
| SortMemoryManager(SortConfig config, long opMemoryLimit) |
| Modifier and Type | Method and Description |
|---|---|
| SortMemoryManager.MergeTask | consolidateBatches(long allocMemory, int inMemCount, int spilledRunsCount)<br>Choose a consolidation option during the merge phase depending on memory available. |
| long | freeMemory(long allocatedBytes) |
| long | getBufferMemoryLimit() |
| SortMemoryManager.BatchSizeEstimate | getInputBatchSize() |
| long | getMemoryLimit() |
| int | getMergeBatchRowCount() |
| SortMemoryManager.BatchSizeEstimate | getMergeBatchSize() |
| long | getMergeMemoryLimit() |
| int | getPreferredMergeBatchSize() |
| int | getPreferredSpillBatchSize() |
| int | getRowWidth() |
| int | getSpillBatchRowCount() |
| SortMemoryManager.BatchSizeEstimate | getSpillBatchSize() |
| boolean | hasMemoryMergeCapacity(long allocatedBytes, long neededForInMemorySort) |
| boolean | hasPerformanceWarning() |
| boolean | isLowMemory() |
| boolean | isSpillNeeded(long allocatedBytes, long incomingSize) |
| boolean | mayOverflow() |
| static int | multiply(int byteSize, double multiplier) |
| boolean | updateEstimates(int batchDataSize, int batchRowWidth, int batchRowCount)<br>Update the data-driven memory use numbers including: the average size of incoming records; the estimated spill and output batch size; the estimated number of average-size records per spill and output batch; the amount of memory set aside to hold the incoming batches before spilling starts. |
public static final double INTERNAL_FRAGMENTATION_ESTIMATE
[____|__$__]
In the above, the brackets represent the whole vector. The first half is always full. The $ represents the end of data. When the first half filled up, the second half was allocated. On average, the second half will be half full. This means that, on average, 1/4 of the allocated space is unused (the definition of internal fragmentation).
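The 1/4 figure can be checked numerically. The sketch below assumes that buffers are rounded up to the next power of two and that the data size is equally likely to fall anywhere in the second half of the buffer; the class `FragmentationSketch` and its methods are hypothetical names for this illustration and are not part of this API.

```java
final class FragmentationSketch {
  // Smallest power of two that is >= n, for n >= 1.
  static long roundUpPowerOfTwo(long n) {
    long highest = Long.highestOneBit(n);
    return highest == n ? n : highest << 1;
  }

  // Average unused fraction of the allocated buffer, taken over every
  // data size that ends in the second half of a buffer of bufferSize.
  static double averageUnusedFraction(long bufferSize) {
    double total = 0;
    long count = 0;
    for (long dataSize = bufferSize / 2 + 1; dataSize <= bufferSize; dataSize++) {
      long allocated = roundUpPowerOfTwo(dataSize);
      total += (double) (allocated - dataSize) / allocated;
      count++;
    }
    return total / count;
  }

  public static void main(String[] args) {
    // Prints a value close to 0.25 for a power-of-two buffer size.
    System.out.println(averageUnusedFraction(1024));
  }
}
```

Under these assumptions the payload is about 3/4 of the buffer on average, which is the kind of ratio the payload and buffer multipliers in this class are meant to capture.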
public static final double PAYLOAD_FROM_BUFFER
public static final double BUFFER_FROM_PAYLOAD
public static final double WORST_CASE_BUFFER_RATIO
public static final int MIN_ROWS_PER_SORT_BATCH
The number is also used for logging: the system will log a warning if batches fall below this number, which may indicate that too little memory has been allocated for the job at hand. (Queries operate on big data: many records. Batches with too few records are a probable performance hit. But what is too few? It is a judgment call.)
public static final double LOW_MEMORY_MERGE_BATCH_RATIO
public SortMemoryManager(SortConfig config, long opMemoryLimit)
public boolean updateEstimates(int batchDataSize,
int batchRowWidth,
int batchRowCount)
Under normal circumstances, the amount of memory available is much larger than the input, spill or merge batch sizes. The primary question is to determine how many input batches we can buffer during the load phase, and how many spill batches we can merge during the merge phase.
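The two questions above reduce to simple divisions once the batch sizes are known. The following is a minimal sketch under stated assumptions, not the calculation performed by `updateEstimates`: the class `BatchCapacitySketch`, its method names, and the choice to reserve one spill batch during loading and one output batch during merging are all hypothetical.

```java
final class BatchCapacitySketch {
  // Load phase: input batches that fit after reserving room
  // for one spill batch (assumed reservation).
  static int bufferableInputBatches(long memoryLimit, long spillBatchSize, long inputBatchSize) {
    return (int) Math.max(0, (memoryLimit - spillBatchSize) / inputBatchSize);
  }

  // Merge phase: spill batches that can be read at once after reserving
  // room for one output batch (assumed reservation).
  static int mergeableSpillBatches(long memoryLimit, long outputBatchSize, long spillBatchSize) {
    return (int) Math.max(0, (memoryLimit - outputBatchSize) / spillBatchSize);
  }

  public static void main(String[] args) {
    System.out.println(bufferableInputBatches(100, 10, 9));  // prints 10
    System.out.println(mergeableSpillBatches(100, 16, 8));   // prints 10
  }
}
```

When memory is smaller than the reserved batch, both methods return 0, which is the situation the low-memory logic has to handle separately.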
batchDataSize - the overall size of the current batch received from upstream
batchRowWidth - the average width in bytes (including overhead) of rows in the current input batch
batchRowCount - the number of actual (not filtered) records in that upstream batch
public SortMemoryManager.MergeTask consolidateBatches(long allocMemory, int inMemCount, int spilledRunsCount)
Logic is here (returning an enum) rather than in the merge code to allow unit testing without actually needing batches in memory.
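The design choice described above, a decision that depends only on counts and byte totals, can be illustrated with a small sketch. Everything here is hypothetical: the class `MergeDecisionSketch`, the enum values, the extra `memoryLimit` and `spillBatchSize` parameters, and the decision rules are assumptions for the example and do not describe what `consolidateBatches` actually returns.

```java
final class MergeDecisionSketch {
  // Hypothetical outcomes; the real MergeAction values may differ.
  enum Action { FINAL_MERGE, SPILL_IN_MEMORY, CONSOLIDATE_RUNS }

  // A pure function of numbers: no record batches are needed to call it.
  static Action decide(long memoryLimit, long allocMemory, int inMemCount,
                       int spilledRunsCount, long spillBatchSize) {
    long free = memoryLimit - allocMemory;
    // Assumed requirement: one spill batch in memory per run being merged.
    long needed = (long) spilledRunsCount * spillBatchSize;
    if (needed <= free) {
      return Action.FINAL_MERGE;
    }
    if (inMemCount > 0) {
      // In-memory batches hold memory that the merge needs.
      return Action.SPILL_IN_MEMORY;
    }
    // Nothing left to spill: merge some runs into one to reduce the count.
    return Action.CONSOLIDATE_RUNS;
  }

  public static void main(String[] args) {
    System.out.println(decide(100, 20, 3, 4, 10));  // prints FINAL_MERGE
    System.out.println(decide(100, 90, 3, 4, 10));  // prints SPILL_IN_MEMORY
    System.out.println(decide(30, 0, 0, 4, 10));    // prints CONSOLIDATE_RUNS
  }
}
```

Because the method takes only primitive values and returns an enum, a unit test can cover each branch with plain numbers, which is the benefit the description points to.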
allocMemory - amount of memory currently allocated (this class knows the total memory available)
inMemCount - number of incoming batches in memory (the number is important, not the in-memory size; we get the memory size from allocMemory)
spilledRunsCount - the number of runs sitting on disk to be merged
public static int multiply(int byteSize,
double multiplier)
public boolean isSpillNeeded(long allocatedBytes,
long incomingSize)
public boolean hasMemoryMergeCapacity(long allocatedBytes,
long neededForInMemorySort)
public long freeMemory(long allocatedBytes)
public long getMergeMemoryLimit()
public int getSpillBatchRowCount()
public int getMergeBatchRowCount()
public long getMemoryLimit()
public int getRowWidth()
public SortMemoryManager.BatchSizeEstimate getInputBatchSize()
public int getPreferredSpillBatchSize()
public int getPreferredMergeBatchSize()
public SortMemoryManager.BatchSizeEstimate getSpillBatchSize()
public SortMemoryManager.BatchSizeEstimate getMergeBatchSize()
public long getBufferMemoryLimit()
public boolean mayOverflow()
public boolean isLowMemory()
public boolean hasPerformanceWarning()
Copyright © 2017 The Apache Software Foundation. All rights reserved.