org.apache.spark.util.collection
Insert the given key and value into the map.
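For illustration, a minimal word-count-style sketch of constructing the map and inserting values. The three combiner functions correspond to the constructor's parameters; the example itself is hypothetical and assumes a live Spark environment (the constructor's default serializer and block manager come from SparkEnv):

    import org.apache.spark.util.collection.ExternalAppendOnlyMap

    // Word-count-style combiners: keys are words, values are 1s, combiners are counts.
    val counts = new ExternalAppendOnlyMap[String, Int, Int](
      createCombiner = (v: Int) => v,                 // first value seen for a new key
      mergeValue = (c: Int, v: Int) => c + v,         // fold a value into a combiner
      mergeCombiners = (c1: Int, c2: Int) => c1 + c2  // merge combiners across spills
    )

    counts.insert("apple", 1)
    counts.insert("apple", 1)  // merged into the existing combiner for "apple"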
Insert the given iterable of keys and values into the map.
When the underlying map needs to grow, check if the global pool of shuffle memory has enough room for this to happen. If so, allocate the memory required to grow the map; otherwise, spill the in-memory map to disk.
The shuffle memory usage of the first trackMemoryThreshold entries is not tracked.
Insert the given iterator of keys and values into the map.
When the underlying map needs to grow, check if the global pool of shuffle memory has enough room for this to happen. If so, allocate the memory required to grow the map; otherwise, spill the in-memory map to disk.
The shuffle memory usage of the first trackMemoryThreshold entries is not tracked.
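The grow-or-spill decision described for both insertAll variants can be pictured roughly as follows. This is an illustrative sketch only; every name in it (maybeSpill, requestShuffleMemory, and so on) is hypothetical, not the class's actual private API:

    // Illustrative sketch of the grow-or-spill check (hypothetical names).
    def maybeSpill(
        currentSize: Long,                   // estimated in-memory map size
        myMemoryThreshold: Long,             // memory this task already holds
        requestShuffleMemory: Long => Long,  // ask the global pool, returns the grant
        spill: () => Unit                    // write sorted contents to disk
    ): Long = {
      if (currentSize < myMemoryThreshold) {
        myMemoryThreshold                    // still within budget: no action needed
      } else {
        // Try to double the footprint by claiming more of the global shuffle pool.
        val amountToRequest = 2 * currentSize - myMemoryThreshold
        val granted = requestShuffleMemory(amountToRequest)
        if (granted < amountToRequest) {
          spill()                            // pool too crowded: spill to disk
          0L                                 // budget resets after a spill
        } else {
          myMemoryThreshold + granted        // room to grow in memory
        }
      }
    }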
Return an iterator that merges the in-memory map with the spilled maps. If no spill has occurred, simply return the in-memory map's iterator.
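Continuing the hypothetical word-count sketch above, consuming the map looks like this:

    // Iterate over merged (key, combiner) pairs. If spills occurred, this
    // merge-sorts the in-memory map with the on-disk maps; otherwise it is
    // just the in-memory map's iterator.
    counts.iterator.foreach { case (word, count) =>
      println(s"$word -> $count")
    }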
:: DeveloperApi :: An append-only map that spills sorted content to disk when there is insufficient space for it to grow.
This map takes two passes over the data:
(1) Values are merged into combiners, which are sorted and spilled to disk as necessary.
(2) Combiners are read from disk and merged together.
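Pass (2) is essentially a k-way merge of sorted streams. A self-contained sketch of that idea, using a priority queue keyed on the hash of each stream's next key; the stream types here are simplified stand-ins for the class's internal spill readers, and the hash-collision bookkeeping the real class performs is omitted for brevity:

    import scala.collection.mutable

    // Simplified k-way merge of (key, combiner) streams sorted by key hash.
    def mergeSorted[K, C](
        streams: Seq[Iterator[(K, C)]],
        mergeCombiners: (C, C) => C): Iterator[(K, C)] = {
      type Buf = BufferedIterator[(K, C)]
      // Min-heap ordered by the hash of each stream's head key.
      val heap = mutable.PriorityQueue.empty[Buf](
        Ordering.by((it: Buf) => it.head._1.hashCode).reverse)
      streams.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

      new Iterator[(K, C)] {
        def hasNext: Boolean = heap.nonEmpty
        def next(): (K, C) = {
          val stream = heap.dequeue()
          var (key, combiner) = stream.next()
          if (stream.hasNext) heap.enqueue(stream)
          // Merge entries from other streams that carry the same key.
          while (heap.nonEmpty && heap.head.head._1 == key) {
            val other = heap.dequeue()
            combiner = mergeCombiners(combiner, other.next()._2)
            if (other.hasNext) heap.enqueue(other)
          }
          (key, combiner)
        }
      }
    }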
The setting of the spill threshold faces the following trade-off: If the spill threshold is too high, the in-memory map may occupy more memory than is available, resulting in OOM. However, if the spill threshold is too low, we spill frequently and incur unnecessary disk writes. This may lead to a performance regression compared to the normal case of using the non-spilling AppendOnlyMap.
Two parameters control the memory threshold:

spark.shuffle.memoryFraction specifies the collective amount of memory used for storing these maps as a fraction of the executor's total memory. Since each concurrently running task maintains one map, the actual threshold for each map is this quantity divided by the number of running tasks.

spark.shuffle.safetyFraction specifies an additional margin of safety as a fraction of this threshold, in case map size estimation is not sufficiently accurate.
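As a sketch, these keys would be supplied through SparkConf. The values below are purely illustrative (defaults differed by release), and both keys belong to the legacy shuffle memory model that later Spark versions replaced with unified memory settings:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.3")  // 30% of executor memory for these maps
      .set("spark.shuffle.safetyFraction", "0.8")  // use only 80% of that budget

    // Effective pool = 0.3 * 0.8 = 24% of executor memory; with N tasks
    // running concurrently, each map's threshold is roughly 0.24 / N.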