Skip to content

15 External Sorting

1 Introduction#

Quicksort on a disk is slow because of random access. Use Merge-sort to optimize.

1.1 Terms#

  • simplification
    • 将数据存在 tape 上,只能顺序访问
    • 至少有 3 个 tape drives
  • run: 归并子序列
  • pass: 初始创建归并段或者归并段大小增加一倍称为一个 pass
  • \(\text{\#passes}=1+\lceil \log_{2} (N/M) \rceil\)
    • 有序归并段长度指数增长,所以有 \(\lceil \log_{2}(N/M) \rceil\)
    • 另外还有第一次排序

__assets/ADS 15 External Sorting/IMG-ADS 15 External Sorting-20241216134907423.webp

  • efficiency concerns
    • #passes
    • parallelism
    • block

2 \(k\)-way Merge#

\(k\)-way,表示一次合并有 \(k\) 个指针

  • 用 minheap 处理每个 way 的当前最小元素,每次写入最小的元素
  • 至少需要 \(2k\) tape,同一时刻有 \(k\) 条读、\(k\) 条写

2.1 Less tapes: Polyphase merge#

  • 2-way 3 tapes
    • 如果每次二等分操作,\(\text{\#passes}=1+2\log_{2}\lceil N/M \rceil\),因为可能出现需要一直
    • 如果不均匀分割,最优方法是近似成 fibonacci,\(\text{\#passes}=1+\log_{1.618}\lceil N/M \rceil\)
  • \(k\)-way \(k+1\) tapes
    • \(k\) 阶 fibonacci 序列:\(F_{N}=\sum_{i=N-k}^{N-1}F_{i}\),而且 \(F_{k-1}=1, F_{i}=0\text{ for } i <k-1\)

2.2 Buffers#

  • internal memory 分成 input buffer 和 output buffer,最小划分单位是 block
  • 每当一个 output buffer 写满,则输出到 disk
  • 双缓冲
    • 给 output buffer 分配一个备份,output 不会打断排序
    • 给 input buffer 也分配一个备份,input 不会打断排序
    • 所以一共 \(k\)-way 需要 \(2k+2\) 个 buffer

2.3 Longer run: Replacement selection#

__assets/ADS 15 External Sorting/IMG-ADS 15 External Sorting-20241216144413790.webp

输出时,在 heap 中继续放入下一个数,直到这个数比上一个输出的小,标记一个 cut

期望 run 长度为 \(2M\)

2.3.1 Minimize merge time#

__assets/ADS 15 External Sorting/IMG-ADS 15 External Sorting-20241216150417826.webp

永远只合并最小的,所以形成 Huffman Tree 的结构

3 Discussion#

3.1 Tournament Trees#

Question

Tournament tree is a complete binary tree with n external nodes and n−1 internal nodes. The external nodes represent the players and internal nodes represent the winner/loser of the match between the two players.

There are mainly two type of tournament tree:

  • Winner tree: each node represents the winner. The final winner is represented by the root.
  • Loser tree: The loser of the match is stored in internal nodes of the tree. But the overall winner of the tournament is stored at tree[0].

Tournament tree may be used in k-way merges.

What are the differences between winner and loser trees? Which one would you prefer for k-way merge? Why?

The winner tree stores the winner in internal nodes, while loser tree stores the loser in internal nodes. The loser tree is better for k-way merge. When the element on a tape is updated, all the necessary elements to compare is stored in its way to path, thus eliminating unnecessary comparisons.

3.2 The expected run length#

Question

Suppose that the internal memory can handle \(M = 7\) records at a time. Given the input sequence { 19, 12, 25, 31, 56, 21, 40, 16, 29, 14, 35, 23 }. What are the runs generated by the replacement selection?

What is the best case? How about the worst case?

Why is the expected length of a run generated by replacement selection \(2M\) (where \(M\) is the internal memory size)?

{12, 16, 19, 21, 25, 29, 31, 35, 40, 56}, {14, 23}

  • Worst case: the input sequence is sorted in reversed order, all runs have the length of \(M\).
  • Best case: the input sequence is already sorted, only 1 run is generated, and the sorting is done.

For a memory with the size off \(M\) records, the expected times that any record in the memory can be replaced in this method is \(1/0.5=2\). Thus, the expected length of a run is \(2M\).

4 Questions#

4.1 HW15#

4.1.1 One tape drive#

If only one tape drive is available to perform the external sorting, then the tape access time for any algorithm will be \(Ω(N^2)\).

Answer

T, 寻道时间从 \(\Omega (1)\)\(\Omega (N)\)

4.2 Parallel cycles#

Suppose we have the internal memory that can handle 12 numbers at a time, and the following two runs on the tapes:

Run 1: 1, 3, 5, 7, 8, 9, 10, 12

Run 2: 2, 4, 6, 15, 20, 25, 30, 32

Use 2-way merge with 4 input buffers and 2 output buffers for parallel operations. Which of the following three operations are NOT done in parallel?

A. 1 and 2 are written onto the third tape; 3 and 4 are merged into an output buffer; 6 and 15 are read into an input buffer B. 3 and 4 are written onto the third tape; 5 and 6 are merged into an output buffer; 8 and 9 are read into an input buffer C. 5 and 6 are written onto the third tape; 7 and 8 are merged into an output buffer; 20 and 25 are read into an input buffer D. 7 and 8 are written onto the third tape; 9 and 15 are merged into an output buffer; 10 and 12 are read into an input buffer

Answer

D 因为这个时候,第一个 way 原有的 (_, 7), (8, 9)(7, 8) 被 merge 了,但是仍然要等待 (10, 12) 在下一个周期中加载,再下一个周期才能继续 merge (9, 10)

4.3 Ex15#

4.3.1 To sort \(N\) numbers by...#

__assets/ADS 15 External Sorting/IMG-ADS 15 External Sorting-20241223161801827.webp

Answer

A 引入败者树后,比较和归并的次数与 \(k\) 无关!

Comments