Ch.06 Sorting

1 Preliminaries#

void X_Sort ( ElementType A[], int N )

N must be a legal integer
Assume integer array for the sake of simplicity
'>' and '<' operators exist and are the only operations allowed on the input data
internal sorting 内部排序

2 Insertion Sort#

void InsertionSort( ElementType A[], int N)
{
    int j, P;
    ElementType Tmp;

    for(P = 1; P < N; P++){
        Tmp = A[P];
        for(j = P; j > 0 && A[j-1] > Tmp; j--)
            A[j] = A[j-1];   // move backwards
        A[j] = Tmp;
    }
}

worst case, decending sequence, \(T=O(N^2)\)
best case, \(T=O(N)\)

3 A Lower Bound for Simple Sorting Algortihms#

An Inversion 逆序对，index 大小和值的大小相反，类比逆序数问题，假设逆序数为 n
There are \(n\) swaps needed to sort this list by insertion sort
\(T(N, I)=O(I+N)\) \(I\)
- 至少需要数组过一遍 \(N\)
- 每个逆序对都需要 swap \(I\)
- 如果本来就接近于排好了，那么接近于 \(O(N)\) Linear
The average number of inversion in an random array of N distinct numbers is \(N(N-1)/4\)
Any sorting algorithm that sorts by exchanging adjacent elements requires \(\Omega(N^2)\) time on average

3.1 Improvement#

每次排序交换相隔比较远的两个元素

4 Shellsort#

__assets/Ch.06 Sorting/IMG-Ch.06 Sorting-20250125004810877.webp

定义序列 \(h\)，然后进行多次排序，每次 \(h\) 减小，直到最后 \(h=1\)

4.1 Naive Shellsort#

\(h_t=\lfloor N/2\rfloor,\,h_k=\lfloor h_{k+1}/2\rfloor\)

void Shellsort( ElementType A[], int N )
{
    int i, j, increment;
    ElementType Tmp;
    for(increment = N / 2; increment > 0; increment /= 2)
        /*h sequence*/
        for(i = increment; i < N; i++){
            Tmp = A[i];
            for(j = i; j >= increment; j -= increment)
        }
}

4.1.1 Worst-Case Analysis#

\(Theorem\): The worst-case running time of Shellsort is \(\Theta(N^2)\)
bad-case可能出现前面 \(h>1\) 排完都不变，最后一次排才排好的情况
sequence 选择素数会有更好的效果

4.2 Hibbard's Increment Sequence#

\(h_k=2^k-1\)
\(Theorem\): The worst-case running time of Shellsort, using Hibbard's increments, is \(\Theta(N^{3/2})\)
\(Conjectures\)
- \(\overline T_{Hibbard}(N)=O(N^{5/4})\)
- Sedgewick's best sequence is \(\{9*4^i-9*2^i+1\} \cup \{4^i-3*2^i+1\}\)
  - \(T_{avg}(N)=O(N^{7/6})\) \(T_{worst}(N)=O(N^{4/3})\)

5 Heapsort#

如果不需要知道所有的顺序，只需要知道最大/最小的几个值，那么 Heapsort 更快

5.1 Algorithm 1#

Algorithm1
{
    BuildHeap( H );   /*O(N)*/
    for(i = 0; i < N; i++)
        TmpH[i] = DeleteMin( H );   /*O( log N )*/
    for(i = 0; i < N; i++)
        H[i] = TmpH[i];   /*O(1)*/
}

\(T(N)=O(N\log N)\)
con: The space requirement is doubled.

5.2 Algorithm 2#

不如直接将 deleteMax 的结果写道数组的后面去

void Heapsort( ElementType A[], int N )
{
    int i;
    for(i = N / 2; i >= 0; i--) /*Build Heap*/
        PercDown(A, i, N);
    for(i = N - 1; i > 0; i--){
        Swap(&A[0], &A[i]); /*DeleteMax*/
        PercDown(A, 0, i);
    }
}

\(Note\) that A[0] is a valid entry 第零个也是要排序的有效元素
对于任意的数组，平均比较次数 \(2N\log N-O(N\log\log N)\)
\(Note\) Although Heapsort gives the best average time, in practice it is slower than a version of Shellsort that uses Sedgewick's increment sequence 因为常数比较大

6 Mergesort#

6.1 Merge two sorted lists#

\(T=O(N)\)
两个 pointer，小的放入新数组，并且 ptr 右移一位，直到 ptr 都到达末尾

6.2 Mergesort#

void MSort( ElementType A[], ElementType TmpArray[], int Left, int Right )
{
    int Center;
    if(Left < Right){  // if there are elements to e sorted
        Center = (Left + Right) / 2;
        MSort(A, TmpArray, Left, Center);   // T(N/2)
        MSort(A, TmpArray, Center + 1, Right);   // T(N/2)
        Merge(A, TmpArray, Left, Center + 1, Right);   // O(N)
    }
}

void Mergesort( ElementType A[], int N )
{
    ElementType *TmpArray;   // need O(N) extra space
    TmpArray = malloc(N * sizeof(ElementType));
    if(TmpArray != NULL){
        Msort(A, TmpArray, 0, N-1);
        free(TmpArray);
    }
    else FatalError("No space for tmp array!!");
}

为什么要在外部定义 TmpArray
- 减少内存申请和释放
- 减少空间开销，递归深度 \(\log N\)，每层都要 \(N\) 长度数组，\(S(N)=O(N\log N)\)

void Merge( ElementType A[], ElementType TmpArray[], int Lpos, int Rpos, int RightEnd )
{
    int i, LeftEnd, NumElements, TmpPos;
    LeftEnd = Rpos - 1;
    TmpPos = Lpos;
    NumElements = RightEnd - Lpos + 1;
    while(Lpos <= LeftEnd && Rpos <= RightEnd)  // main loop
        if(A[Lpos] <= A[Rpos])
            TmpArray[TmpPos++] = A[Lpos++];
        else
            TmpArray[TmpPos++] = A[Rpos++];
    while(Lpos <= LeftEnd)   // copy rest of first half
        TmpArray[TmpPos++] = A[Lpos++];
    while(Rpos <= RightEnd)   // copy rest of second half
        TmpArray[TmpPos++] = A[Rpos++];
    for(i = 0; i < NumElements; i++, RightEnd--)
        // copy TmpArray back
        A[RightEnd] = TmpArray[RightEnd];
}

\[ \begin{aligned} T(1)&=1\\ T(N)&=2T(N/2)+O(N)\\ &=2^kT(N/2^k)+k*O(N)\\ &=N*T(1)+\log N*O(N)\\ &=O(N+N\log N) \end{aligned} \]

pro: 适用于 外部排序
cons
- 需要更多空间 \(S(N)=O(N)\)
- 数组的复制时很慢的

7 Quicksort#

7.1 The Algorithm#

void Quicksort ( ElementType A[], int N)
{
    if(N < 2) return;
    pivot = pick any element in A[];
    Partition S = {A[] \ pivot} into two disjoint sets:
        A1 = { a \in S | a <= pivot} and A2 = { a \in S | a >= pivot};
    A = Quicksort(A1, N1) \union { pivot } \union Quicksort(A2, N2);
}

Complexity
- Best case: \(T(N)=O(N\log N)\)
  - 每次 pivot 的选择都是中位数
- Average case: \(T(N)=O(N\log N)\)
Property
- 每次的 pivot 在后续的排序中位置不会变化，快的原因
- pivot 选择和 partition 是容易错的地方

7.2 Picking the Pivot#

A Wrong Way pivot = A[0]
- Worst case: if A[] is presorted, \(O(N^2)\)
A Safe Maneuver pivot = random select from A[]
- 随机数产生浪费资源
Median-of-Three Partitioning
- pivot = median(left, center, right) 取其中的中位数

7.3 Partition Strategy#

Double Pointer
- i 和 j，从两边往内走，直到一个
如果 pivot = key
- i 和 j 都停下来进行 swap
  - 虽然可能存在不必要的交换 \(\{1,1,1,1,1,1,1,1\}\)
  - 但是至少保证了划分的两部分大小相近
    - 否则 \(T(N)=O(N^2)\)

7.4 Small Arrays#

Problem: Quicksort is slower than insertion sort for small \(N\le 20\)
Solution: N 比较小时，就调用 insertionSort

7.5 Implementation#

ElementType Median3( ElementType A[], int left, int right )
{
    int center = (left + right) / 2;
    if(A[left] > A[center]) Swap(&A[left], &A[center]);
    if(A[left] > A[right]) Swap(&A[left], &A[right]);
    if(A[center] > A[right]) Swap(&A[center], &A[right]);
    Swap(&A[center], &A[right-1]); // hide pivot
    return A[right-1];
}

void Qsort( ElementType A[], int left, int right )
{
    int i, j;
    ElementType pivot;
    if(left + CUTOFF <= right){
        pivot = Median3(A, left, right); // select pivot
        i = left; j = right - 1;
        for(;;){
            while(A[++i] < pivot);
            while(A[--j] > pivot); // stop at invalid element
            if(i < j) Swap(&A[i], &A[j]); // adjust partition
            else break; // partion done
        }
        Swap(&A[i], &A[right - 1]); // restore pivot
        Qsort(A, left, i-1);
        Qsort(A, i+1, right);
    }else InsertionSort(A + left, right - left + 1)
}

void Quicksort( ElementType A[], int N )
{
    Qsort(A, 0, N-1);
}

Note: Median3 调用后，左中右的三个元素已经排好了，但是由于 ++i --j，实际上跳过了这来哥哥元素。

7.6 Analysis#

\(T(N) = T(i)+T(N-i-1)+cN\)
Worst case: \(T(N)=T(N-1)+cN=O(N^2)\)
Best case: \(T(N)=2T(N/2)+cN=O(N\log N)\)
Average case
- Assume the average value of \(T(i)\) for any \(i\) is \(\frac{1}{N}[\sum_{j=0}^{N-1}T(j)]\)
- \[T(N)=\frac{2}{N}[\sum_{j=0}^{N-1}T(j)]+cN=O(N\log N)\]

Quicksort to find the kth largest element

每次找到 pivot 进行 partition 之后，计算大的一边的元素个数，判断第 k 大的元素在哪边，直到其出现在 pivot
\(T(N)=O(N)\) Linear

Space Complexity 等于递归深度 \(O(\log N)\)
- 最坏情况 \(O(N)\)

8 Sorting Large Structures#

Solution: Add a pointer field to the structure and swap ptrs instead
Table Sort
- 对于所有的 key，先创建一个 table[]，用来存储目标的 index
  - 构成了 cycle，也就是通过 list index 和 table 目标位置构成的需要交换的 cycle!
  - 具体操作时
    - 先进行 tablesort
      - table 初始化为 [0, 1, 2, 3, 4, 5]
      - 依据 key 对 table 中的 index 进行排序
    - 然后交换 key
      - 遍历 list，只要出现了 list index != table element，那么就进行 cycle exchange
        
        记录这个 entry 以及这个 key
        
        进行轮换，直到遇到一个元素的 table element == entry 结束
- Note: Every permutation is made up of disjoint cycles
Analysis
- worst case, \(\lfloor N/2\rfloor\) cycles and \(\lfloor 3N/2\rfloor\) record moves
- \(T=O(mN)\) where \(m\) is the size of the structure

9 A General Lower Bound for Sorting#

\(Theorem\) Any algorithm that sorts by comparisons only must have a worst case computing time of \(\Omega(N\log N)\)
- Can be proved by decision tree

10 Bucket Sort and Radix Sort#

10.1 Bucket Sort#

对于离散、取值范围很小的数据
创建所有数据取值点的 bucket，每个称为 slot槽
线性遍历数组，是哪个就放在哪个 slot
\(T(N, M) = O(M+N)\)
- 如果 \(M\) 远大于 \(N\)，那么效率会很低，不如 quicksort

10.2 Radix Sort#

__assets/Ch.06 Sorting/IMG-Ch.06 Sorting-20250125005012173.webp

进行多轮排序，每次排一个数位，Least Significant Digit First
\(T=O(P(N+B))\)
- \(P\) 是 number of passes，也就是位数
- \(B\) 是 number of buckets

10.2.1 MSD (Most Significant Digit)#

高位分到 bucket 之后，每个 bucket 内部
- 可以使用任意的排序算法
- 可以并行操作

10.2.2 LSD (Least Significant Digit)#

高位的排序依赖于低位的排序
只能串行进行

11 HW#

After the first run of Insertion Sort, it is possible that no element is placed in its final position
- T
Shell sort is stable.
- F
- stable 指的是相同的元素在排序前后的顺序是不变的
  - 假设存在两个相等的元素，排序之后原本在前面的还是在前面
Mergesort is stable. T
During the sorting, processing every element which is not yet at its final position is called a "run". Which of the following cannot be the result after the second run of quicksort?
- 注意
  - 这里 quicksort 的 run 指的是递归深度，两层递归能够选定 3 个 pivot 位置
  - 这里的 quicksort 实现方式不是选中位数
- A 可能是 72 28
- B 可能是 2 72
- C 可能是 2 28
- D 不可能，只有 32