外部排序模拟

外部排序一般应用于对大文件数据进行排序，文件太大加载不进内存或者很困难。可以利用归并排序分成小文件有序在合并成大文件。在数据库系统中很常见（filesort）

我们先写个简单的程序生成0-10000000的数字，生成1000000个，在一个大文件。

py
import random

with open('numbers.txt', 'w') as file:
    for _ in range(1000000):
        num = random.randint(0, 10000000)
        file.write(str(num) + '\n')

然后进行外部排序，也就是归并排序，先定义块的大小，定义20000个数一个文件，这样会生成50个文件。

写个函数把这个大文件分成50个小文件然后把这50个小文件的数字都排好序，最后进行归并

一般的归并排序如下假如有以下序列

8 4 5 67 54 2 78 89 98 12 34 2

分成三组

8 4 5 67

54 2 78 89

98 12 34 2

排序后

4 5 8 67

2 54 78 89

2 12 98 34

进行两两归并

1.1,2取出最小数2, ---2

2.3,4组去除最小数4 ---2,4

3.1,2组取出最小数5 ---2,4,5...

最后得到有序序列，两两归并肯定慢，所以有了n路归并，但是难道每次要遍历这n个数或者对这n个数排序吗？那确实，没办法，我么可以用一个小根堆来维护这个序列最小值，每次把小文件第一个数放进堆里，这样就好办。

下面是分割文件排序的算法

py
import heapq

# 步骤1: 从原始文件中读取数据块并进行排序
def sort_and_save_chunk(chunk, chunk_num):
    # 对数据块进行排序
    chunk.sort()
    # 将排序后的数据块写入一个中间文件，命名为 chunk_{chunk_num}.txt
    with open(f'chunk_{chunk_num}.txt', 'w') as chunk_file:
        for num in chunk:
            chunk_file.write(f'{num}\n')

# 步骤2: 分割原始文件成多个小文件
def split_file(input_file, chunk_size):
    chunk = []  # 用于存储数据块
    chunk_num = 1  # 数据块编号
    with open(input_file, 'r') as f:
        for line in f:
            chunk.append(int(line))  # 从原始文件中读取数据并存储在数据块中
            if len(chunk) == chunk_size:  # 数据块大小达到 chunk_size 时，执行排序和保存
                sort_and_save_chunk(chunk, chunk_num)
                chunk = []  # 重置数据块
                chunk_num += 1
        if chunk:  # 处理剩余的数据块
            sort_and_save_chunk(chunk, chunk_num)

最后是归并，也就是重要的小根堆处理部分

py
# 步骤3: 归并排序
def merge_sorted_files(output_file, chunk_files):
    with open(output_file, 'w') as output:
        chunks = [open(chunk_file, 'r') for chunk_file in chunk_files]  # 打开所有中间文件
        data = [int(chunk.readline()) for chunk in chunks]  # 从每个文件中读取第一个数
        heap = [(num, i) for i, num in enumerate(data) if num is not None]  # 创建最小堆

        heapq.heapify(heap)  # 将堆转换为最小堆

        while heap:
            smallest_num, file_index = heapq.heappop(heap)  # 从堆中取出最小数
            output.write(f'{smallest_num}\n')  # 将最小数写入输出文件

            next_num = chunks[file_index].readline()  # 从相应文件中读取下一个数
            if next_num:
                next_num = int(next_num)
                heapq.heappush(heap, (next_num, file_index))  # 将下一个数加入堆

        for chunk in chunks:
            chunk.close()  # 关闭中间文件

最后验证我们的代码

避免失误，多次验证，重新生成下随机数就好

算法是正确的

参考：https://blog.csdn.net/ailunlee/article/details/84548950