Recently, I needed to implement an external sorting algorithm that would work well on “modern” hardware (lots of CPU cores, fast NVMe devices). I did come up with an interesting variation of External Merge Sort.
I’ve made a visualization for the 3 stages of the algorithm:
- Run sorting:
- Split the input into chunks (runs) that fit in memory.
- Sort the run in memory.
- Sample values as pivot candidates.
- Parallel multi-way selection:
- Select global pivots from the pivot candidates chosen in the first stage.
- Using binary search, find the lower bound for each global pivot in each sorted run.
- Parallel merging:
- Using the global pivot offsets in the sorted runs, compute the length and offset.
- Run parallel merge for each global pivot.
This algorithm efficiently utilizes all available CPUs and IO channels throughout each stage and is asymptotically optimal.
- A. Aggarwal and J. S. Vitter, “The input/output complexity of sorting and related problems,” Commun. ACM 31, 9 (Sept. 1988)
- J. S. Vitter, “Algorithms and Data Structures for External Memory, Foundation and Trends in Theoretical Computer Science,” vol 2, no 4, 2006
- G. Graefe, “Implementing sorting in database systems,” ACM Comput. Surv. 38, 3 (2006)
- L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava, “Fundamental parallel algorithms for private-cache chip multiprocessors,” SPAA 2008
- W. Chen,Y. Liu, Z. Chen, F. Liu and N. Xiao, “External Sorting Algorithm: State-of-the-Art and Future Directions,” IOP Conf. Mater. Sci. Eng. 2020
- M. Rahn, P. Sanders and J. Singler, “Scalable distributed-memory external sorting,” 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), Long Beach, CA, USA, 2010