A Practical External Sort for Shared Disk MPPs
Published in the Proceedings of Supercomputing '93
by Xiqing Li, Gordon Linoff, Stephen
J. Smith, Craig Stanfill, and Kurt Thearling
Abstract: An external sort has been
implemented and analyzed for a shared disk MPP computer system.
In this implementation, we have considered many real world
constraints. Decision support functionality in database systems,
for instance, often requires that external sorting be done in
place on disk, support variable length records, and be
restartable from any point of interruption with no loss of data.
These three constraints, along with the more standard
requirements of speed and stability, affect the choice and
implementation of the external sorting algorithm. The
implementation of the sample sort algorithm described here meets
these requirements. Although written using high level file
processing directives, the implementation sorts a 10 GB file in
1.5 hours on a 64 processor Connection Machine CM-5 with a
DataVault disk system.
1.0 Introduction
Our intention in building an external sort (where the size of the data to be sorted is larger than main memory) was to produce a system that met the requirements of real world problems and could be implemented in a reasonably short period of time with a high level programming model. Given these goals we pursued an external version of the sample sort algorithm, at times sacrificing performance in favor of simplicity and portability by using existing high level file access routines (close, open, read, write, seek, append, delete and truncate). Applying this strategy, a sorting system capable of sorting 10GB of data in under 1.5 hours was designed and developed in about 6 staff-months. Because the code is written in C using a MIMD (Multiple Instruction Multiple Data) message passing library and high level file access routines, the algorithm and our results should map well to other parallel computers with single-input single-output disk access.
1.1 Practical requirements
In designing and implementing a practical external sort useful for scientific and commercial databases, we determined that the following features in addition to speed and stability were important:
We assumed that records would consist of two parts: a fixed-length header containing the length of the record followed by a variable number of fixed-length fields. The start of a record could be determined only by extracting the position and length of the preceding record. Such a format is common for data files produced by serial machines but it proves to be a bit of a nuisance when trying to find record starts on a parallel machine. However, the fact that record location in this case is an inherently serial process did not significantly affect total system performance when implemented on a massively parallel computer.
The requirements that the sort be in-place and restartable are complementary. To be in-place, sorted data must overwrite the original data. Since it is critical that the original data not be lost, the sort must be able to be restarted from any point in the algorithm. The algorithm itself is divided into various steps. Between steps, important data is stored in a checkpoint file which allows the sort to resume from the checkpoint.
Though our intent was to provide a real world sort we did assume some simplifications for the construction of keys. Specifically:
These two assumptions about the keys were made so that we could focus on the algorithmic details of the sort. Internally, we chose a local quicksort and an internal sample sort instead of a radix sort in order to allow the sort to be comparison based and eventually support complex composite keys [5]. Such composite keys might consist of data from anywhere within the record and with a variety of data types (floating point, integer, ASCII, EBCDIC) and sorting direction (ascending or descending).
2.0 The external sample sort algorithm
External sorting algorithms have recently been placed into two classes: those that perform a full, in memory, internal sort before redistributing the records and those that perform a redistribution of the keys followed by an internal sort [2][14]. Mergesort and external hyperquicksort fall into the first class [23], and sample sort [6] falls into the second class. In either case the response time of the system is often determined by the permutation of the data on disk and not by the in memory sort itself. This is especially true in many real world applications where the keys make up only a small portion of the total record length. In these cases, all the keys can often fit in the main memory of the computer at one time and can be sorted there in a single pass. This seeming advantage provides little or no benefit in reality since a randomized permutation on disk is no faster than a sort except for unrealistically small memory sizes [1]. Thus in terms of overall speed for the disk permutation there are two dominant factors - the number of times the data is moved between disk and memory and the number of random disk accesses required for these moves.
The goals of our external sort further limited the choice of algorithm. In particular, the requirements of being in-place and restartable strongly suggested a sample sort algorithm. To reach the performance goals, we had to minimize the amount of I/O even on very large data sets. For these reasons we chose a sample sort algorithm that required three reads and two writes of the data but is O(D^{2}) in the number of disk accesses ( D = data size in bytes).
For the largest data file actually sorted for this paper (10GB), the quadratic cost of disk accesses was not the dominant factor but did approach 25% of the total time. For significantly more data or smaller memory the time spent in random disk access would become the dominant cost. The sample sort algorithm can be extended to reduce this cost of disk access by performing multiple passes of the algorithm, effectively trading the cost of more data movement for fewer disk accesses. Another advantage of sample sort is that it well fit our file system primitives and provided a simple, high-level implementation.
2.1 The algorithm
The sample sort algorithm proceeds by first partitioning the records to be sorted into several bucket files, each smaller than main memory. The records in one bucket file are ordered with respect to the records in the next (ordered for example by “less than”). These bucket files are then loaded into main memory, sorted and appended to a growing sorted file of all the data.
In order to perform the partitioning of the data into ordered buckets, splitting values must be determined that define the boundaries between each bucket. For example, consider three buckets defined by the two splitters 362 and 1098. All records whose keys < 362 go into bucket 0; records where 362 <= key < 1098 go into bucket 1; and all other records go into bucket 2. To obtain the splitters that perfectly partition the records between buckets would require a sort in and of itself and thus defeat the purpose of finding the splitters. To circumvent this, a sampling of the keys is performed to estimate the splitters. To further refine this estimate, keys are often sampled at a much higher rate than actually required (oversampling). These candidate splitters are then sorted and only every j^{th} candidate splitter is chosen as a splitter of the actual bucket files (where j equals the oversampling ratio). Oversampling provides a better estimate of the perfect splitters and bounds can be placed on the variance of the splitters [6].
The sample sort algorithm, and its variants, has been implemented and improved on several times for internal (in memory) sorting on MPP systems [6][12][13][15][16] though it has not been often used for external sorting. A recent article has, however, shown an implementation for a shared-nothing disk architecture (a local disk per processing node) [11].
The shared-nothing architecture was shown to be useful for sample sort like algorithms using probabilistic splitting [11]. In their paper DeWitt et al. use a variant of the sample sort algorithm for a “multiple-input multiple-output” sorting problem where the data file is initially distributed across the disks. The final sorted file is required to end up partitioned into nearly equal sized sorted runs on each disk with all keys on one disk ordered with respect to all keys on the next disk. Their algorithm proceeds by gathering splitting candidates randomly from each disk and determining the splitters via a central coordinator. The splitters are then broadcast to each node and a distribution stage is entered where each record is directed via the splitters to the correct processing node. Each node collects the incoming records in a buffer and when the buffer is full, sorts it and then writes it out to disk. These sorted runs are later merged at each node.
Several modifications of the internal sample sort algorithm also needed to be made for the shared disk implementation presented in this paper. Specifically, the requirements of keeping the sort in place and restartable affected the overall design. There are three main stages of the external sample sort algorithm (see figure 1):
Figure 1: The three stages of the in-place, external sample sort. In stage 1 the unsorted data file is read to collect candidate splitters and to mark boundaries between variable length records (posts). In the second stage the original file is read backwards and truncated as the data is distributed to the buckets as determined by the splitting values. In the third stage each bucket file (which will fit entirely into memory) is sorted in turn and deleted as the final sorted file grows.
This algorithm will perform an in-place sort (free disk space equal in size to the memory is required as a buffer) and can be made restartable by writing out checkpoint information during each stage. In stage 2, we truncate the original data file after each memory load because the records in that chunk have already been added to the bucket files (since we can only truncate from the end of the file this necessitates reading the file backwards). In stage 3, we delete each successive bucket file as the final sorted file grows because each record in the bucket file has already been added to the final sorted file. Because all of the original data is always resident somewhere on disk (though at some points there may be up to one memory load of duplicate data), we do not have to worry about specifically backing up or checkpointing the data, which could otherwise become expensive and complicated.
This algorithm also has the advantage over mergesort in that it flushes memory of all data on each step within each stage. In mergesort, data loaded into memory from several sorted runs usually cannot all be written directly to the sorted file. This remaining data in memory must then be elaborately checkpointed to determine exactly how much has been written to the sorted file and how much must still be retained in the original data file. A possible solution to this problem, though we have not seen it implemented, would be to sample the sorted runs as in [2] and to only load into memory data from each run that would fit in the current partition. This would, however, also require searching the sorted runs for splitter break points, and since it could only be based on partitions defined by sampling, there would always be the possibility that the partition would overflow usable memory.
For the sample sort algorithm described here the checkpoint data needed to restart and continue the sort at each stage is:
In addition to the constraints of restartability and keeping the sort in place, there is always the danger that the key sampling will be poor and a bucket file will be too large to fit in main memory. Internal sample sorts often solve this problem by resampling the original data and resplitting the data into buckets. In the case of an in-place external sort, restarting in this way is not possible since the original data has already been overwritten. The solution is to recursively call stages 1 and 2 of the external sorting algorithm on the overflowing buckets in order to divide them into smaller buckets that do fit into main memory.
2.2 The Internal Sort
The internal sort used in stage 3 is also a sample sort. We will here quickly review this internal sample sort algorithm and the differences between it and a similar implementation on a parallel computer slightly different from the one described in [6] (see [10][21] for MPP implementations of other algorithms).
For large data sets with large keys (> 8 bytes) previous work has shown that sample sort can be a superior algorithm to radix, bitonic, quicksort, and others [6]. Since these constraints closely follow those of our current problem, sample sort was used for the internal sort in stage 3 of the external sort. Note that a fast radix sort capable of sorting 1 billion 32-bit keys in under 18 seconds has been implemented on the CM-5 [22]. Because of the linear dependence of the radix algorithm on key length and the eventual need for a comparison-based sort for general keys, we abandoned radix and implemented the sample sort algorithm instead.
The main idea of the internal sample sort is the same as for the external version - redistribute the original data to buckets where every key in each bucket is ordered with respect to every key in the next bucket, and then sort the buckets. In this case the buckets are not stored in files but in the memory of each individual processor. The splitters between the buckets are again determined by oversampling. There are four main stages in the internal sample sort:
The main differences between the current internal sample sort and that described in [6] are:
2.3 Detailed algorithm description
Below is a description that provides the necessary detail to
implement the external sample sort in-place and to fulfill the
restartability requirement.
3.0 Implementation and performance
This external sort was written in C running on each parallel node, with a MIMD message passing library and file system primitives provided by the CM-5 CMMD package [4][17][8]. Users write standard C code that runs independently on each processing node, interspersing interprocessor communication and synchronization calls where required. All disk data transfers were accomplished with a file system model of the disk (e.g. the buckets for the sample sort are files rather than locations within files or specifically mapped sections of the disk). This high level model allowed simpler implementation of the system although it hid some important issues such as fragmentation which eventually need to be addressed for very large files.
The CM-5 consists of from 16 to 16,384 processing nodes, each using the SPARC chip set with accompanying optional ASIC vector units. The external sort does not currently use the vector units. The processing nodes are independent and have their own local memory of up to 32MB. They communicate with each other via two networks: the Data Network and the Control Network. The Data Network is used for point-to-point processor communication. The Control Network is used for broadcasting data, synchronization, and parallel prefix operations (scans).
The CM-5 disk system used for these experiments was the DataVault. It is a RAID (Redundant Array of Inexpensive Disks) level 2 disk system (32 data disks, 7 ECC disks, and 3 spares) capable of 20 MB/sec for data transfer [20]. The DataVault transfers data to the processing nodes over the data network and thus is physically and computationally distinct from the processing nodes (see Figure2). A higher performance, scalable, RAID level 3 disk system is now available on the CM-5, though the results presented in this paper were obtained by running on a DataVault.
Figure 2: Shared disk MPP with RAID 2 disk system. Performance figures presented in this paper were obtained on a Connection Machine CM-5 with a DataVault disk system.
Multiple DataVaults can be added to a CM-5 for increased bandwidth; they attach at available addresses on the data network similar to the way a processing node does. For this research our CM-5 consisted of 64 processor nodes, each with 16 MB of local memory (1 GB total), no vector units, and a single DataVault shared among the 64 processing nodes. Because only a single DataVault was used and its disks were not accessed independently this algorithm would be classified as a single-input, single-output shared disk system (as opposed to the multiple input, multiple output, “shared-nothing” architectures reported by others [11][18]).
The average length of the records for the performance benchmarks reported was 2,300 bytes and varied uniformly between 600 and 4,000 bytes. Each key was an unsigned integer of 4 bytes. The system was tested on 6 different sized files of 1, 2, 3, 5, 7, and 10GB and the results are shown in Table1:
Data (GB) |
Stage 1 |
Stage 2 |
Stage 3 |
Total |
1 |
71 |
156 |
218 |
0.12 |
2 |
136 |
310 |
418 |
0.24 |
3 |
201 |
529 |
607 |
0.37 |
5 |
316 |
921 |
1042 |
0.63 |
7 |
479 |
1507 |
1559 |
0.99 |
10 |
671 |
2457 |
2266 |
1.49 |
Table1 shows that as the database size increases the time spent in stage 1 increases nearly linearly but the time spent in stage 2 and 3 increases much more quickly. We expect that this is due to the quadratic growth in the number of disk accesses with respect to database size that occurs in stage 2. We might also expect to see a similar effect in stage 3 due to disk fragmentation as the larger files took up most of the available space on disk. (There was approximately 12GB of total available disk space leaving only 2GB of free space for the 10 GB file; fragmentation likely occurred).
4.0 Analysis
4.1 I/O model
Moving data between disk and memory has two primary components that determine the time spent: the overhead for accessing the disk (including movement of the disk head and rotational latency) and the data transfer time. Previous analyses of sorting algorithms have combined these two costs into a single measure describing the transfer of an I/O or disk block [1][9][19]. The assumption has been that I/O can be divided into fixed-sized blocks, each of which is large enough so that the data transfer time dominates the overhead. This is an appealing model since it allows an algorithm’s performance to be modeled solely on the number of block I/Os instead of having to consider the amount of data transferred and the number of accesses separately. There are, however, several disadvantage to using a fixed block size for analyzing the algorithm presented in this paper.
In our case the data transfer rate from disk is approximately 20 MB/sec and though there are no preset restrictions on the block size the latency of the DataVault is relatively high, roughly 200 ms for reads. If a preselected block size were used such that latency were only 10% of the total transfer time, the block size would have to be 36 MB (2 seconds total access time consisting of 0.2 seconds for latency and 1.8 seconds for data transfer). This is too large for most applications of the sample sort algorithm. There will in fact be cases where we will want to allow the fraction of time spent in disk access to grow or shrink in order to optimize the performance over the entire algorithm. For example, the most I/O intensive part of the algorithm is stage 2 where chunks are read into memory and split into buckets. We will call the amount written to each bucket from a single chunk a bucket-chunk. It represents the maximum amount of data that can be written out at one time during this stage of the algorithm. If the size of a bucket-chunk is ever smaller than the preset block size, bandwidth will be used inefficiently. This can happen for large data sets.
For the results shown in this paper the chunk size was 300 MB and the average bucket size was 240 MB (smaller than the chunk size to allow for variations caused by sampling). The bucket-chunk size, though, varies depending on the number of buckets and hence on the size of the input. For a 1 GB file, each chunk of 300 Mbytes is written to 5 buckets, so the average size of a bucket-chunk is 60MB. For a 5 GB file, there are 22 buckets and the average bucket-chunk size is 14 Mbytes. For a 10 GB file, the average is only 7 Mbytes. When there is much more than 1 Gbyte of data, the average size is considerably smaller than the 36 MB chosen to minimize latency. Using a fixed block size in such cases wastes bandwidth.
Fortunately, the shared-disk system allows arbitrary and dynamic block sizes for I/O. We have chosen to model our algorithms by separating the access latency and data transfer costs. This more detailed analysis shows that latency becomes a significant factor as the data grows. At certain ratios of data to memory size, the splitting phase of the sample sort should be performed in multiple passes, paying the expense of moving the data more often in order to dramatically reduce the cost of random disk access.
4.1 Data storage/transfer hierarchy
The transfer of data between disk and memory coupled with disk
access latencies are the dominant costs of this and most other
parallel external sorting algorithms. The remaining costs usually
decrease dramatically (often by more than a factor of 10) as the
data become more and more localized to the individual processing
unit. A useful model of an external sort could thus be made by
modeling all “computation” as data movement.
We would like to explore such a simplified model to analyze the
external sample sort algorithm. For a typical serial machine,
these levels of data storage for memory access consist of the
disk, the memory, cache memory, and registers. For the CM-5, a
similar hierarchy exists with some additional levels of
distinction representing other important differences in the rates
at which the data can be moved. The important data storage and
transfer levels for the CM-5 are shown in the Table 2 below:
Storage Access Level |
Transfer Rate |
Access Latency |
Disk |
0.020 |
0.2 |
Interprocessor |
0.256 |
.000004 |
Intraprocessor |
1.024 |
negligible |
Intraprocessor |
16.384 |
negligible |
To model the performance of a system, as we have above, requires simplifying many of its complex behaviors that take advantage of data locality such as caching, and DRAM paging. For our purposes, the simplified level of abstraction should model the system adequately. The important aspects of the system to note are:
Given this we can see, for instance, that if disk I/O is used
as frequently as other operations, then the I/O is the dominating
cost and that improving the processing performance (by adding
vector units, for instance) would have little impact on overall
system performance.
The model
We would now like to build a model to predict where the time will
be spent for different amounts of data. The following are
important parameters for our model:
P | Number of processors | |
D | Total data (GB) | |
K | Key length + record number (GB) | |
K' | Key length + record number + auxiliary location information (GB) | |
R | Record length (GB) | |
M_{d} | Usable memory for sampling and distribution (stages 1,2) (GB) | |
M_{s} | Usable memory for internal sort (stage 3) (GB) | |
B_{d} | Disk data bandwidth (GB/sec) | |
B_{i} | Interprocessor data bandwidth (GB/sec) | |
B_{m} | Intraprocessor data bandwidth (GB/sec) | |
L_{r} | Total disk access latency for reads (secs) | |
L_{w} | Total disk access latency for writes (secs) |
Note that the interprocessor (between processor) and
intraprocessor (in memory) bandwidth refers to the bandwidth for
the entire computer, not for individual processors. Note also
that not all of the system memory can be used to store the data
during the internal sort and the distribution to buckets.
Consequently we have broken up the memory into two different
parameters (M_{s} and M_{d}). This is because the
portions of the algorithm used in these sections are not
completely in place. For instance the internal sort requires a
buffer slightly larger than the size of the data to be sorted.
We can now model the external sort based on the assumptions
already made about the performance of the CM-5. The model of the
time for each stage of sample sort is as follows:
The time of the first stage (line 1.1 above) is modeled by the costs of reading all of the data from disk (D/Bd) and the cost of rearranging it across the processors (D/Bi). This rearrangement of the data is necessary to accommodate load balancing between the processors and to allow for the disk data structure to be independent of the number of processors on the CM-5. The final term on line 1.1 reflects the costs of disk access per chunk where the total number of chunks in the file is Ceiling(D/M_{d}).
In stage two there are three main sections: line 2.1 where the data is read in by chunks (as in stage 1), line 2.2 where the data is locally ordered by a quicksort and then split into buckets, and line 2.3 where the local buckets on each processor are written to the buckets resident on disk. It is important to note that line 2.3 contains a term that is quadratic with respect to the size of the database. This term arises because for every chunk that is read into memory the data must be distributed to every bucket and there are Ceiling(D/M_{s}) buckets and Ceiling(D/M_{d}) total chunks. When the access cost of a write and/or the ratio of the database size to usable memory size is high, this will be an important term in the total cost of the algorithm.
The equation for the cost of the local quicksort on line 2.2 can be derived in the following way. The sort of the records is accomplished by sorting pointers to the records based on the keys and then locally reordering the records based on this rank. Locally reordering the records contributes the D/B_{m} term. The ranking of the records is derived by observing that there will be D/M_{d} parallel quicksorts in all and for each of these there will be M_{d} / PR records per processor. Each quicksort will then make log(M_{d}/PR) log(Md/PR) comparisons of K gigabytes of data (where log = log2). Since the data bandwidth of the local memory of each processor is B_{m}/P the equation would be:
which simplifies to the second term on line 2.2.
We model the key comparisons within the quicksort as simple data movement even though they will be somewhat more expensive and there should be some multiplicative constant for this term. This abstraction does not seriously compromise the model, however, as the cost of the intraprocessor data movement is minimal in comparison to the costs of interprocessor and disk data movement. The dominating terms of this model will, in fact, be the reads and writes of all the records that occur in each stage and the quadratic number of disk seeks that occur in stage 2.
Stage three is modeled by a read (line 3.1), an internal sort (line 3.2), and a write (line 3.3). The read and the write are similar to those in stages 1 and 2, except that the cost of disk access is dependent on the number of buckets, not on the number of chunks (lines 1.1 and 2.1) or on the product of the number of chunks and buckets (line 2.3)
The internal sample sort of stage 3 (line 3.2) is modeled by an intraprocessor quicksort (similar to line 2.2), two sends of the keys and auxiliary location information, and a single interprocessor permutation of the full record data to sorted order.
4.4 Comparing the model to experimental results
It is often important and instructive to compare any theoretical results with an actual implementation as was done in [3][7]. This model of the sample sort is compared to the actual performance data in figure 3. The model proves to be a good match for the first stage of the algorithm, but the costs in the second and third stages grow at a faster rate than predicted by the model. Since the only non-linear term in the model is the latency of disk access, we may consider that the actual cost of random disk access may be higher than the 200ms that was directly measured. We further expect that some of the mismatch between the model and the actual timings in stage 3 is due to file fragmentation, which would also contribute a quadratic disk access term not explicitly included in the model. A multipass version of the external sample sort can, however, decrease this quadratic number of disk accesses and the degree of file fragmentation.
Figure 3: Comparison of the performance model and actual performance. The model of the sample sort algorithm fits the actual data well for the first stage. For the second and third stages the actual time to sort is growing faster than the model. Since the only non-linear term in the model is latency of disk accesses, it is hypothesized that the model underestimates this time or should have another non-linear term, perhaps to account for disk fragmentation.
Since the model is a relatively good match to the actual performance,we can use it to see how much time is being spent in moving the data at each level of the memory hierarchy. Figure 4 shows the percentage of time spent in moving the data at each of these levels. It also includes the latency of disk access which, though not a dominant term for these size runs, is growing at a greater rate than the other factors.
Figure 4: Costs of data movement from model. This figure shows that data transfer between disk and memory is the dominating cost for the external sort over the region studied. Data movement between processors contributed some 15% and local data movement approximately 1% of the total time.
Figure 5 shows the model when the database sizes are allowed to grow up to a terabyte while the memory is held constant. Here the quadratic number of disk accesses becomes the dominating term for large databases and methods to minimize the number of accesses will be critical. It may be, however, that the memory of the computer will be increased proportionately to the growth of the database and thus the cost of disk access could be held nearly constant. In this case, as in the cases we ran experimentally, the data transfer between disk and memory would need to be minimized; speeding up other sections of the algorithm would have little effect on the overall performance of the system.
Figure 5: Differing costs of data movement with database size. The model was run for databases from 0.001 GB to 1000 GB and the amount of time spent at each level of the memory hierarchy was recorded. This graph shows that if memory size is held constant as the database continues to grow, that eventually the disk access latency becomes the dominating cost. Large databases pay heavily for disk access because the number of buckets is so large and very small databases also pay heavily for latency because of the very small data transfers used in reading and writing the files.
5.0 Conclusion
The research presented in this paper reflects the effort to produce a practical external sort on existing shared disk MPP computer systems. We have shown that several of the real world constraints of both scientific and commercial users, such as keeping the sort in place, can have dramatic effects on the performance. Specifically, we have shown that, though it is relatively easy to perform an external sort in three full reads and two full writes, the cost of random disk access can quickly become the dominant factor as the number of accesses grows quadratically with the database size. There may be ways to reduce this cost and the cost of file fragmentation by perhaps trading off multiple reads and writes for a reduced number of disk accesses, but there may be even better ways to control file fragmentation on disk. Unfortunately, they require low level control of the disk system which, though of interest, would have limited the portability of our algorithm and the usefulness of our results to other shared disk MPP systems.
There are still several open questions that would be of benefit to answer. We would like to know, for instance, what effect a multiple pass version of the external sample sort algorithm would have on the number of disk accesses and on file fragmentation. If we could then incorporate this modification into our model it would be possible to optimize the number of passes in the multiple pass sample sort algorithm. We would also like to extend the sort to include complex keys, which is an important real world constraint that we considered but did not implement. Despite these important further improvements, this research has nonetheless already contributed a simple high level algorithm for external sorting that is relatively easy to implement and should map well to many shared disk MPP architectures. The parametric models, in fact, should allow for high level performance evaluations and optimizations of this algorithm even before it is implemented.
Acknowledgments
We would like to thank Michael Berry and Paul Barth for their help in determining the constraints of the sorting problem and Michael Best for his work in providing the high level file system interface. We would also like to thank Dave Waltz, Marco Zagha and Tom Cormen for their review and criticisms of early drafts of this paper.
References
A. Aggarwal, J. S. Vitter. "The Input Output Complexity of Sorting and Related Problems", CACM, vol. 31, no. 9, pp. 1116-1127, Sept. 1988.
B. A. Baugsto, J. F. Greipsland. "Parallel Sorting Methods for Large Data Volumes on a Hypercube Database Computer", Proc. of the Sixth Intl. Workshop on Database Machines, Springer-Verlag, pp. 127-141, 1989.
M. Beck, D. Bitton, W. K. Wilkinson. "Sorting Large Files on a Backend Multiprocessor", IEEE Transactions on Computers, vol. 37, no. 7, pp. 769-778, July 1988.
M. Best, A. Greenberg, C. Stanfill, L. Tucker. “CMMD I/O: a Parallel Unix I/O”, Submitted to: IEEE 7th International Parallel Processing Symposium, 1993.
G. E. Blelloch, L. Dagum, S. J. Smith, K. Thearling, M. Zagha. "An Evaluation of Sorting as a Supercomputer Benchmark" Submitted to: International Journal of High Speed Computing, 1993.
G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith, M. Zagha. "A Comparison of Sorting Algorithms for the Connection Machine CM-2". 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1991, Hilton Head, SC, pp. 3-16.
P. Carnevali "Timing Results of Some Internal Sorting Algorithms on the IBM 3090". Parallel Computing, 6, North-Holland, 1988, pp. 115-117.
Thinking Machines Corporation. CM-5 Technical Summary. January 1992. Cambridge, MA.
T. Cormen. "Fast Permuting on Disk Arrays", Journal of Parallel and Distributed Computing (to appear)1992.
L. Dagum. "Parallel Integer Sorting With Medium and Fine-Scale Parallelism". International Journal of High Speed Computing (to appear), 1993.
D. Dewitt, J. F. Naughton, D. A. Schneider, "Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting". Proceedings of the International Conference on Parallel and Distributed Information Systems. Miami Beach, Florida. IEEE Computer Society Press. 1991.
W. Dobosiewicz. "Sorting by Distributive Partitioning". Information Processing Letters. v. 7, no. 1. January, 1978.
W. D. Frazer, A. C. McKellar. "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting". Journal of the Association for Computing Machinery, v. 17, no. 3, July, 1970, pp. 496-507.
G. Graefe. "Parallel External Sorting in Volcano", University of Colorado Technical Report, CU-CS-459-90, Boulder, Colorado, 1989.
W. Hightower, J. F. Prins, J. H. Reif. "Implementations of Randomized Sorting on Large Parallel Machines". Proceedings of the 4th ACM Symposium on Parallel Algorithms and Architectures, July, 1992.
J. S. Huang, Y.C. Chow. "Parallel Sorting and Data Partitioning by Sampling". Proceedings of the IEEE Computer Society’s Seventh International Computer Software and Applications Conference. pp. 627-631, November 1983.
C. Leiserson, et.al. “The Network Architecture of the Connection Machine CM-5”, 4th Annual ACM Symposium on Parallel Algorithms and Architectures, July 1992, .
R. A. Lorie, H. C. Young. "A Low Communication Sort Algorithm for a Parallel Database Machine". Proceedings of the Fifteenth Conference on Very Large Data Bases, pp. 125-134, Amsterdam, 1989.
M. Nodine. "Greed Sort: An Optimal External Sorting Algorithm for Multiple Disks", Brown University Technical Report No. CS-90-04, 1990.
D. Patterson, G. Gibson, R. Katz. "A Case for Redundant Arrays of Inexpensive Disks (RAID)". Proceedings of ACM SIGMOD, Chicago, pp. 109-116, 1988.
J.F. Prins, J.A. Smith. "Parallel Sorting of Large Arrays on the MasPar MP-1." Proc. of 3rd Symposium on Frontiers of Massively Parallel Computation, College Park, MD IEEE, October, pp. 158-167, 1990.
K. Thearling, S. J. Smith. "An Improved Supercomputer Sorting Benchmark." Proceedings of Supercomputing ‘92. 1992.
B.A. Wagar. “Hyperquicksort: A Fast Sorting Algorithm for Hypercubes”. In M.T. Heath, editor, Hypercube Multiporcssors 1987 (Proceedings of the Second Conrference on Hypercube Multiprocessors), pp. 292-299, Philadelphia, PA, 1987. SIAM.