An Evaluation of Sorting as a Supercomputer Benchmark
NAS Report RNR-93-002
January 29, 1993
by Guy E. Blelloch, Leonardo Dagum, Stephen Smith, Kurt
Thearling, and Marco Zagha
We propose that sorting be considered an important benchmark for both scientific and commercial applications of supercomputers. The purpose of a supercomputer benchmark is to exercise various system components in an effort to measure important performance characteristics. In the past, numerous benchmarks have been defined to measure the performance issues associated with numeric computing. These benchmarks stressed arithmetic operations (in particular, floating-point arithmetic). In recent years, supercomputer manufacturers have started to look more closely at non-numeric processing tasks, such as databases and information retrieval. The ability to operate on large amounts of non-numeric data will be crucial in the future. In this paper we discuss the appropriateness of sorting as a benchmark for non-numeric computing tasks. We describe previous work in this area and formally define a set of architecture-independent sorting benchmarks.
Although sorting is one of the most studied problems in computer science, it has received relatively little attention in the field of supercomputing. Traditional vector supercomputers have been used primarily for numerical analysis and the processing of large, regular arrays. Sorting, on the other hand, is associated with non-numeric applications that typically offer little vectorization and traditionally have been implemented on scalar computers. The newer class of parallel supercomputers, however, does not rely as heavily on vectorization to achieve speedup. As such, these machines have become attractive platforms for very large scale non-numeric applications capable of solving problems that previously were considered intractable. It seems, then, that the inclusion of sorting as a supercomputer benchmark is both timely and appropriate. In this paper we propose a formal, architecture-independent description of sorting applications and introduce a set of specific benchmark cases with which sorting performance can be fairly evaluated. Our hope is that this work will be of benefit in evaluating both sorting algorithms and the computer architectures on which they are implemented.
There are several features of sorting which make it a desirable benchmark. First, it is simply described and well known as a problem; second, it can be easily scaled in size to provide progressively more difficult benchmarks; and last, by its nature, it exercises a system's ability to compare and move large amounts of data, often the most expensive portion of any scientific or commercial application. In a parallel processing system, this ability to move data corresponds to the bisection bandwidth of the system. In both shared and distributed memory machines the ability to move data efficiently will dictate performance in many commercial applications. Sorting can be considered the prototypical benchmark of data-movement performance without having to create a contrived example.
In addition to overall speed there are many issues involved in defining a sorting benchmark, including: stability, determinism, memory efficiency, load balancing of the sorted data and the difficulties in extending the sort to perform a rank or to sort complex non-integer keys. In this paper we use the above six descriptors to define the sort being used and then evaluate its performance along four different dimensions: number of keys, key size, distribution of key values and initial allocation of data to memory.
For these benchmarks we have focused on the likely sorting applications that would be performed on supercomputers available today or in the near future. In general this means that we have focused on relatively large sorting problems, from approximately 100,000 to over 10 billion keys, with keys ranging in size from 8 to 256 bits. This seems to characterize the spread of sorting cases encountered today in both scientific and commercial applications, and pushes just slightly into what we can expect supercomputers to be capable of in the near future. Even an in-memory sort of 10 billion 256-bit keys should be possible soon with technological advances in memory capacity. The current generation of supercomputers is being built with tens to hundreds of gigabytes per system, and in the near future we will undoubtedly see a supercomputer with a terabyte of RAM. We are not that far away today: a number of recently published results describe sorts of at least 1 billion keys on existing supercomputers [3, 39].
The remainder of this paper is organized into four main sections. In the next section we review previous work in using sorting as a benchmark. In the following section we give an overview of some of the most common internal sorting implementations and provide a high-level taxonomy, breaking the various systems into counting sorts (e.g. radix), fixed-topology sorts (e.g. bitonic), and partitioning sorts (e.g. sample sort). The fourth section formalizes the six descriptors that define a ``sort'' and introduces four parameters that, when varied, can have significant impact on sorting performance. The fifth section introduces a set of specific benchmarks helpful in determining the performance of a sorting implementation over a wide range of variations along these four dimensions.
Benchmarks can be considered useful for comparing computer performance only when the benchmark problems are representative of the workload. Because supercomputers traditionally have been employed for numerical analysis, supercomputer benchmarks have emphasized numerical algorithms. None of the early benchmarks, including the Livermore Fortran Kernels , the Linpack Benchmark , the original NAS Kernels , or the PERFECT Club , has included sorting as a benchmark. All of these benchmarks, however, were conceived and designed primarily for vector supercomputers (although recently some results from distributed memory parallel computers have been reported). The only existing supercomputer benchmark designed from the outset for parallel computers is the NAS Parallel Benchmarks . It is significant that this benchmark not only is the most recent, but also is the only one to include a sorting kernel.
Although not the first to suggest sorting as a benchmark for parallel computers, the NAS Parallel Benchmarks represent the first instance where a sorting benchmark has received widespread acceptance by the supercomputing community. Sorting as a parallel computer benchmark was first suggested by Francis and Mathieson . However, the primary goal of that work was to present a parallel merge algorithm with practical application to sorting on a shared memory multiprocessor, and not to outline a sorting benchmark for parallel computers. Sorting on vector computers was in effect used as a benchmark by Rönsch and Strauss , where the sorting performance of several Amdahl and Cray systems was compared. As a benchmark, however, that work met with limited acceptance and performance figures for only two other systems (IBM  and ETA ) were reported. The problem considered for that work was that of sorting N random numbers uniformly distributed in the interval (0,1) using seven different sorting algorithms. Francis and Mathieson also had suggested uniformly distributed random numbers for their sorting benchmark. Unfortunately, few real world data distributions are uniform, and the sorting performance observed on uniformly distributed data generally will not be representative of the performance achieved on non-uniform data distributions. This is especially true on parallel machines where, at least for some sorting algorithms, non-uniformity in the data will lead to poor load balance and consequently poor performance.
The sorting kernel in the NAS Parallel Benchmarks attempts to overcome this deficiency by purposely specifying a non-uniform data distribution, created as the average of 4 random numbers in the interval (0, 2^19]. The result is an approximately Gaussian distribution. The variance can be decreased by increasing the number of random numbers averaged; however, just the one distribution is considered for the benchmark.
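This style of key generation can be sketched as follows. Note this is an illustrative sketch only, not the official NAS pseudorandom generator (the benchmark specifies a particular linear congruential scheme); the function name and parameters are our own.

```python
import random

def gaussian_like_keys(n, max_key=2**19, terms=4, seed=1):
    """Generate keys as the average of `terms` uniform random values in
    (0, max_key]. Averaging more terms decreases the variance, pulling
    the distribution closer to its Gaussian limit."""
    rng = random.Random(seed)
    return [sum(rng.randrange(1, max_key + 1) for _ in range(terms)) // terms
            for _ in range(n)]

keys = gaussian_like_keys(100000)
```

The resulting keys cluster around the midpoint of the range rather than spreading uniformly, which is what makes the distribution moderately non-uniform.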
There are at least three criticisms one can make of this kernel from the point of view of establishing a general sorting benchmark. First, the kernel was originally proposed because of its significance in parallel implementations of Monte Carlo simulations of neutral gases, where only integer sorting is required. For this reason, it is not a general sorting benchmark, but a small integer sorting benchmark. In other words, the benchmark sorts integers in the restricted range (0, 2^19] rather than the full word range (0, 2^32]. This restriction on the range would not be an important criticism, however, if the benchmark did not also allow unstable sorting. An efficient stable small integer sort can easily be used in building a radix sort. For a stable sort, one can extrapolate the performance of a machine for sorting on the full range given the performance on the restricted range. This, however, is not true for unstable sorts. The third and most relevant criticism is that just a single distribution with only moderately non-uniform data is considered. In real world applications the data distributions can have much greater non-uniformity, and to accurately gauge the performance of a system for general sorting problems one would like to benchmark it over a wide range of distributions. This paper describes a methodology for generating distributions of arbitrary non-uniformity to be used for a general sorting benchmark.
Since there are so many diverse approaches to sorting, hundreds of sorting algorithms have been proposed, for both serial and parallel machines. This section reviews some of the most practical parallel sorting algorithms, focusing on algorithms that have already been efficiently implemented on supercomputers. (For a broader treatment of parallel sorting, see the surveys by Akl  and Richards ). The purpose of the section is to outline the current state of the art in practical parallel sorting algorithms so that these might be referenced when looking for an efficient sorting algorithm for a new machine. The sorting algorithms we consider can be categorized into three general classes: counting-based sorts, fixed-topology sorts, and partitioning sorts.
Counting-based sorts work by treating keys as integers in the range [0, m-1]. Unlike comparison-based sorts such as quicksort , counting-based sorts determine the ordering of keys by counting the number of occurrences of each possible value, rather than by comparing pairs of keys. Counting sorts are an attractive alternative to comparison-based sorts since for n keys they run in O(n) time instead of O(n lg n).
Stable counting sorts are important as building blocks for radix sorts, which are used for sorting integers that are too large to be sorted in a single application of a counting sort. Radix-based sorts work by breaking keys into digits and sorting one digit at a time using a counting sort. For example, a 32-bit integer could be treated as four 8-bit digits. The digit size is usually chosen to minimize the running time and is highly dependent on the implementation and the number of keys being sorted. The most common version of radix sort starts from the least significant digit and works only if the ordering generated in previous passes is preserved. In that case the counting sort must be stable.
One way to parallelize counting sort is to assign a different range of keys to each processor. For example, Baber's radix sort for Intel Touchstone Delta  performs a counting sort on values in the range <0...(p-1)> by assigning one bucket to each of the p processors and sending all keys with value i to processor i. While this algorithm works well for uniformly distributed keys, non-uniform distributions can cause a severe degradation in performance, and in the worst case, the algorithm can exceed the available memory. A similar approach to parallelizing counting sort, Dagum's queue-sort  for the Connection Machine CM-2, uses a fixed amount of memory for any distribution, but has a running time that depends heavily on the distribution. Furthermore, queue-sort is not stable because it is based on a parallel communication primitive that enqueues messages in an unspecified order. Thus it can not be used to build a radix sort. However, queue-sort is efficient for its intended application to particle simulation.
Another approach to parallelizing radix sort assigns a separate set of m buckets to each processor, allowing each processor to compute a histogram using only local computation. The histograms are then combined using parallel summing operations . This parallel radix sort algorithm has been efficiently implemented on the CM-2 , CM-5 , and Cray Y-MP . This algorithm has the advantages that it is stable, and that the time to compute the rank of the keys does not depend on the distribution. (However, on some machines the time to permute the keys based on the rank depends on the permutation. See Manzini's version of radix sort  for a version of radix sort that is completely distribution-independent.)
There are two minor disadvantages to radix sort: it does not perform well with large keys, since the running time is proportional to the key size, and it cannot be executed in place (i.e. with no temporary memory) . However, radix sort has several advantages over other sorting algorithms. It is simple to implement, deterministic, load-balanced, stable, fast for short keys, and fairly efficient over a wide range of problem sizes. Furthermore, a radix-based rank operation can be implemented at no additional cost compared to a radix sort.
Fixed-topology sorting algorithms are algorithms that use a fixed interconnection network between the processors, such as a hypercube or a grid, and that require no data-dependent communication patterns.
The earliest and most famous of the fixed-topology sorts is Batcher's bitonic sorting network . There have been several implementations of Batcher's bitonic sort on parallel machines. These include an implementation for the CM-2 , the Carnegie Mellon/Intel iWarp  and the Maspar MP-1 [29, 18] as well as several others [36, 34]. If there are multiple keys per processor and a sequential merge is used after each communication, then the asymptotic running time on a hypercube is O((n/p)(lg n + lg^2 p))  and on a 2-dimensional grid it is O((n/p)(lg n + Sqrt(p))) . Because of the small constant in the algorithm and the simple and oblivious communication pattern, the sort is quite efficient on most parallel machines, and it is often used as the sort to which other sorts are compared.
In addition to bitonic sort, there are several other sorting algorithms that have oblivious routing patterns. Of these, both columnsort  and smoothsort  are reasonably practical when the number of keys is much larger than the number of processors (for p processors, columnsort requires a minimum number of keys per processor to run most efficiently). Columnsort has been implemented on the CM-5 with running times that were not as fast as some of the other sorts on the CM-5 (including radix and sample sorts), but which for a large number of keys were within a factor of 2 of the best running times . Considering that the CM-5 is not a hypercube, for which the sort is designed, this is reasonably good performance.
The main advantages of fixed-topology sorts are that their communication performance is oblivious to the distribution of the keys, and that they are well suited for direct implementation on machines that don't efficiently support dynamic or irregular communication patterns. The disadvantage is that on machines that do have efficient point-to-point communication, the fixed-topology sorts can require more communication than other sorts. An additional advantage of bitonic sort is that it can be executed in place, requiring no additional memory, although this prohibits the use of local merges, making the running time O((n/p) lg^2 n).
Partitioning sorting algorithms select a subset of the keys that partition the data and then use these partition elements to route keys to separate sets of processors. There are two main subcategories of partitioning sorts: parallel quicksorts, and sample sorts.
Several parallel variations of quicksort have been suggested, each of which uses a single key at each level of the recursion to partition the data. The simplest variation runs the recursive calls to quicksort in parallel . This variation does not offer very much parallelism since only a single processor is used for the initial partitioning: this partitioning requires O(n) time, so one can only expect an O(lg n) speedup over the serial algorithm. Wagar suggested a variation called hyperquicksort  that does the partitioning in parallel. The algorithm uses a hypercube connection topology, but the message traffic is not oblivious to the data. This sort initially distributes the keys evenly among the processors and at each step picks a pivot, distributes the pivot across the machine, and sends all the keys less than the pivot to one side of the hypercube and the greater keys to the other (this split is always done across the highest dimension of the sub-cube). This is applied recursively within each sub-cube. It is very important to pick a pivot that closely balances the two halves; otherwise the load on the processors can become extremely imbalanced. Hyperquicksort has been implemented on the NCUBE/10  and the x . On the NCUBE it was shown to be about twice as fast as bitonic sort, but this was based on randomly generated keys (which would be expected to be a good distribution for the sort). Another variation of quicksort allocates a fair number of processors to each recursive call so that picking bad pivots will not lead to load imbalance . This variation is based on using segmented scans and has been implemented on the CM-2. Because of relatively high communication costs it is not competitive with the sample sort algorithm discussed below.
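The hyperquicksort idea can be sketched as a toy simulation, with each "processor" modeled as a Python list; the pairing of processors across the sub-cube is simplified relative to a real hypercube implementation, and this sketch is our own illustration rather than Wagar's algorithm verbatim.

```python
import random

def hyperquicksort(blocks):
    """Toy hyperquicksort on 2^d 'processors' (each a list of keys).
    One pivot is broadcast per step; keys below it move to the lower
    half of the sub-cube, the rest to the upper half."""
    p = len(blocks)
    if p == 1:
        return [sorted(blocks[0])]       # base case: local sort
    pool = [k for b in blocks for k in b]
    if not pool:                          # every block empty: nothing to split
        return blocks
    pivot = random.choice(pool)           # a poor pivot unbalances the halves
    half = p // 2
    low = [[] for _ in range(half)]
    high = [[] for _ in range(half)]
    for i, b in enumerate(blocks):
        for k in b:                       # simplified routing: i pairs with i % half
            (low if k < pivot else high)[i % half].append(k)
    return hyperquicksort(low) + hyperquicksort(high)
```

Because all keys below the pivot land in the lower sub-cube, concatenating the recursively sorted halves yields a globally sorted block ordering; a badly chosen pivot leaves one half with far more keys than the other, which is exactly the load-imbalance risk noted above.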
Another partition-based sort is sample sort [17, 20, 31, 36, 42]. Variations of sample sort have been implemented on the CM-2 , the Maspar MP-1 , the CM-5 , and the Ametek/S14  and are the most efficient sorts for most of these machines. Assuming n input keys are to be sorted on a machine with p processors, the sample sort algorithm proceeds in three phases:

1. Select p-1 splitter keys that partition the key range into p buckets.
2. Route each key to the processor responsible for its bucket.
3. Sort the keys locally within each processor.
If necessary, a fourth phase can be added to load balance the keys, since the buckets do not typically have exactly equal size.
Sample sort gets its name from the way the p-1 splitters are selected in the first phase. From the n input keys, a sample of ps <= n keys are chosen at random, where s is a parameter called the oversampling ratio. This sample is sorted, and then the p-1 splitters are selected by taking those keys in the sample that have ranks s, 2s, 3s, ..., (p-1)s. It can be shown that if the samples are picked at random then it is extremely unlikely that any one bucket is more than a small constant factor larger than the average size bucket ; this is true independent of the initial distribution. Some variations of sample sort use splitters that are chosen deterministically ; these can have very bad performance with certain key distributions.
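The splitter-selection phase can be sketched as follows (the helper names and the linear bucket scan are our own simplifications; the sketch assumes at least p*s input keys).

```python
import random

def choose_splitters(keys, p, s, seed=0):
    """Pick p-1 splitters by sorting a random sample of p*s keys and
    taking the keys of rank s, 2s, ..., (p-1)s. The parameter s is the
    oversampling ratio; assumes len(keys) >= p * s."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, p * s))
    return [sample[i * s] for i in range(1, p)]

def bucket_of(key, splitters):
    """Destination bucket for a key (linear scan for clarity; a real
    implementation would binary-search the sorted splitters)."""
    for i, sp in enumerate(splitters):
        if key < sp:
            return i
    return len(splitters)
```

A larger oversampling ratio s makes the splitters track the true key distribution more closely, tightening the bound on the largest bucket at the cost of a larger sample to sort.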
The main practical advantage of sample sort is that it greatly reduces the communication required over most of the other sorting algorithms. If there are enough keys per processor (more than p) then the splitters can be broadcast to all the processors without a serious overhead, and the data can be routed to its final destination with a single message. However, because of the need to distribute the splitters and the need to sort the sample, it does not work well when there are a small number of keys per processor. The cost of distributing the p splitters to each processor can be alleviated by running multiple passes , but this adds to the communication costs. Another disadvantage of sample sort is that the buckets are not perfectly balanced at the end. This can require extra memory and extra communication to balance the data.
Sorting data is often the dominating cost for any system that makes use of it. It is for this reason that so much has been written about sorting and so many algorithms have been developed and analyzed. For any real world application of sorting, however, the order analysis of an algorithm can often be of secondary consideration compared with the constants involved, and for particular problems certain algorithms may work significantly better than others. We have already seen some of this in the algorithms mentioned above: for uniform data distributions bucket sorts or unbalanced radix algorithms may suffice; for very large keys, counting sorts such as radix are unlikely to be optimal; and for large amounts of data, partitioning algorithms such as sample sort become optimal. Though generally true, these observations provide no systematic way to evaluate various sorting algorithms in the context of both the sorting job at hand and the computer architecture it is being implemented on. In this section we will formalize some existing terminology for describing sorting and introduce and formalize four dimensions of sorting problems that affect performance.
In choosing a sorting implementation for a particular application there are usually only two main constraints:

1. The speed of the sort.
2. Whether the result satisfies the application's definition of ``sorted''.
The first constraint is obvious and perhaps the only one usually considered. This is understandable as speed is important and sorting is computationally demanding. It may then seem peculiar that our second constraint concerns what ``sorted'' really means. So much has been written about sorting that deciding whether a data set is or is not ``sorted'' should be well defined. This is not the case. There are many subtle but important variations on sorting. For instance, questions of stability and determinism may be far more important in the choice of a sorting algorithm than overall speed. Even the term ``sorted order'' is not well defined when considering parallel architectures with distributed memory.
Finding a precise definition of ``sorted order'' that is independent of machine architecture is difficult. For a serial machine it is generally assumed that ``sorted order'' requires neighboring elements of the sorted sequence to be allocated to adjoining memory locations. For a distributed memory machine this corresponds to block ordering, where each neighboring element of the sorted sequence is a neighboring element in each processor's memory except for required breaks between processors. However, it also is possible to allocate the sorted keys in a cyclic ordering, where neighboring elements of the sorted sequence are in the memories of neighboring processors. Either allocation pattern may be optimal for different applications but for our purposes we will assume something like block ordering, where neighboring elements in the sorted sequence are ``near'' each other in memory.
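The two allocation patterns can be illustrated concretely; this is a serial sketch in which each inner list stands in for one processor's memory.

```python
def block_allocation(sorted_keys, p):
    """Block ordering: processor i holds the i-th contiguous block of
    the sorted sequence, so neighbors in the sequence are neighbors in
    memory except at processor boundaries."""
    size = (len(sorted_keys) + p - 1) // p   # ceil(n / p) keys per processor
    return [sorted_keys[i * size:(i + 1) * size] for i in range(p)]

def cyclic_allocation(sorted_keys, p):
    """Cyclic ordering: element j of the sorted sequence lives on
    processor j mod p, so neighbors in the sequence sit on neighboring
    processors."""
    return [sorted_keys[i::p] for i in range(p)]
```

For example, with eight sorted keys on two processors, block ordering yields [[0,1,2,3],[4,5,6,7]] while cyclic ordering yields [[0,2,4,6],[1,3,5,7]].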
With this definition of sorted order we can state three formal requirements that must be met in order for a data set to be considered sorted:

1. The output is a permutation of the input keys.
2. The keys appear in non-decreasing order.
3. Neighboring elements of the sorted sequence are ``near'' each other in memory.
These three requirements for sorting are rather general and cover almost every implementation of sorting on any architecture. To refine the distinction between sorting implementations, several other descriptors can be used. The user must determine whether these features are necessary as they will affect the overall performance. The following questions should be considered:

- Is the sort stable?
- Is the sort deterministic in its running time?
- How memory efficient is the sort?
- Is the sorted data load balanced across the processors' memories?
- Can the sort be extended to perform a rank?
- Can the sort handle complex non-integer keys?
We have already seen how stability can be important if a counting sort is used as one of the passes for a sort being performed on multiple keys (as in the radix sort). For the radix sort, stability is required for the sort to work at all but it incurs no additional cost. In the cases of partitioned sorts such as sample sort and fixed-topology sorts such as bitonic, stability is accomplished by appending the starting position of the data to the least significant bits of the key. This incurs a cost throughout the algorithm as the element being sorted may be considerably longer than the original key.
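The index-appending technique can be sketched as follows, using heapsort as a stand-in unstable sort; `stable_by_tag` is an illustrative helper of our own, not an algorithm from the literature.

```python
import heapq

def heap_sort(items):
    """Heapsort of comparable items; not stable on its own."""
    h = list(items)
    heapq.heapify(h)
    return [heapq.heappop(h) for _ in range(len(h))]

def stable_by_tag(unstable_sort, records, key):
    """Append each record's original position as the least significant
    part of its sort key; equal keys then keep their input order. The
    longer composite keys add cost throughout the sort."""
    tagged = [(key(r), i, r) for i, r in enumerate(records)]
    return [r for _, _, r in unstable_sort(tagged)]
```

Because the (key, position) pairs are all distinct, any correct sort of the tagged records produces the unique stable ordering of the originals, which is exactly the cost/benefit trade-off described above.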
It may or may not be important to have the sort run in exactly the same amount of time whenever it is called on the same data. The fixed-topology sorts, such as bitonic, are oblivious to the data or its initial allocation in memory and will always take the same amount of time. The counting-based sorts are, however, dependent on the communications network to perform random or irregular permutations. Because of this their running time could vary if the communications network is non-deterministic or if the initial allocation of the data to memory is changed causing different routing patterns which may or may not be more efficient for the given architecture. Partitioning algorithms such as sample sort suffer a similar fate and, in addition, their performance can also vary due to even slight variations in the random sampling of the splitting values. In this case poor sampling could result in an overload for a given bucket which would result in a longer local sort time or even an overflow where the algorithm might have to be restarted with a different random sample.
The maximum amount of additional memory used by a sort in any part of its run determines the maximal number of elements that can be sorted in memory. For external sorting, where the data does not fit within usable memory, it can also affect the run time of the algorithm. This is because the number of passes through the data and the number of random disk accesses performed by most external algorithms is based on the amount of data that can be sorted in memory. Thus it is conceivable that a slower yet more memory efficient sort might be the optimal choice as part of a large external sorting implementation.
For distributed memory parallel computers the third constraint of our basic definition may not be strong enough. Neighboring elements in the sorted sequence may be ``near'' each other in terms of the memory hierarchy (i.e. they are in successive memory addresses in each processor's memory), but there may be many more sequence elements in one processor's memory than in another's. For some uses of internal sorting this will be satisfactory. In other cases this will produce an imbalance in the amount of processing required of different processor nodes which will result in an overall increased cost. In such cases the data can be easily balanced with an additional enumeration and permutation. In cases such as an external sort where all the data is subsequently permuted and written to disk or in the implementation of a send-with-add permutation where memory collisions are summed, rebalancing the sequence is unnecessary. Counting based sorts and fixed-topology sorts do not result in unbalanced data allocations but partition based sorts such as bucket sort  and sample sort usually will.
Sorts are often used to implement ranks. Normally a rank is preferable to a sort when each key is only a piece of a much larger data element that must be permuted in memory based on the key. Since sorts are often extended in this way, it is useful to consider the increase in system complexity and the performance cost that would be incurred by a rank. A rank can be implemented from a sort by appending a return address to each key; when the keys are in sorted order, the enumeration of their positions is returned to this address as the rank. Since stability is often achieved by appending the initial position of the data element to the key, this tag can then also be used as the return address of the rank. For radix sort and for balanced sample sort this does not even incur the cost of an additional permutation. For fixed-topology sorts such as bitonic, which are not stable, appending the initial position to the key can significantly increase the total cost of the algorithm.
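A serial sketch of a rank built from a sort, in which each sorted position is sent back to the key's original address:

```python
def rank_from_sort(keys):
    """Rank via sort: order the key indices, then return each sorted
    position to the key's original location."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    rank = [0] * len(keys)
    for pos, origin in enumerate(order):
        rank[origin] = pos   # origin plays the role of the 'return address'
    return rank
```

For example, `rank_from_sort([30, 10, 20])` yields `[2, 0, 1]`: the rank array says where each element belongs, so the larger data elements can then be permuted in a single pass.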
Though this paper is directed toward current supercomputer applications, which typically involve sorting floating point numbers or large integers, there are significant applications in sorting of databases where the keys are made up of complex combinations of smaller fields. Any sort that can sort integers can be used to sort floating point values by converting each float into an integer representation that achieves the same sorted order and then translating back to the floating point format. For real world databases such a conversion to a comparable integer format may be difficult. An example might be sorting a payroll database where employees are sorted in descending order by salary and, within groups of employees with the same salary, by last name and then first name in alphabetic order. With comparison-based sorting such as bitonic, and some implementations of sample sort, this is made relatively easy by defining a comparison function. It is unlikely that a counting sort such as radix would be universally helpful in this case, as the conversion of the complex keys to integers might be difficult and the resulting keys may be rather long.
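The float-to-integer conversion can be sketched for IEEE-754 doubles using a standard bit trick (flip all bits of negative values, set the sign bit of non-negative ones); NaNs are not handled by this sketch.

```python
import struct

def float_key(x):
    """Map an IEEE-754 double to a 64-bit unsigned integer whose
    unsigned ordering matches the floating-point ordering."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    if bits >> 63:                      # negative: flip all 64 bits
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)            # non-negative: set the top bit
```

After this mapping, any integer sort (radix included) orders the keys exactly as a floating-point comparison would, and the transformation is invertible so the original values can be recovered.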
The implementation or lack of implementation of any of the above descriptors may have an effect on the performance of the sort, but in general these are constant multiplicative or additive factors. There are, however, factors that quantitatively affect sorting performance to different degrees depending on their magnitude. Once these are defined we can look at the performance of a sort along any of these axes and determine how well it will perform for the particular application. We have noted four particular dimensions along which sorting performance should be measured:

1. The number of keys.
2. The size of the keys.
3. The distribution of the key values.
4. The initial allocation of the data to memory.
It would be nice if within this four-dimensional space we could give performance figures for various implementations of sorting algorithms. It would also be desirable for there to be a standard measure of performance for sorting. The MSOPS (Millions of Sorting Operations per Second) measure has been used in  and , but unlike the analogous MFLOPS measure it is highly dependent on many different factors and must be used only within the context of the full description of the sort and the four performance dimensions. To see this, consider that for a distributed memory parallel supercomputer the highest MSOPS ratings would likely be achieved with a rather small number of data elements that would all fit within the local memory of a single processor. Such an MSOPS rating would not reflect the performance for larger numbers of elements where the elements spanned several processors' memories and thus required interprocessor communication. Given these caveats, MSOPS is nonetheless a useful measure when taken in context and we will use it here.
The number of keys is often the single most important factor determining the performance of a sorting algorithm. For a given architecture, some algorithms, such as bitonic sort, perform well for a small number of keys. Other algorithms, such as sample sort, pay a high initial cost but become progressively more efficient as the number of keys increases .
All sorting algorithms are dependent on the size of their keys, since key size multiplied by the number of keys reflects the total amount of data that must be permuted and counted or compared. This is true for bitonic sort, sample sort and radix sort. In the case of radix sort, however, the key size also affects the number of permutations that must be performed and the size of the histogram used in counting.
Certain algorithms such as unbalanced radix sort  and bucket sort  behave poorly when the distribution of data values is non-uniform. In the worst case, where all values are identical, the entire data set will be allocated to and sorted by a single processor. Fixed topology sorts such as bitonic sort are unaffected by data distribution. Other algorithms can exploit non-uniformity in the data. If, for example, the data distribution is sparse (e.g. for 1 million elements there are only 100 different key values for a 64 bit key) it may be advantageous to use a hash table for element counting in the radix sort rather than a histogram. Additionally there may be cases where bits of the key are uniform across all elements and these can be noted and ignored in a radix implementation. To generally categorize the amount of skew of the dataset values Thearling  introduced an entropy measure that will be used here also.
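As an illustration of such a skew measure, the following computes a plain Shannon entropy over key values; Thearling's measure may differ in detail, so this sketch simply quantifies how concentrated the distribution is.

```python
import math
from collections import Counter

def key_entropy(keys):
    """Shannon entropy (in bits) of the key-value distribution:
    identical keys give 0, and a uniform spread over 2^b distinct
    values gives b. Lower entropy means more skew."""
    n = len(keys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(keys).values())
```

A dataset of one million 64-bit keys drawn from only 100 distinct values scores at most about 6.6 bits, far below the 64 bits of a fully uniform distribution, flagging it as a candidate for the hash-table counting optimization mentioned above.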
The initial data allocation can also have a significant impact on performance. It is difficult to characterize all possible allocations that might incur performance penalties, as this is highly dependent on the machine architecture. However, there are several allocation patterns that are common, such as initially presorted and reverse-sorted data. These cases can have a significant impact on the performance of the sort. For example, sample sort can take advantage of presorted data by avoiding most interprocessor communication, while fixed-topology sorts such as bitonic sort permute the data between processors identically for any initial data allocation.
Having somewhat formalized the description of sorting problems and the dimensions on which performance should be measured, we now specify some test cases that will exercise sorting implementations along these dimensions. A thorough exploration of these dimensions in combination is not possible in a reasonable amount of time. Instead, several cases have been chosen that reflect real-world problems in both the scientific and commercial communities. To keep the test cases to a reasonable number, only one parameter is varied at a time. This is not the ideal way to collect performance data, but it should be sufficient to allow potential users to evaluate the strengths and weaknesses of each sorting implementation.
In order to measure the relative changes in performance as each dimension is varied, a base test case is defined consisting of approximately one hundred million elements (2^27), 64-bit keys, random values, and randomly allocated distribution. One hundred million elements may seem high for the base case, but it is still one order of magnitude less than the billion-element sorts reported in  and  and should fall squarely in the middle of interesting results in the near future. (It is interesting to note that the size of our base case for internal sorting is eight times the size of what was the standard benchmark for external sorting in 1985 .) The base case key length is 64 bits, as this corresponds to the common case of sorting double precision floating point numbers.
The number of keys is varied from to in multiples of 8. This range is broad enough to exercise any current supercomputer but may have to be expanded in the future.
Though single and double precision floating point numbers (32 and 64 bits) are perhaps of most interest in scientific sorting, smaller keys are also of interest, for example when sorting pointers or in some physical applications . With the advent of the commercial use of supercomputers, sorting will also be applied to problems where key sizes are very large. When sorting character strings, for example, keys of several textual words, of some five bytes each, will not be uncommon. Thus the key sizes are varied from 8 to 256 bits in logarithmic steps.
The key value distribution is varied according to the 6 entropy values presented in . It should be noted that there are two possible interpretations of the word ``distribution.'' The first refers to the probability distribution of the values of the keys (e.g. Are low-valued keys more common than high-valued keys?). The second interpretation refers to the way in which the keys are physically placed initially in the memory (e.g. Are the keys already in sorted order? Are they in reverse sorted order?). This section refers to the first of these two interpretations.
For N 32-bit keys, there are (2^32 + N - 1) choose (2^32 - 1) possible key distributions . If there are one billion keys, this number is approximately 10^1166738659. Obviously it would be impossible to characterize the sorting performance over any but a very small subset of these possibilities.
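The magnitude of this count can be checked without big-integer arithmetic. The sketch below (ours, not part of the report; the function name is illustrative) uses the log-gamma function to compute the number of decimal digits in C(2^32 + N - 1, N), confirming that for around a billion keys (taken here as N = 2^30) the count has on the order of a billion decimal digits.

```python
import math

def log10_multiset_count(m, n):
    """Decimal magnitude (log10) of C(m + n - 1, n): the number of
    multisets of n keys drawn from m possible key values.  Using
    lgamma avoids materializing the astronomically large integer."""
    ln_c = math.lgamma(m + n) - math.lgamma(n + 1) - math.lgamma(m)
    return ln_c / math.log(10)

# Sanity check on a tiny case: 2 values, 2 keys -> C(3, 2) = 3 multisets.
assert round(10 ** log10_multiset_count(2, 2)) == 3

# Number of decimal digits in the count for ~one billion 32-bit keys.
digits = log10_multiset_count(2 ** 32, 2 ** 30)
```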
One technique which has often been used to characterize the distribution of data is entropy measurement. The Shannon entropy  of a distribution is defined as

H = - sum_i p_i log(p_i)

where p_i is the probability of symbol i occurring in the distribution. If the logarithm is base 2, the entropy of the key distribution specifies the number of unique bits in the key. For example, if every key had the same value (say 927), the entropy of the key distribution would be 0 bits. On the other hand, if every possible 32-bit key were represented the same number of times (i.e., a uniform distribution), the entropy of the keys would be 32 bits. In between these two extremes are entropies of intermediate values.
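The entropy of an empirical key distribution can be measured directly from the key values. A minimal sketch (ours; the function name is illustrative):

```python
import math
from collections import Counter

def key_entropy(keys):
    """Shannon entropy, in bits, of the empirical distribution
    of key values (base-2 logarithm)."""
    n = len(keys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(keys).values())

# Every key identical -> 0 bits of entropy.
assert key_entropy([927] * 1000) == 0.0
# Uniform over 256 distinct values -> exactly 8 bits.
assert abs(key_entropy(list(range(256))) - 8.0) < 1e-12
```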
In many real-world databases there will be fewer bits of entropy in a distribution than bits in the data structure representing the key. Customer account numbers are a good example of this. Often not all possible account numbers are used, or it may be the case that certain prefix digits are used to organize the data. For example, a leading-order digit of 1 in an account number might specify commercial customers while a leading-order digit of 2 might specify individuals, with no other leading-order digits allowed. Assuming an eight-bit character representation of the digits, the 8 bits in the character are used to represent a quantity with only 1 bit of entropy.
The goal of this work is to evaluate sorting algorithms as the entropy of the key data is varied. To evaluate an algorithm, it is necessary to either measure the entropy of a test set or generate a test set with a specified entropy. We have chosen to generate key data which spans a range of entropy values. To accomplish this, there are many possible algorithms. One technique would be to simply take a uniform set of keys with 32 bits of entropy and zero out the leading order N bits. This would generate keys with (32 - N) bits of entropy. A more interesting technique would produce keys whose individual bits are between 0 and 1 bit of entropy. We propose one such technique here. Unquestionably there are other techniques which could also perform this task. We believe the technique we have developed is general enough to serve the desired purpose.
The basic idea is to combine multiple keys having a uniform distribution into a single key whose distribution is non-uniform. The combination operation we are proposing is the binary AND. For example, take two 32-bit keys generated using a uniform distribution (we will assume that the individual bits are independent and that the two keys are independent). In this case, each bit of the keys will have a .50/.50 chance of being either a zero or a one. If we AND these two keys together, each bit will now be three times as likely to be a zero as a one (.75/.25). This produces an entropy of .811 bits per binary digit, for a total of 25.95 bits for the entire key (out of a possible 32 bits). If we repeat this process using additional uniform keys, the entropies of the key distributions continue to decrease: the difference between successive ANDings is roughly twenty percent of the total for each of the first five ANDings. The exact percentages (of 1 bit of entropy per binary digit) are as follows: 1 ANDing - 100%, 2 ANDings - 81%, 3 ANDings - 54%, 4 ANDings - 34%, and 5 ANDings - 20%. Driving the entropy all the way to zero in this manner would take an unbounded number of additional ANDings; in practice this can be accomplished by simply setting all of the keys to the same value.
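The ANDing scheme is straightforward to implement. The sketch below (ours; function names are illustrative) generates keys by k-fold ANDing of uniform keys and computes the resulting per-bit entropy, where a single bit is a one with probability 2^-k.

```python
import math
import random

def anded_keys(n_keys, n_ands, bits=32, seed=1):
    """Each key is the AND of `n_ands` independent uniform keys, so
    each bit is a one with probability 2**-n_ands (n_ands = 1 leaves
    the distribution uniform)."""
    rng = random.Random(seed)
    keys = []
    for _ in range(n_keys):
        k = (1 << bits) - 1          # all ones: identity for AND
        for _ in range(n_ands):
            k &= rng.getrandbits(bits)
        keys.append(k)
    return keys

def per_bit_entropy(n_ands):
    """Entropy, in bits, of a single key bit after n_ands-fold ANDing."""
    p = 0.5 ** n_ands                # probability the bit is a one
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Per-bit entropy after k = 1..5 ANDings, as a percent of one bit.
percentages = [round(100 * per_bit_entropy(k)) for k in range(1, 6)]
```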
Though this entropy measure captures much of what we would like to notice about non-uniform distributions, there are perhaps two common distributions that could be exploited by many algorithms but cannot easily be constructed in this way. The first is what we will call a ``sparse distribution,'' where only 8 of the 64 bits of the key are allowed to differ from zero. Thus only 256 different values will actually occur in the distribution, though there are 2^64 possibilities. This distribution can be constructed by generating random 8-bit numbers from 0 through 255 and then inserting seven zeros between consecutive bits to construct a full 64-bit key. This distribution can be exploited by certain radix algorithms that check for variance in each bit of the key before counting, or that replace the histogram in the counting step with a hash table.
A further variation, almost the same as the sparse distribution but not exploitable by radix as above, is one in which 99% of the values are sparse as above but the remaining 1% are random. This distribution can be generated by drawing a random number between 0.0 and 1.0: wherever this value is 0.99 or less, the key is generated via the sparse distribution above; otherwise it is drawn uniformly at random from 0 to 2^64 - 1.
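Both the sparse and the 99%/1% mixed distributions are easy to generate. A sketch (ours; function names are illustrative):

```python
import random

def sparse_key(x):
    """Spread the 8 bits of x (0..255) to bit positions 0, 8, ..., 56
    of a 64-bit key, inserting seven zero bits between consecutive
    bits of x."""
    k = 0
    for i in range(8):
        k |= ((x >> i) & 1) << (8 * i)
    return k

def mixed_keys(n, p_sparse=0.99, seed=1):
    """p_sparse of the keys are sparse (at most 256 distinct values);
    the remainder are uniform 64-bit values."""
    rng = random.Random(seed)
    return [sparse_key(rng.randrange(256)) if rng.random() <= p_sparse
            else rng.getrandbits(64)
            for _ in range(n)]

# Only 256 distinct sparse keys exist out of 2**64 possible values.
assert len({sparse_key(x) for x in range(256)}) == 256
assert sparse_key(0xFF) == 0x0101010101010101
```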
As stated previously, key allocation in memory both before and after the sort is difficult to define since it is architecture dependent. Having decided that ``sorted order'' corresponds to block order, it is much more difficult to say which layouts of the original data will or will not be difficult for the given architecture to permute. However, there are four cases that are relatively common and should be tried. They include the combination of block layout and cyclic layout for both presorted and reverse sorted data.
Table 1 summarizes the test cases described in the previous subsections. The four parameters to be varied are listed across the top, and the base case is listed in the first row. Below this row there is a matrix of blocks where the off-diagonal blocks simply specify base case parameters and the diagonal blocks specify the parameter variation.
Table: Test cases for supercomputer sorting benchmark.
Sorting is important in both scientific and commercial applications of supercomputers and should be a standard benchmark. In specifying such a benchmark, variation in the size and distribution of sorted data is necessary to accurately measure how an algorithm/architecture pair performs. The ability of a supercomputer to manipulate large amounts of data will determine the future of supercomputers in non-numeric processing areas. Sorting is a prototypical data movement task which measures a number of important system performance characteristics, including communication bandwidth. To this end we have proposed a formal set of rules which can be used to define a rigorous sorting benchmark. This benchmark requires that large amounts of data be sorted (at least 100 million keys) while varying a number of other parameters (key size, data distribution).
These benchmarks can be applied to any of the currently existing supercomputer systems, from single processor vector systems to massively parallel processing systems. With future advances in memory technology, supercomputer systems will be able to process much larger amounts of data. The sorting benchmark will scale with increasing memory capacity to capture the non-numerical performance of future supercomputers.
The authors would like to thank Steve Heller and Mark Bromley for valuable discussions during the development of this paper.