Published in 1993 Lectures in Complex Systems, edited by L. Nadel and D. Stein, Addison-Wesley, 1995
by Kurt Thearling
With the recent development of massively parallel computing, extremely large amounts of processing power and memory capacity are available for the analysis of complex data sets. At the same time, the complexity and size of these data sets has been increasing. Both of these trends are expected to continue for the foreseeable future. This paper will provide a general overview of massively parallel architectures and algorithms for the analysis of time series data. Two distinct approaches to this problem, computational and memory-based, will be described.
The last decade has seen a revolution in large scale computation. The massively parallel processing (MPP) paradigm, originally seen as an outsider in the supercomputer race, is widely recognized as the technology of the future. In this paper we will discuss a number of approaches to time series data analysis using massively parallel computers. We will first review some of the current levels and trends in MPP technology and then discuss how this technology can be leveraged for data analysis. Two disparate approaches to data analysis will be investigated, memory-based and computational algorithms. We will conclude with some predictions for the (near) future for parallel computing.
In a massively parallel processing system, current levels of technology allow for
At this time, the largest installed MPP system in the world is a 1024 processor Connection Machine CM-5 (the system can scale to 16,384 processors). Each processor has a peak performance of 128 MFLOPS and has at least 32MB of memory. This equates to a peak performance for the machine of 128 GFLOPS and a total memory of 32 gigabytes. The total global communication bandwidth is 5 GB/sec.
One key to the advance of MPP technology is its scalability. If an application needs more MIPS or megabytes, additional processors can be added help solve the problem. If the system is designed intelligently, the overall performance of the system (global communication bandwidth, MIPS, MFLOPS, etc.) will scale up linearly with the system size. It should be noted, though, that the degree to which performance can be extracted from a MPP system is very algorithm dependent.
Undoubtedly the level of computing power available in a large MPP system will increase dramatically over time. Processor speeds and memory sizes are doubling approximately every eighteen months and this increase will be quickly adopted by MPP manufacturers. This means that the age of a Teraflop/Terabyte computer is not far off. Extremely large amounts of data will be able to be analyzed using this amount of processing power. This has changed the way that data analysis is carried out. The fact that large amounts of data are available has created the situation in which pure number crunching has, to some degree, given way to ``memory-based'' algorithms. This will be further discussed later in this paper.
One major development that is making MPP more usable is in the area of programming languages and tools. New programming languages such as Fortran 90 and C* have made the task of actually programming these machines much easier (actually they are extensions to existing programming languages, with the parallel architecture taken into consideration). In addition, tools are now available to help programmers design and debug their software. This is probably one of the most important advances, since traditionally the most difficult aspect of extracting performance from an MPP machine has been debugging the software.
There are two major types of parallel computers. The Single-Instruction-Multiple-Data (SIMD) architecture is characterized by the fact that each processor executes the same instruction simultaneously. Examples of this type of architecture are the Connection Machine CM-2 and the Maspar MP-1. In Multiple-Instruction-Multiple-Data (MIMD) computers, each processor operates autonomously and executes instructions independently of the other processors. Examples of this type of architecture are the Connection Machine CM-5, Intel Paragon, and Fujitsu VPP500.
A programming form related to SIMD is the data parallel programming paradigm. In the data parallel programming model, all data is operated on in parallel using the same process. But unlike SIMD, data parallel programming allows different instructions to be issued by different processors simultaneously. Data parallel code can be run on both SIMD and MIMD machines, with improved performance possible on MIMD machines. The major reason to create a data parallel algorithm is the fact that they are simpler to understand and program. Experience over the past decade has shown that many tasks are amenable to the data parallel programming model.
Besides the instruction execution style (SIMD/MIMD), parallel computers are defined by the type of memory system that they use. Shared memory machines are exactly what their name implies; the memory system is shared by all of the processors. The Cray supercomputers (but not their upcoming MPP) are examples of shared memory computers. Distributed memory architectures spread the memory out over the individual processors in the system. If one processor requires a piece of data located in another processor's memory, that data must be transmitted over an interconnection network that links processors together. The CM-5 and Intel Paragon are both distributed memory machines.
For distributed memory machines, the interconnection network between processors is a key component. The network is the medium over which memory from one processor is transmitted to another processor. When designing such a network, the goal is to maximize global communication bandwidth.
There are two extremes in the use of massively parallel computing power: memory-based algorithms and computation-based algorithms. Memory-based algorithms make use of the large storage capacity (both RAM and disk) of MPP systems but may use few of the FLOPS. Some of the more common processing tasks that correspond to this extreme are memory based reasoning, relational database operations, and text retrieval. Computation-based algorithms, on the other hand, make use of Gigaflop performance of parallel CPUs but may use little of the memory. Examples of this type of processing are numerical optimization, genetic algorithms, backpropagation (as well as some other neural network learning algorithms), and statistical model building. In the next two sections, we will discuss problem solving techniques spanning the range of these extremes.
Memory based reasoning (MBR)  is a form of K-Nearest Neighbor (KNN) classification technique . It differs from traditional KNN algorithms in a qualitative, not quantitative, way. Most successful previous applications of KNN made use of small (less than 10 megabytes) databases which were hand tuned to maximize accuracy. These applications were limited in the amount of data that could be used by their ability to quickly search for neighbors within the database. By applying the KNN approach to much larger databases (hundreds/thousands of megabytes), massively parallel computing transforms KNN into MBR. MBR applications rely on the ability to leverage the information contained in extremely large databases. Typical applications often involve hundreds of megabytes of data. In addition, data is often multidimensional, involving different types of information. There is little or no manual processing of the data (including the removal of incorrect training examples) before it is used.
A parallel nearest neighbor search is used to look at all training examples simultaneously. Distance metrics can be hand tuned to improve performance based on application specifics but simple measures can often produce very accurate results. Initial results can be quickly achieved since there are no models to create. Also, confidence levels can be generated using relative distances to matching and non-matching neighbors.
Some previously successful applications of MBR include:
Although these examples are not strictly time series problem, they do illustrate the potential for the analysis and prediction of very large amounts of data. One aspect of these problems that separates them from most time series analysis problems is the amount of data to be analyzed. The largest of the datasets in the SFI time series competition was approximately 600 kilobytes. Contrast this with data analysis problems which have previously been performed on the Connection Machine that involved hundreds of megabytes of data. Some applications are currently working with single databases on the order of tens of gigabytes, with the expectation that they will grow by a factor of ten in as little as two years. Even though the problems listed above may seem very different from time series forecasting, they actually involve similar techniques.
An example of an MBR approach from the area of time series analysis is the work of Farmer and Sidorowich [5, 6]. In their work, Farmer and Sidorowich attempted to predict the behavior of a time series generated by a chaotic system. Their training set consisted of a time series of up to 10,000 sampled points along the attractor. The time series was then transformed into a reconstructed state space using a delay space embedding [12, 18]. In the delay space embedding, each point in the state space is a vector X composed of time series values corresponding to a sequence of d delay lags: x1(t) = x(t), x2(t) = x(t-tau), ..., xd(t) = x(t - (d-1) tau). For a D dimensional attractor, d must be at least as large as D.
To forecast a time series value, they first transformed the value into the state space representation. The nearest k (> d) neighbors in the state space representation were then located. A local linear map was created for the k neighbors and applied to the value to be forecast. The result of the mapping was the predicted value. Although higher dimensional maps (quadratic, etc.) could be used, Farmer and Sidorowich did not find significant improvements over the linear map. Using this approach, they were able to forecast time series values for a number of systems (Mackey-Glass differential equation, Rayleigh-Benard convection, and Taylor-Couette flow) much more accurately than standard forecasting techniques (global linear autoregressive).
Another piece of related research is Atkeson's MBR approach to the approximation of continuous functions . In his work, Atkeson applied a locally weighted regression technique to the set of nearest neighbors to accurately predict the output of a continuous function.
As stated earlier, there have been several very large MBR applications implemented on a massively parallel computer. The following examples were both performed on a Connection Machine CM-2. Of particular importance is the large amount of data that was used to train the algorithms. It would have been very difficult (if not impossible) to make full use of such large data sets using a traditional computer architecture.
The first memory-based MPP algorithm involves optical character recognition . Optical character recognition is the problem of taking a bit-mapped array of pixels and correctly classifying it into an alphanumeric character category. For pixel arrays of size 128 by 128, the problem has 16384 dimensions. Smith and his colleagues  used 300,000 training examples to provide a knowledge base for an MBR system. This corresponds to 614 megabytes of data. Using very simple Hamming distance metrics, classifying an unknown character could be performed with an average accuracy of 99%. The technique also allowed for the introduction of concept of confidence by allowing the system to refuse to classify unknown characters whose nearest neighbors fell below a threshold distance. When the confidence measure was introduced, the system achieved 99.9% accuracy at 93% coverage (i.e., the system was not able to confidently classify 7% of the data).
Another application of MBR to large databases involved the classification of Census Bureau data . In this case the problem involved the classification of free-text responses to questions about a respondent's occupation. These responses needed to be classified by occupation category (504 types) and industry category (232 types). Approximately 130,000 training examples were used, corresponding to 15 megabytes of data. When compared to an existing expert system used by the census bureau, the MBR approach achieved a 57% improvement for the occupation classification. The MBR system also achieved a 10% improvement for the industry classification over the expert system.
As we have stated, it is necessary in MBR systems to locate the k-nearest neighbors for a point in the state space. A number of distance metrics can be used, including Hamming, Euclidean, and a host of others. For serial computers, a K-D Tree representation  can effectively reduce search complexity for the nearest neighbors when there is structure in data. But when there is little or unknown structure in data, searching all data elements in parallel may be the most effective solution .
In experiments on financial data (daily S&P 500 closing prices), we have compared several K-D tree algorithms with simple parallel search examining portions of all data points. The experimental data contained 6067 points (over 20 years worth of data) embedded into a five dimensional delay space. The distance metric was simply Euclidean distance. The total number of operations required to evaluate the distance between two points in the state space is n squaring operations, n subtraction operations, and n-1 addition operations. The square root operation is unnecessary since the ordering of the distances is the same as the ordering of the distance squared.
When the K-D tree algorithm attempted to locate the five nearest neighbors for a test point (which was another point from the time series that was removed from the training set), an average of 99.6% of the training set data needed to be examined. In a refinement of the K-D tree search algorithm, the search began in the leaf cell that the test data point mapped to. It was hoped that this would help by initializing the set of nearest neighbors to a good set of candidates and thereby allow the traversal of the tree be pruned subsequently in the search. This technique did improve the performance of the search but the improvement was not signification (an average of 99.5% of the training data was examined).
Finally, the K-D tree search was replaced by a much simpler technique. In this approach, every piece of data was examined to see if it was one of the nearest neighbors. But instead of computing the entire distance from the test point, the distance was computed incrementally. The (square of the) distance corresponding to each dimension was added until all of the dimensions were included. If at any time the partial distance was greater than the the (square of) the furthest of the current set of K-nearest neighbors, that data point was discarded. So, although each of the data points in the state space was evaluated, only a fraction of the entire evaluation (n squaring operations, n subtraction operations, and n-1 addition operations) was performed. In experiments on the same S&P 500 training data, this technique performed only 45.6% of the possible operations (squaring, subtraction, and addition) necessary to locate neighbors from the training data.
This incremental search technique can be efficiently implemented in parallel using a local nearest neighbor heap for each processor and updating the entries incrementally. After local neighbors are located, a global sort is used to find global neighbors. In addition to the overall efficiency of this approach, the use of multiple processors can result in significant increases in the search performance.
Genetic algorithms are an attempt to model the search for a problem solution as an evolutionary activity [7, 8]. Candidate solutions to a problem are encoded in a bit-string (e.g., phase space coordinates for a delay embedding). A population of these candidates (genomes) are individually evaluated to determine how well they solve the problem. Survival to the next generation is based on a candidate's ``fitness.'' Once survivors are found, mates are chosen for each genome. Genetic operations of crossover and mutation are then used to breed new solutions for next generation given solutions from current generation. Crossover is performed by choosing two points along the genome pairs and swapping the values in between. Mutation (at a user specified probability) is then considered for each bit in the resulting genomes. Fitness of new generation of genomes is then evaluated and entire process is repeated.
Packard  has previously used genetic algorithms to build predictive models for complex data sets (including, but not limited to, time series data). The goal of the evolutionary process is to locate predictable points within a state space of complex multidimensional data. A genome specifies points within the state space and the fitness measures the distribution of nearest neighbor data points. A distribution of neighbors corresponding to a tight cluster receives a high fitness while a distribution that is spread out receives a low fitness. Speciation techniques are used to keep the population of genomes from converging to a global optimum.
Each generation produces a set of data points/distributions along with their associated fitnesses. This locations of set of distributions with high fitness values correspond to locally predictable sections of the phase space. By continuing the evolutionary process, a set of high fitness distributions are produced. This set is then be used to generate predictive rules to model the system under investigation. Sections of the phase space that are not represented in the set of predictive rules are simply unpredictable (or they are sufficiently less predicable than the rules that evolved) and thus are not include in the model.
In a parallel implementation, each processor is assigned a genome to evaluate. If there are more genomes than processors, the genomes are evenly distributed over the set of processors. Initially the genomes are set to random bit patterns. Each processor determines the fitness of its genomes by evaluating each genome's ability to solve the problem at hand. If there is more than one genome per processor, each processor evaluated the genomes serially. The performance of fitness evaluation is very dependent upon application but linear speedup possible for parallel evaluation (this assumes that evaluation cost is constant regardless of genome specification).
Global communication is then used to rank the genomes based on fitness and then select them for survival into the next generation. After global selection, survivors are then replicated in proportion to (a function of) their fitness values. Next, each survivor chooses a mate at random Once mating is complete, new genomes are created using (two point) crossover and mutation. The cost (in time) of each generation dominated by evaluation; evaluation is usually much more expensive than breeding a new generation.
A genetic algorithm system has been implemented on the Connection Machine CM-5 . The user provides the representation of a candidate solution as a bit string (genome) and generates an evaluation function for the genome. This system then provides the basic evolutionary operators (selection, mating, crossover, and mutation) which are interfaced to the user's application dependent evaluation function. The current performance for a CM-5 is approximately 5 sec/bit (per processor) for breeding a new generation of genomes. For example, assume a population of 1024 genomes of length 4096 bits on a 128 processor system. This corresponds to approximately 0.16 seconds to breed a new generation after the fitness evaluation has been done. In most cases the users specified fitness evaluation will take considerably longer than the time taken to breed a new set of genomes.
A variety of neural network approaches have been applied to the forecasting of nonlinear time series. The work of Lapedes and Farber  as well as several of the papers included in  have demonstrated the ability of neural network models to successfully predict time series data. The work of Zhang and Hutchinson, contained elsewhere in these proceedings , describes the application of MPP neural network algorithms to time series analysis.
In this section we will review some of the work done in mapping neural network algorithms to massively parallel computers. For a more thorough description, see  and  and the references contained therein.
Neural network architectures are characterized by a collection of processing nodes connected by a series of weighted links. The relationship between the individual node's inputs and outputs is typically a nonlinear function (such as a sigmoid). By carefully choosing the weights for the links, a neural network can carry out complex mappings from global inputs to global outputs. The complicated issue in carrying this process out is in computing the interconnection weights. Algorithms such as backpropagation  are often used to perform this task.
To map a neural network architecture to a massively parallel processor, the first approach that comes to mind is simply map each node in the network to a processor. The connections between nodes map to messages between processors over the computers interconnection network. When the network is learning its connection weights, the inputs are passed forward through the network to the outputs. The observed outputs are then compared with the expected output and the differences are propagated backward from the outputs to the inputs. During the backpropagation phase, the weights are adjusted to minimize the error at the outputs.
Once the connection weights have been learned, the network is then run only in the forward direction. The performance of this process is measured in CPS (connections per second). Estimates for the performance of the forward direction of this naive implementation are 13 million CPS (for a 64K processor CM-2 ). This compares to a speed of on the order of 250 thousand CPS for a typical workstation. The performance of the learning algorithm will be twenty-five to fifty percent of the forward direction performance due to the fact that both a forward and backward pass are required during learning.
A more sophisticated approach to mapping a neural network architecture involves taking into consideration specific aspects of the MPP architecture. Zhang, et al.  used knowledge of the communication and computation details for the Connection Machine CM-2 when they designed their neural network algorithm. Instead of mapping a single network node to a processor, Zhang and his colleagues carefully mapped multiple network nodes to the same processor. This allowed for reduced communication cost and efficient computation of the network node outputs. In addition, they also performed learning in parallel by replicating entire copies of the neural network over the MPP processors. The performance that they achieved (on a 64K processor CM-2) was approximately 80 million CPS and 40 million WUPS (weight updates per second) during learning.
Finally, the fastest performance achieved to date for the implementation of a neural network on a MPP is the work of Farber . In that implementation, each processor contained a represented a training example for the neural network. The network was broadcast to each of the processors and training was done in parallel. Each of the individual backpropagation results were combined globally after each training phase. Since there is little communication involved, this algorithm can achieve very high performance: 325 million WUPS and 1.3 billion CPS (on a 64K processor CM-2). The only real disadvantage of this algorithm is that it needs an extremely large number of training examples to be efficient (at least as many examples as processors).
The integration of computational and memory-based problem solving techniques can sometimes be more successful than either of the techniques in individually. An example of an application of this type of approach is in the area of protein secondary structure prediction . In this work, a combination of neural network, MBR, and statistical approaches was used. The inputs to one neural network are the output of an MBR system and the output of another neural network. The accuracy of this hybrid system was better than accuracy of any individual system and was the overall accuracy was better than any previously published algorithm
There are two major trends in computational data analysis. First, available computational processing power is increasing dramatically over time. Supercomputers will ride the wave of faster chips and denser silicon to make TeraFLOP computing a reality around the year 1995. Also following this trend will be memory capacity, which will hit the TeraByte level about the same time. To simplify the task of programming these machines, the software tools and programming languages will become much more important than the hardware issues.
Second, the task that data analysis systems will face will center on dealing with the massive amounts of data, orders of magnitude larger than currently exists. Terabyte sized databases will soon become a reality. This will require that the bulk of the processing power in use will be spent analyzing data rather than generating it. Data visualization (the graphical display of complex data in order to reveal qualitative features) will be very important. ``Database mining'' is a recently coined term often used to describe techniques which allow users to sift through terabytes trying to locate useful bytes of data. Such techniques will be necessary if there is to be any
If carefully applied, the increase in computational power should help us offset the complexity of the data will need to be analyzed.
Massively parallel processing power and memory capacity is rapidly approaching the TeraFLOP/TeraByte level. This will allow users to explore two extremes in problem solution space: memory-based and computational algorithms.
Memory-based algorithms (such as MBR), which leverage the knowledge contained in very large databases, are extremely simple to implement and provide useful results quickly. Although MBR techniques are relatively new, they hold great hope for the future. By using massively parallel processing, these large databases can be generate efficient solutions to difficult problems.
Computation dominated techniques such as genetic algorithms and neural networks are also amenable to the massively parallel programming paradigm. Extremely high performance has been achieved for parallel implementations of GA and neural network algorithms. This performance can be used to efficiently carry out analysis on large, complex databases.
By combining computational power with large amounts of information, the next generation of MPP supercomputers will soon allow us to analyze large databases that had previously been beyond our comprehension.
The author would like to thank David Waltz, Stephen Smith, Xiru Zhang, and Jim Hutchinson for conversations, ideas, and support.
Connection Machine is a registered trademark of Thinking Machines Corporation. CM-2, CM-5, and CM are trademarks of Thinking Machines Corporation. C* is a registered trademark of Thinking Machines Corporation. Paragon is a trademark of the Intel Corporation.