Electronic submission for the 33rd Asilomar Conference on Signals, Systems, and Computers mailto:submitAS@ece.nps.navy.mil 1. Paper title Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques 2. Keywords 7. DSP hardware & software 8. Radar & sonar 5. Sensor array processing & data fusion 3. Authors and affiliations Mr. Gregory E. Allen Applied Research Laboratories The University of Texas at Austin P.O. Box 8029 Austin, TX 78713-8029 gallen AT arlut DOT utexas DOT edu Prof. Brian L. Evans Dept. of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712-1084 bevans AT ece DOT utexas DOT edu Prof. Lizy K. John Dept. of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712-1084 ljohn AT ece DOT utexas DOT edu 4,5. Point of contact (POC) Gregory E. Allen Applied Research Laboratories The University of Texas at Austin P.O. Box 8029 Austin, TX 78713-8029 (512) 835-3487 (voice) (512) 835-3259 (fax) gallen AT arlut DOT utexas DOT edu https://gregoryeallen.github.io/ http://www.ece.utexas.edu/~allen/ 6. Paper abstract We evaluate the use of native signal processing with loop unrolling and software prefetching to achieve high-performance digital signal processing on general-purpose processors. We apply these techniques to minimize the number of processors necessary for real-time implementation of a 3-D sonar beamformer. Because our beamforming kernels operate on high-throughput (~100 MB/s) input/output streams, memory latency hiding techniques are key for maximum performance. On the Sun UltraSPARC-II processor, we find speedups of 2.4 for hand loop unrolling, 1.46 for the Visual Instruction Set over floating-point arithmetic in C, and 1.33 for software prefetching. 7. Extended summary Modern general-purpose processors support high-performance digital signal processing on workstations, e.g. through the use of native signal processing (NSP) [1-3]. Single-cycle multiply-accumulates (MACs), which have traditionally been available only on digital signal processors (DSPs), are now commonly available in general-purpose processors. Several manufacturers have also added single-instruction multiple-data extensions to enhance performance in multimedia applications, e.g. the Visual Instruction Set (VIS) in the Sun Microsystems UltraSPARC processor [2,4]. Sonar beamforming and synthetic aperture radar imaging systems are high-throughput applications that require billions of MACs/s and 100 MB/s of input/output data. For such high-throughput applications, real-time implementations were initially in custom hardware and more recently in commercial off-the-shelf (COTS) components [5]. NSP makes a real-time implementation on commodity multiprocessor workstations possible at a fraction of the development and manufacturing costs of custom hardware and COTS solutions [6]. To demonstrate the use of NSP for real-time implementations, we evaluate two key time-domain kernels for 3-D sonar beamforming: a horizontal digital interpolation beamformer [7] and a multi-fan vertical beamformer. The vertical kernel accepts 16-bit data arriving at 100 MB/s, and performs integer-to-floating-point conversion and multiple dot products. We compare floating-point and VIS realizations. The horizontal kernel accepts 32-bit single-precision floating-point data, and performs index lookup and digital interpolation. We implement it in single-precision floating-point arithmetic. These kernels represent the computational bottlenecks in a 3-D sonar system. VIS is specifically optimized for graphics and image processing, but should also allow substantial performance for fixed-point 1-D signal processing. VIS treats a 64-bit register as 2, 4, or 8 partitioned data words, and performs operations on multiple words with a single instruction. For this paper, we require the highest precision (and slowest) of the VIS modes for the vertical beamformer -- signed 16-bit by 16-bit multiplies with a signed 32-bit accumulator. The peak performance for this mode is one MAC operation every two clock cycles. Although this is half of the peak floating-point performance, it is much faster than converting the sensor data to floating-point and performing the floating-point operations. Even with processor NSP performance advances, access to high-latency main memory can cause the processor to stall. The stream-oriented nature of real-time signal processing algorithms amplifies this problem because of the high data throughput to memory. Memory latency hiding techniques [8] such as loop unrolling, software pipelining, and software prefetching [9-11] can be used to help alleviate this bottleneck. Because these beamforming kernels involve large, continuous streams of data, memory latency hiding techniques are necessary for maximum performance. We use Sun's SPARCompiler5.0 Developer Release to compile the kernels. We also utilize several performance evaluation tools from Sun to provide useful insights into kernel performance. These tools include Shade instruction simulators, INCAS (It's a Near Cycle Accurate Simulator), and the UltraSPARC performance instrumentation counters. We explore several different kernel implementations in the design space and compare performance results on a 336 MHz UltraSPARC-II processor. In the horizontal beamformer, we find that hand-coded loop unrolling gives a speedup of 2.4. This kernel achieves 444.3 MFLOPS, which is 66% of peak. In the vertical beamformer, we find that VIS offers a 46% performance boost over a floating-point implementation, and that software prefetching gives an additional 34% performance gain by reducing both load and store stalls. For the vertical beamformer, we achieve 313.3 MIOPS, which is 93% of peak. Kernel | Performance | Data rates (MBytes/sec) -----------|--------------|--------------------------- Vertical | 313.3 MIOPS | 104.4 input, 62.7 output Horizontal | 444.3 MFLOPS | 11.65 input, 8.89 output We find that horizontal beamforming kernel is computationally limited whereas the vertical beamforming kernel is input/output limited. We find that loop unrolling as a means for memory latency hiding and increased instruction-level parallelism provides excellent results in both kernels, and that the use of VIS and software prefetching can result in excellent performance gains. However, prefetch instructions are not automatically generated, and low-level programming with VIS or prefetch instructions can be difficult and time consuming. The specific contributions of this paper follow: - An evaluation of native signal processing for real-time implementation of a high-throughput application - A demonstration of software prefetching with high-throughput data (not a simulation) - A case study that shows achievable near-peak performance - A comparison of fixed-point native signal processing vs. floating-point arithmetic in a general-purpose processor - An evaluation of current compiler and performance evaluation tools - Freely distributable real-time beamforming kernels [1] P. Lapsley, "NSP Shows Promise on Pentium, PowerPC," MicroProcessor Report, 1995. [2] W. Chen, H. J. Reekie, S. Bhave, and E. A. Lee, "Native Signal Processing on the UltraSparc in the Ptolemy Environment," Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, Nov. 1-4, 1996, vol. 2, pp. 1368-1372. [3] R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, "Evaluating MMX Technology Using DSP and Multimedia Applications," Proc. IEEE Int. Sym. on Microarchitecture, Nov. 30-Dec. 2, 1998, Dallas, TX, pp. 37-46. [4] Visual Instruction Set User's Guide, Sun Microsystems, 1995. [5] K. Watkins, "Optimal Architectures for Massively Parallel Implementation of Hard Real-Time Beamformers", Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, Nov. 1-4, 1998. [6] G. Allen, B. L. Evans, and D. C. Schanbacher, "Real-Time Sonar Beamforming on a Unix Workstation Using Process Networks and POSIX Pthreads", Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, Nov. 1-4, 1998. [7] R. G. Pridham and R. A. Mucci, "A Novel Approach to Digital Beamforming," Journal Acoustical Society of America, vol. 63, no. 2, pp. 425-434, Feb. 1978. [8] L. John, V. Reddy, P. Hulina, and L. Coraor, "A Comparative Evaluation of Software Techniques to Hide Memory Latencies," Proc. IEEE Hawaii Int. Conf. on System Sciences, vol. I, pp. 229-238, Jan. 1995. [9] D. Callahan, K. Kennedy, and A. Porterfield, "Software Prefetching," Proc. ACM Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 40-52, Apr. 1991. [10] K. Bala, M. F. Kaashoek, and W. E. Weihl, "Software Prefetching and and Caching for Translation Lookaside Buffers," USENIX Sym. on Operating Systems Design and Implementation, Nov. 1994, pp. 243-253. [11] F. J. Sanchez and A. Gonzalez, "Software Prefetching for Software Pipelined Loops," Proc. Hawaii Int. Conf on. System Sciences, vol. 7, pp. 778-9, Jan. 1998.