But an application that really pushes real-time filtering to its limits is interactive audio rendering for virtual acoustic reality.

Here, virtual scenes which consist of a multitude of sound sources are auralized in real-time. This is done by filtering the audio signals of the sound sources with individual impulse responses (filters) that model the sound propagation through the scene. The objective is a high-quality acoustic image of the scene. Room acoustic effects — like reverberation — need to be simulated precisely.

Users (listeners) shall be able to interact with the presented scene.

Consequently, the sound propagation changes over time and filters need to be exchanged. Furthermore, reactions on user input (e.g.

movement, rotation) must be reproduced instantaneously. Only very short processing delays are acceptable. Signal processing in this domain is a massive task: A large number of channels (typically 10-100) has to be filtered with individual filters which are long (typically room impulse responses of 30.000 – 300.000 filter coefficients). Input-to-output latencies as well as latencies for the exchange of filters need to be very low (<20 ms). Despite the fact that recursive filters (IIR filters) require significantly less computational effort, they are not favored for interactive audio rendering including room acoustics.

Today efficient fast convolution algorithms exist and very powerful hardware is available, which allows to perform FIR filtering for the auralization.

While in former times such an extensive filtering could only be realized using specialized DSP boards, nowadays computer hardware is often used. Boosted by parallel hardware, like multi-CPU/multi-core machines, current computer systems are able to deliver the computation power for scenarios like introduced above. However, there is always the desire to create more realistic scenes, with more sound sources and more complex scene topologies, demanding more computational power for the real-time filtering. And of course the filtering might not be the only task running on the system. This is where graphic cards become interesting.

Modern graphic processing units (GPUs) easily outperform current multi-core CPUs by means of sheer floating-point performance. This is achieved by a massive level of inherent parallelism (several hundred individual stream processors (SPs) on a single graphic chip) and a more focussed and thereby simpler hardware architecture compared to general purpose CPUs. Using the GPU for general purpose calculations (popular by the synonym GPGPU) became possible with the arrival of programmable shaders.

Back then software development for graphic hardware was a tedious process. Today high-level programming interfaces (APIs), like NVIDIA’s Compute Unified Device Architecture (CUDA) and ATI’s Stream technology (formerly Close-To-Metal), make the development much easier. However, GPUs do not make CPU computing dispensable. The enormous computation power can only be unleashed, if algorithms meet the specific characteristics of the graphic hardware. Most suited for GPGPU computing are so-called stream algorithms, which perform the same set of operations on large data sets. Real-time FIR filtering falls into this class of algorithms. It has a high potential for parallelization and a low level of data interdependency. This makes it an ideal candidate for GPGPU computation.

### Fast Convolution Filters

Fast convolution as a method for efficient FIR filtering has been researched for more than four decades. Several fast convolution

algorithms are known today. We found that the choice of algorithm is even more important when considering GPGPU computation.

Therefore we first give a brief overview on the methods and discuss their pros and cons. Afterwards we introduce our chosen

algorithm.

All fast convolution algorithms have in common, that they calculate linear convolution efficiently in the frequency-domain, by simple multiplication of discrete Fourier spectra, known as circular convolution. The term fast is reasoned by the Fast Fourier Transform (FFT) used to convert between the time- and frequencydomain.

The original idea was proposed by Stockham in 1966. His algorithm uses one FFT to convolve two signals (M,N samples). The length of the FFT is chosen so that the convolution result (M + N − 1 samples) does not exceed it and time-aliasing is avoided. This algorithm outperforms time-domain filtering (direct-form FIR filters, TDLs) by several magnitudes.

However it has the disadvantage of an input-to-output latency that equals the FFT-length. Moreover its efficiency drops when long signals are convolved with short ones (many ineffective zeros are processed). These problems can be tackled by choosing a shorter FFT-length and by processing the input data in several steps —

either in overlap-add or overlap-save fashion. Still, the FFTlength is at least as long as the filter impulse response and so is the latency.

For real-time filtering with long filters, the filter also needs to be partitioned. This allows to freely choose FFT-lengths and thereby to adjust the latency. Two variants are known: In uniformly partitioned fast convolution, filters are subdivided into several subfilters of equal lengths. The overall output is assembled

from all subfilter outputs, which need to be delayed accordingly. Kulp demonstrates, how the number of required FFTs/IFFTs can be reduced to one, when delays and sums are implemented directly in the frequency-domain.The uniformly partitioned fast convolution is also most efficient for offline processing.

Here, the FFT-lengths can be optimized in order to minimize the computational effort. This algorithm is designed for efficient convolution of long filters (> 1000 filter coefficients) with a short input-to-output delay. Short subfilters are used to minimize the latency, whereas longer subfilters reduce the overall computational

effort. It can be shown that this algorithm is even significantly more efficient than the uniformly partitioned variant.

But a non-uniform partitioning has also drawbacks: As opposed to a uniform partitioning, the complete filter cannot be

exchanged with every processed block. This introduces additional filter exchange latencies.

### Fast Fourier Transforms

As explained in the introduction, the Fast Fourier Transform (FFT) is a keystone for frequency-domain FIR filtering. Large-scale realtime filtering applications with a multitude of individual channels and with long filters, account to a considerable number of FFTs and Inverse Fast Fourier Transforms (IFFTs) . But all of these

transforms have the small transform size 2·B, of two times the block length. A typical 256-point FFT is a rather simple operation and does not include too many arithmetical operations. Its single-threaded execution on one CPU-core/-thread of the test system is very fast and takes just 1,2 µs. Such single small FFTs cannot be

efficiently parallelized on the GPU. But CUDA supports the calculation of multiple equally-sized FFTs, we refer to as Batch-FFTs.

These are exactly what is needed in the uniformly-partitioned convolution algorithm. Hence it is very interesting to see where they can be carried out the fastest — on the CPU or GPU.

From these results we conclude that, for a reasonable number of channels (<128), the FFT-/IFFT-transforms for the streamed input and output data should be calculated on the CPU. The filter processing usually involves much more transforms. Computation on the GPU is beneficial here. These rules only apply for filtering tasks with long filters and consequently a large number of filter parts. In general, it is advised to consider the given filtering circumstances.