Parallel Execution of the Saturated Reductions
Benoît Dupont de Dinechin, Christophe Monat, Fabrice Rastello

To cite this version:

HAL Id: hal-02101824
https://hal-lara.archives-ouvertes.fr/hal-02101824
Submitted on 17 Apr 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Parallel Execution of the Saturated Reductions

Benoît Dupont de Dinechin
Christophe Monat
Fabrice Rastello

July 2001

Research Report № 2001-28
Parallel Execution of the Saturated Reductions

Benoît Dupont de Dinechin
Christophe Monat
Fabrice Rastello

July 2001

Abstract

This report addresses the problem of improving the execution performance of saturated reduction loops on fixed-point instruction-level parallel Digital Signal Processors (DSPs). We first introduce “bit-exact” transformations, that are suitable for use in the ETSI and the ITU speech coding applications. We then present “approximate” transformations, the relative precision of which we are able to compare. Our main results rely on the properties of the saturated arithmetic.

Keywords: Saturated Reductions, Fixed-Point Arithmetic, Fractional Arithmetic, ETSI / ITU Speech Coding, Parallel Reductions, Instruction Level Parallelism

Résumé

Ce rapport traite de la parallélisation des réductions saturées sur les processeurs DSP à arithmétique virgule fixe. Nous proposons dans un premier temps des transformations “bit-exactes”, appropriée aux applications de codage de parole ETSI et ITU. Puis nous présentons plusieurs transformations “approchées”, que nous comparons. Nos principales contributions sont basées sur les propriétés de l’arithmétique saturée.

Parallel Execution of the Saturated Reductions

Benoît Dupont de Dinechin *
Christophe Monat †
Fabrice Rastello ‡

July 2001

1 Introduction

The latest generation of fixed-point Digital Signal Processors (DSPs), that comprises the Texas Instruments C6200 and C6400 series [8], the StarCore SC140 [6], and the STMicroelectronics ST120 [7], rely on instruction-level parallelism exploitation, and fractional arithmetic support, in order to deliver high-performance at low cost on telecommunication and mobile applications.

In particular, the instruction set of these so-called DSP-MCUs [3] is tailored to the efficient implementation of standardized speech coding algorithms such as those published by the ITU (International Telecommunication Union) [5], and the ETSI (European Telecommunication Standards Institute) [4]. The widely used ETSI / ITU speech coding algorithms include:

**ETSI EFR-5.1.0** The Enhanced Full Rate algorithm, used by the European GSM mobile telephone systems.

**ETSI AMR-NB** The Adaptable Multi Rate Narrow Band, that will be used in the 3GPP/UMTS mobile telephone systems.

**ITU G.723.1, ITU G.729** Speech coders used in Voice over IP (VoIP), Voice over Network (VoN), and for H.324 and H.323 applications.

In the ETSI and the ITU reference implementations of these algorithms, data are stored as 16-bit integers under the $Q_{1.15}$ fractional representation, and are operated as 32-bit integers under the $Q_{1.31}$ fractional representation. The $Q_{n,m}$ fractional representation interprets a $n + m$ bit integer $x$ as [3]:

$$Q_{n,m}(x) = -2^{n-1}x_{n-1} + \sum_{k=0}^{n-2+m} 2^{k-m}x_k$$

In other words, $Q_{n,m}$ is the two's complement integer representation on $n + m$ bits, scaled by $2^{-m}$, so the range of a $Q_{n,m}$ number is $[-2^{n-1}, 2^{n-1} - 2^{-m}]$.

Two’s complement addition of $Q_{n,m}$ numbers yields a $Q_{n+1,m}$ number. Two’s complement multiplication of $Q_{n,m}$ numbers yields a $Q_{2n,2m}$ number with the two upper bits identical except in the $-2^{n-1} \times -2^{n-1} = 2^{2n-2}$ case. This multiplication result fits into a $Q_{2n-1,2m+1}$ number, after saturation of $2^{2n-2}$ to $2^{2n-2} - 2^{-2m-1}$. In particular, the $Q_{1,m}$ representations ($n = 1$) are widely used because their multiplication exactly fits $Q_{1,2m+1}$ numbers, except in the $-1 \times -1$ case where the relative saturation error is $2^{-2m-1}$.

Because the ETSI and the ITU reference implementations operate on the $Q_{1.15}$ and the $Q_{1.31}$ fractional representations, a saturation is involved every time two $Q_{1,31}$ numbers are added into a $Q_{1,31}$ number, and every time two $Q_{1,15}$ numbers are multiplied into a $Q_{1,31}$ number.

---

*STMicroelectronics. 12 rue Jules Horowitz, BP217, F-38019 Grenoble Cedex France. Benoit.Dupont-de-Dinechin@st.com
†STMicroelectronics. 12 rue Jules Horowitz, BP217, F-38019 Grenoble Cedex France. Christophe.Monat@st.com
‡STMicroelectronics. 12 rue Jules Horowitz, BP217, F-38019 Grenoble Cedex France. Fabrice.Rastello@ens-lyon.fr
\[ s_0 := s \]
for \( i = 0 \) to \( n - 1 \)
\[
  s_{i+1} := \text{saturate}(s_i + \text{saturate}(x[i] \times y[i] \ll 1))
\]
end for
return \( s_n \)

Saturated fractional MAC reduction

\[
\begin{align*}
  \text{saturate}(s) & \overset{\text{def}}{=} \begin{cases} 
    \text{if } s > \text{max32} & \text{then max32} \\
    \text{else if } s < \text{min32} & \text{then min32} \\
    \text{else } s 
  \end{cases} \\
  \text{max32} & \overset{\text{def}}{=} 2^{31} - 1 \\
  \text{min32} & \overset{\text{def}}{=} -2^{31}
\end{align*}
\]

Saturated additive reduction

\[
\begin{align*}
  s_0 := s \\
\end{align*}
\]
for \( i = 0 \) to \( n - 1 \)
\[
  s_{i+1} := \text{saturate}(s_i + a[i])
\]
end for
return \( s_n \)

Figure 1: Saturated reductions: motivating examples.

Let the \( \text{saturate} \) operator be defined on a 32-bit number \( s \) interpreted as a two’s complement integer:

In the ETSI and the ITU vocoders, a significant percentage of the computations is spent on \textit{saturated reductions} loops like those displayed in figure 1. In this figure, \( s \) and \( a[i] \) are 32-bit \( Q_{1,31} \) numbers, while \( x[i] \) and \( y[i] \) are 16-bit \( Q_{1,15} \) numbers. Subscripted symbols like \( s_i \) are temporary variables indexed by the loop iteration number \( i \), in order to reference their value in proofs. In real programs, such temporaries would be mapped to a single variable.

In the saturated fractional MAC reduction code of figure 1, the \( x[i] \times y[i] \) product is a 32-bit \( Q_{2,30} \) number, which is first shifted left one bit in order to align the fractional point. The result is a 33-bit \( Q_{2,31} \) number, which is saturated back to a 32-bit \( Q_{1,31} \) number. This value resulting from \( \text{saturate}(x[i] \times y[i] \ll 1) \) is then added to \( s \), to yield a 33-bit \( Q_{2,31} \) number, which is saturated again to 32-bit, yielding the new value of \( s \) under the \( Q_{1,31} \) representation.

In order to efficiently execute speech coding applications under the requirement of “bit-exactness” with the the ETSI and the ITU reference implementations, a DSP must support fractional arithmetic in its instruction set. As a matter of fact, today’s leading fixed-point DSPs offer \textit{multiply-accumulate} (MAC) instructions, including fractional multiply-accumulates that compute \( s := \text{saturate}(s + \text{saturate}(x[i] \times y[i] \ll 1)) \) and its variants in a single cycle.

On the new DSP-MCU, exposing more instruction-level parallelism than a single MAC per cycle is required in order to reach the peak performances. Indeed, these processors are able to execute two (TI C6200, STM ST120) to four (TI C6400, StarCore SC140) MACs per cycle. Unfortunately, saturated additive reductions are neither commutative nor associative, unlike modular integer additive reductions. Therefore, a main issue when optimizing the ETSI and the ITU reference implementations onto a particular DSP-MCU is to expose instruction-level parallelism on saturated reduction loops.

In this report, we discuss several techniques that enable the parallel execution of saturated reduction loops on the new DSP-MCU. In section 2, we present the “bit exact” transformations that are suitable for use in the ETSI / ITU speech coding algorithms, but require a 4-MAC DSP. In section 3, we study the relative correctness of approximate transformations of saturated reduction loops that are commonly used on 2-MAC DSPs.

2 Bit-Exact Parallel Saturated Reductions

2.1 Compiler Optimization of the ETSI and the ITU Codes

In the ETSI and the ITU reference implementations of speech coding algorithms, all the \( Q_{1,15} \) and \( Q_{1,31} \) arithmetic operations are available as the functions known as the \textit{basic operators} (files \texttt{basicop2.c} and \texttt{basic_op.h} in the ETSI codes). Among the basic operators, the most heavily used are (ordered by decreasing dynamic execution counts measured on the ETSI EFR-5.1.0):

\[ s_0 := s \]
for \( i = 0 \) to \( n - 1 \)
\[
  s_{i+1} := \text{saturate}(s_i + a[i])
\]
end for
return \( s_n \)
for (i = 0; i < lg; i++) {
    s = L_mult(x[i], a[0]);
    for (j = 1; j <= M; j++) {
        s = L_mac(s, a[j], x[i - j]);
    }
    s = L_shl(s, 3);
    y[i] = round(s);
}

(residu)

for (n = 0; n < L; n++) {
    s = 0;
    for (i = 0; i <= n; i++) {
        s = L_mac(s, x[i], h[n - 1]);
    }
    s = L_shl(s, 3);
    y[n] = extract_h(s);
}

(convolve)

\[
\begin{align*}
\text{add}(x, y) & \equiv \text{saturate}(x + y) \gg 16 \\
\text{L\_mac}(s, x, y) & \equiv \text{saturate}(s + \text{saturate}(x \times y \ll 1)) \\
\text{mult}(x, y) & \equiv \text{saturate}(x \times y \ll 1) \gg 16 \\
\text{L\_msu}(s, x, y) & \equiv \text{saturate}(s - \text{saturate}(x \times y \ll 1)) \\
\text{L\_mult}(x, y) & \equiv \text{saturate}(x \times y \ll 1) \\
\text{round}(s) & \equiv \text{saturate}(s + 2^{15}) \gg 16
\end{align*}
\]

In these expressions, \(x\) and \(y\) are 16-bit numbers under the \(Q_{1.15}\) representation, while \(s\) and \(t\) are 32-bit numbers under the \(Q_{1.31}\) representation.

When porting an ETSI / ITU reference implementation to a particular DSP, the first step is to redefine the basic operators as intrinsic functions, that is, functions known to the target C compiler, and inlined into one or a few instructions of the target processor. Efficient inlining is compiler challenging, as virtually all the ETSI / ITU basic operators have a side-effect on a C global variable named \texttt{overflow}, which is set whenever \texttt{saturate} effectively saturates its argument. Compiler data-flow analysis is used to isolate the reductions whose side-effects can be safely ignored.

Precisely, on the ST120 DSP-MCU processor by STMicroelectronics [7], the side-effects of hardware saturation and integer overflow are accumulated into sticky status registers\(^1\). The main optimization that the C compiler performs after mapping the ETSI / ITU basic operators to the ST120 instructions, is to dead-code eliminate the useless transfers of values between the \texttt{overflow} variable, and the ST120 sticky status registers.

Once efficient inlining of the ETSI / ITU basic operators is achieved, the performance bottlenecks are identified in order to trigger more aggressive compiler optimizations. On the ETSI / ITU speech coding algorithms, many of these bottlenecks involve saturated reduction loops. Typical examples of such loops, from the EFR-5.1.0 vocoder, are illustrated in figure 2. In these cases, parallel execution can be achieved without introducing any overhead, thanks to the \texttt{unroll-and-jam} compiler optimization.

Unroll-and-jam [1] can be described as outer loop unrolling, followed by the loop fusion of the resulting inner loops. The main issues with this transformation are checking the validity of the inner loop fusion, and dealing with iteration bounds of the inner loop that are outer loop variant. Unroll-and-jam of the codes of figure 2 is illustrated in figure 3. In the \texttt{residu} code, the compiler Inter-Procedural Analysis (IPA) infers that \(1g\) is even, and that \(y\) does not alias \(a\) or \(x\). Likewise in the \texttt{convolve} code, where the IPA infers that \(L\) is even, and that \(y\) is alias-free. Thanks to these informations, remainder code for the outer loop unrolling is avoided, and the inner loop fusion is found legal. Unlike \texttt{residu}, the \texttt{convolve} loop nest has a triangular iteration domain, so its inner loop fusion generates residual computations.

Unroll-and-jam of saturated reduction loops is not always an option. Either the memory dependencies carried by the outer loop prevent the fusion of the inner loops after unrolling, such as in the \texttt{syn\_filt} code found in the ETSI EFR-5.1.0 and the ETSI AMR-NB. Or there are no outer loops suitable for unrolling. In such cases, parallel execution can still be achieved, by using the arithmetic properties of the saturated additive reduction.

\(^1\)Registers that are only set as side-effect of instructions, and that must be reset with a dedicated instruction. Because of the stickyness property, updates to these status registers can proceed in parallel.
for (i = 0; i < lg/2; i++) {
  short i_e = 2 * i;
  short i_o = 2 * i + 1;
  int s_e = L_mult(x[i_e], a[0]);
  int s_o = L_mult(x[i_o], a[0]);
  for (j = 1; j <= M; j++) {
    s_e = L_mac(s_e, a[j], x[i_e - j]);
    s_o = L_mac(s_o, a[j], x[i_o - j]);
  }
  s_e = L_shl(s_e, 3);
  s_o = L_shl(s_o, 3);
  y[i_e] = round(s_e);
  y[i_o] = round(s_o);
}

(residu)

for (n = 0; n < L/2; n++) {
  short n_e = 2 * n;
  short n_o = 2 * n + 1;
  int s_e = 0;
  int s_o = 0;
  for (i = 0; i <= n_e; i++) {
    s_e = L_mac(s_e, x[i], h[n_e - i]);
    s_o = L_mac(s_o, x[i], h[n_o - i]);
  }
  s_o = L_mac(s_o, x[n_o], h[0]);
  s_e = L_shl(s_e, 3);
  s_o = L_shl(s_o, 3);
  y[n_e] = extract_h(s_e);
  y[n_o] = extract_h(s_o);
}

(convolve)

Figure 3: Saturated reductions: after unroll-and-jam.

\[ u_0 := s \]
\[ v_0 := 0 \]
\[ min_0 := \text{min32} \]
\[ max_0 := \text{max32} \]
\[ \text{for } i = 0 \text{ to } \frac{n}{2} - 1 \]
\[ u_{i+1} := \text{sat}(u_i + a[i]) \]
\[ v_{i+1} := v_i + a[i + \frac{n}{2}] \]
\[ min_{i+1} := \text{sat}(min_i + a[i + \frac{n}{2}]) \]
\[ max_{i+1} := \text{sat}(max_i + a[i + \frac{n}{2}]) \]
\[ \text{end for} \]
\[ \text{return } \text{clip}_{min_0}^{max_0}(u_0 + v_0) \]

Program 1

\[ \text{int } u = s; \]
\[ \text{long } v = 0; \]
\[ \text{int } \text{min} = \text{INT_MIN}, \text{max} = \text{INT_MAX}; \]
\[ \text{for } (i = 0; i < n/2; i++) \}
\[ u = \text{L_mac}(u, x[i], y[i]); \]
\[ v += \text{L_mult}(x[i+n/2], y[i+n/2]); \]
\[ \text{min} = \text{L_mac}(\text{min}, x[i+n/2], y[i+n/2]); \]
\[ \text{max} = \text{L_mac}(\text{max}, x[i+n/2], y[i+n/2]); \]
\[ v += u; \]
\[ \text{if } (v > \text{max}) \text{ return max; } \]
\[ \text{if } (v < \text{min}) \text{ return min; } \]
\[ \text{return } v; \]

Program 2

Figure 4: Saturated reductions: parallelized code with one 40-bit accumulation.

2.2 Exploitation of a 4-MAC DSP with 40-bit Accumulators

We now introduce a first technique, based on the arithmetic properties of the saturated additive reduction, that enables the “bit-exact” parallel execution of the saturated reductions. This technique requires a DSP that executes four MAC per cycle, to achieve an effective throughput of two iterations of the original saturated reduction loop per cycle. The pseudo-code and the C code that implement this technique are displayed in figure 4. Program 2 assumes the data-type mapping of the TI C6000 and the STM ST120 C compilers: long integers are 40-bit, integers are 32-bit, and short integers are 16-bit.

Let us first consider the two programs in figure 5, that compute a saturated reduction over \( n \) values \( a[i] \). The \text{clip} operator is such that \text{sat} \( u \equiv \text{clip}_{min_0}^{max_0}(u) \):

\[
\text{clip}_h(s) = \text{sat} \begin{cases} 
  1 & \text{if } l > h \\
  s & \text{if } s > h \\
  h & \text{if } s < h \\
  s & \text{else}
\end{cases}
\]

**Theorem 1** In figure 5, if \( min_0 \leq s \leq max_0 \), then Program 3 and Program 4 compute the same result.
$s_0 := s$
for $i = 0$ to $n - 1$
  $s_{i+1} := \text{saturate}(s_i + a[i])$
end for
return $s_n$

Program 3


$S_0 := s$
$\min_0 := \ldots$
$max_0 := \ldots$
for $i = 0$ to $n - 1$
  $S_{i+1} := S_i + a[i]$
  $\min_{i+1} := \text{saturate}(\min_i + a[i])$
  $\max_{i+1} := \text{saturate}(\max_i + a[i])$
end for
return $\text{clip}_{\min_0}^{\max_0}(S_n)$

Program 4

Figure 5: Saturated reductions: equivalent codes when $\min_0 \leq s \leq \max_0$.

Figure 6: Possible case transitions between the induction steps $i$ and $i + 1$ of Theorem 1.

The proof is done by induction on the iteration index $i$, under the induction hypothesis that $s_i$ in Program 3 equals $\text{clip}_{\min_i}^{\max_i}(S_i)$ in Program 4. This induction hypothesis is actually equivalent to the four following cases:

1. $S_i \leq \min_i = s_i < \max_i$
2. $\min_i \leq s_i = S_i \leq \max_i$ and $\min_i \neq \max_i$
3. $\min_i < \max_i = s_i \leq S_i$
4. $\min_i = s_i = \max_i$

Because $\min_0 \leq S_0 = s \leq \max_0$, $\text{clip}_{\min_0}^{\max_0}(S_0) = S_0 = s$, while $s_0 = s$, so the induction hypothesis is verified for $i = 0$. The proof is completed by applying Lemma 1 to each of the four cases above, as summarized in Figure 6.

Lemma 1 Let $x$ and $y$ be two numbers.
If $x \leq y$ then $\text{min32} \leq \text{saturate}(x + a[i]) \leq \text{saturate}(y + a[i]) \leq \text{max32}$
If $x \leq y$ and $a[i] \leq 0$ then $x + a[i] \leq \text{saturate}(y + a[i])$.
If $x \leq y$ and $a[i] \geq 0$ then $\text{saturate}(x + a[i]) \leq y + a[i]$.

The proof follows from the definition of saturate.

Corollary 1 Program 1 and program 3 compute the same result.
Let us introduce the following notations:

\[
\begin{aligned}
\bigoplus_{i=0}^{m-1} (s, a[i]) &:= \begin{cases} 
  s_0 := s & \text{for } i = 0 \text{ to } m - 1 \\
  s_{i+1} := \text{saturate}(s_i + a[i]) & , \text{ and } \bigoplus_{i=0}^{m-1} (a[i]) \equiv \bigoplus_{i=0}^{m-1} (0, a[i]) \\
\end{cases}
\end{aligned}
\]

end for

return \(s_m\)

At the Program 1 end for control point, we have: \(u_{\mathbb{W}} = \bigoplus_{i=0}^{n-1} (s, a[i])\), so the result \(s_n\) computed by Program 3 equals:

\[
s_n = \bigoplus_{i=0}^{n-1} (s, a[i]) = \bigoplus_{i=0}^{n-1} \left( \bigoplus_{i=0}^{n-1} (s, a[i]), a[i] \right) = \bigoplus_{i=0}^{n-1} (u_{\mathbb{W}}, a[i])
\]

However, \(\min\!_{\mathbb{W}} \leq u_{\mathbb{W}} \leq \max\!_{\mathbb{W}}\), so by applying Theorem 1, with \(S_0\) replaced by \(u_{\mathbb{W}}\) and \(S_i\) replaced by \(u_{\mathbb{W}} + v_i\) in Program 4, we get the stated result.

\[\square\]

**Note** The proof of Corollary 1 assumes infinite precision arithmetic for computing \(S_i\) in Program 4, as opposed to the 32-bit arithmetic that is used for \(\min_i\) and \(\max_i\). As far as \(n\) does not exceed \(2^8\), 10ng integer computations are equivalent to using unlimited precision integer arithmetic. In particular this condition is verified in the ETSI / ITU reference implementations, where the main vector lengths are 40 and 160. On applications where the vector lengths may exceed \(2^8\), strip-mining of the reduction loop can be used to create inner loops that iterate no more than \(2^8\).

However, by applying Lemma 2 of the next section, we find that using 33-bit arithmetic in place of infinite precision arithmetic for computing \(S_i\) in Program 4 is enough: whenever \(S_i\) overflows in 33-bit arithmetic, \(\min\!_{\mathbb{W}} = \max\!_{\mathbb{W}}\), so \(S_n\) no longer contributes to the end result of Program 4. Thus strip-mining of the reduction loop is not necessary, and Program 2 actually works whatever the loop iteration count \(n\). In addition, Program 2 only requires that 10ng integers are larger than 32-bit.

### 2.3 Exploitation of a 4-MAC DSP with 32-bit Accumulators

One problem with the method of section 2.2 is that it requires a target DSP that computes three saturated 32-bit MACs, plus one non-saturated 40-bit MAC, per cycle. In this section, we show that Program 5 in figure 7 is bit-exact. The corresponding C code in Program 6 achieves an effective throughput of two iterations of the original saturated reduction loop per cycle on a 4-MAC DSP, and only requires 32-bit accumulations: three with saturation, and one that uses 32-bit modular integer arithmetic denoted \(\mathbb{W}\).

**Theorem 2** In figure 8, Program 4′ and Program 7 compute the same result.

A first remark is that if \(\min\!_{\mathbb{W}} = \max\!_{\mathbb{W}}\), then Program 4′ and Program 7 return the same result: \(\min\!_{\mathbb{W}} = \max\!_{\mathbb{W}}\). Hence we shall assume \(\min\!_{\mathbb{W}} < \max\!_{\mathbb{W}}\), and the proof reduces to Lemma 2, followed by the simple observations:

\[
\begin{align*}
(6) \quad & \Rightarrow \quad S_n = S_n + 2^{32} \geq \min\!_{\mathbb{W}} + 2^{32} = 2^{31} > \max\!_{\mathbb{W}} \\
(7) \quad & \Rightarrow \quad S_n = S_n - 2^{32} \leq \max\!_{\mathbb{W}} - 2^{32} = -2^{31} - 1 < \min\!_{\mathbb{W}} \leq \min\!_{\mathbb{W}}
\end{align*}
\]

This yields Program 7, that also works in the \(\min\!_{\mathbb{W}} = \max\!_{\mathbb{W}}\) case.

\[\square\]

**Lemma 2** If \(\min\!_{\mathbb{W}} \neq \max\!_{\mathbb{W}}\) then we have:

\[
\begin{align*}
\max\!_{\mathbb{W}} - \max\!_{\mathbb{W}} & \leq S_n - S \leq \min\!_{\mathbb{W}} - \min\!_{\mathbb{W}} \quad \iff \quad S_n = S_n \\
S_n - S & < \max\!_{\mathbb{W}} - \max\!_{\mathbb{W}} \quad \iff \quad S_n = S_n + 2^{32} \\
S_n - S & > \min\!_{\mathbb{W}} - \min\!_{\mathbb{W}} \quad \iff \quad S_n = S_n - 2^{32}
\end{align*}
\]
\[ u_0 := s \]
\[ v_0 := 0 \]
\[ \text{min}_0 := \text{min}32 \]
\[ \text{max}_0 := \text{max}32 \]
for \( i = 0 \) to \( \frac{n}{2} - 1 \)
\[ u_{i+1} := \text{saturation}(u_i + a[i]) \]
\[ v_{i+1} := v_i + a[i + \frac{n}{4}] \]
\[ \text{min}_{i+1} := \text{saturation}(\text{min}_i + a[i + \frac{n}{4}]) \]
\[ \text{max}_{i+1} := \text{saturation}(\text{max}_i + a[i + \frac{n}{4}]) \]
end for
\[ \text{inf} := \text{max}_n - \text{max}_{32} \]
\[ \text{sup} := \text{min}_n - \text{min}_{32} \]
if \( \text{inf} \leq v_{\frac{n}{4}} \leq \text{sup} \) then
\[ \text{return clip}_{\text{min}_{\frac{n}{4}}}^{\text{max}_{\frac{n}{4}}}(u_{\frac{n}{4}} + v_{\frac{n}{4}}) \]
end if
\[ \text{return max}_n \text{ if } v_{\frac{n}{4}} < \text{inf} \]
\[ \text{return min}_n \text{ if } v_{\frac{n}{4}} > \text{sup} \]

Program 5

Figure 7: Saturated reductions: parallelized code with only 32-bit accumulations.

If \( a[i] \geq 0 \), \( \text{max}_{i+1} = \text{saturation}(\text{max}_i + a[i]) \leq \text{max}_i + a[i] \) by Lemma 1, so \( \text{max}_{i+1} - \text{max}_i = a[i] = S_{i+1} - S_i \).

Else if \( a[i] < 0 \), \( \text{max}_{i+1} = \text{saturation}(\text{max}_i + a[i]) = \text{max}_i + a[i] \), and \( \text{max}_{i+1} - \text{max}_i = a[i] = S_{i+1} - S_i \).

Otherwise \( \text{max}_{i+1} \) has reached \( \text{min}32 \), a contradiction since \( \text{min}32 \leq \text{min}_{i+1} \leq \text{max}_{i+1} = \text{min}32 \Rightarrow \text{min}_{i+1} = \text{max}_{i+1} \Rightarrow \text{min}_n = \text{max}_n \). Summing the \( n \) inequalities \( \text{max}_{i+1} - \text{max}_i \leq S_{i+1} - S_i \) yields \( \text{max}_n - \text{max}_0 \leq S_n - S_0 \Leftrightarrow \text{max}_n - \text{max}_{32} \leq S_n - s \). Similarly, \( S_n - s \leq \text{min}_n - \text{min}_{32} \). This yields (8):

\[
\text{max}_n - \text{max}_{32} \leq S_n - s \leq \text{min}_n - \text{min}_{32} \tag{8}
\]

\[
-2^{32} + 1 = \text{min}32 - \text{max}_{32} < S_n - s < \text{max}32 - \text{min}32 = 2^{32} - 1 \tag{9}
\]

\[
-2^{32} + 1 = \text{min}32 - \text{max}_{32} \leq S_n - s \leq \text{max}32 - \text{min}32 = 2^{32} - 1 \tag{10}
\]

Equation (9) is implied by (8), while (10) holds because \( \overline{S}_n \) and \( s \in [\text{min}32, \text{max}32] \). Since \( S_n \) and \( \overline{S}_n \) are computed the same way, except for the modular 32-bit addition in case of \( \overline{S}_n \), we are left with only three possibilities:

\( \diamond \) \( S_n = \overline{S}_n \) Then (8) reduces to case (5).

\( \diamond \) \( S_n = \overline{S}_n + 2^{32} \) We show by contradiction that \( \overline{S}_n - s < \text{max}_n - \text{max}_{32} \), thus reducing to case (6).

From (8), we have \( \overline{S}_n - s + 2^{32} \leq \text{min}_n - \text{min}_{32} \). Subtracting these inequalities yields \( 2^{32} \leq \text{min}_n - \text{max}_n + \text{max}_{32} - \text{min}_{32} - 2^{32} = \text{min}_n - 1 < \text{min}_n \).

\( \diamond \) \( S_n = \overline{S}_n - 2^{32} \) Likewise one can show by contradiction that \( \overline{S}_n - s > \text{min}_n - \text{min}_{32} \), thus reducing to case (7).

Finally, as \( \text{min}_n \) and \( \text{max}_n \) are signed 32-bit integers, \( \text{max}_n - \text{min}_n \leq \text{max}_{32} - \text{min}_{32} \Leftrightarrow \text{max}_n - \text{max}_{32} \leq \text{min}_n - \text{min}_{32} \). Therefore, the inequalities \( \overline{S}_n - s < \text{max}_n - \text{max}_{32} \) and \( \overline{S}_n - s > \text{min}_n - \text{min}_{32} \) are exclusive.

\( \square \)

**Corollary 2** Program 5 computes the same result as Program 3.

Proof identical to the proof of Corollary 1.  

\( \square \)
3. Approximate parallel reductions

3.1 Problem Statement and Notations

Section 2 illustrates that satisfying the “hi-exact” requirements when unaligned data buses do not apply bitwise
computation reduces the required number of parallel MACs. In this section, we present approximate
transformations of the saturated reductions that are applied to DpQI results with only two parallel MACs. All these transformations can
reduce the number of parallel MACs from four to two. The advantage is that
the approximate parallel saturated reductions discussed here:

\[ S_{i} = \lambda \text{saturate} \left( \sum_{j=1}^{n} a_{ij} + \sum_{j=1}^{n} b_{ij} \right) \]

3.2 Program: approximate parallel reductions

3.2.1 Program 1: approximate parallel reductions

Then the different approximation cases we shall discuss for \( i \in \mathbb{L} \) are:

\[ \text{MAX}_{i} \equiv \min_{j} \left\{ a_{ij}, b_{ij} \right\} \]

\[ \text{MIN}_{i} \equiv \min_{j} \left\{ a_{ij}, b_{ij} \right\} \]

\[ \text{MAX}_{i} - \text{MIN}_{i} = \max_{j} \left\{ a_{ij}, b_{ij} \right\} \]

\[ \text{MIN}_{i} - \max_{j} \left\{ a_{ij}, b_{ij} \right\} \]

\[ \text{MAX}_{i} - \min_{j} \left\{ a_{ij}, b_{ij} \right\} \]

\[ \min_{j} \left\{ a_{ij}, b_{ij} \right\} - \max_{j} \left\{ a_{ij}, b_{ij} \right\} \]

The common theme of these approximate algorithms is to split the reduction into sub-reductions, which
are computed in parallel and combined at the end using exact addition. When using non-saturated
arithmetic (modulo integer arithmetic), overflows must be avoided by using higher precision arithmetic that
some algorithms expose spatial locality of memory accesses such as \( S_{i} \) and \( S_{i+1} \).
case 3 Saturation on both sides.

Each approximate algorithm achieves a tradeoff between three points:

**Accuracy** The main reason for using saturated arithmetic in fixed-point digital signal processing, instead of integer modular arithmetic, is the better behavior of the former in linear filtering applications. However, as saturations in linear filters implies signal distortion, it is a design goal of fixed-point digital signal processing algorithms to avoid saturation as much as possible. Consequently, we sort those three different cases in order of importance:

case 1 > case 2 > case 3

**Interleaving** There are two basic ways to split the reduction:

- either separate odd and even index, as \( \sum_{i=0}^{\frac{n-1}{2}} a[i] = \sum_{i=2}^{\frac{n-1}{2}} a[2i] + \sum_{i=0}^{\frac{n-1}{2}} a[2i+1]. \)
- or maintain the main order by summing the first \( \frac{n}{2} \) as well as the last \( \frac{n}{2} \) elements together, \( \sum_{i=0}^{\frac{n-1}{2}} a[i] = \sum_{i=0}^{\frac{n-1}{2}} a[i] + \sum_{i=\frac{n}{2}}^{\frac{n-1}{2}} a[i] \)

Interleaving potentially yields better performances, as it exposes spatial locality between \( a[2i] \) and \( a[2i+1] \). On the new DSP-MCUs, spatial locality enables memory access packing, that is, loading or storing a pair of 16-bit numbers as a single 32-bit memory access.

32-bit versus 40-bit We will see that saturating the sub-sums usually gives worse results than summing in higher-precision arithmetic, then saturating in the end.

### 3.2 Main Approximation Results

In this section, we denote: \( S \overset{\text{def}}{=} \lambda x. \bigoplus_{i=\frac{n}{2}}^{n-1}(x, a[i]). \)

Let us denote by \( s_m \) the result of \( \bigoplus_{i=0}^{m-1}(s, a[i]) \), so that by \( s_{\frac{n}{2}} = \bigoplus_{i=\frac{n}{2}}^{n-1}(s, a[i]) \). The original saturated reduction \( s_n = \bigoplus_{i=0}^{n-1}(s, a[i]) = \bigoplus_{i=\frac{n}{2}}^{n-1}(s_{\frac{n}{2}}, a[i]) \), and we are interested to know when \( \lambda x. \bigoplus_{i=\frac{n}{2}}^{n-1}(x, a[i]) \) saturates.

**Lemma 3 (Saturation of S)**

\[
S \text{ max-saturates } \iff x + \frac{n-1}{2} \maxSigma a[i] > \max32 \tag{11}
\]

\[
S \text{ min-saturates } \iff x + \frac{n-1}{2} \minSigma a[i] < \min32 \tag{12}
\]

\[
S \text{ min-max-saturates } \implies \maxSigma a[i] - \minSigma a[i] > \max32 - \min32 \tag{13}
\]

\[
S \text{ does not saturate } \implies \minSigma a[i] - \maxSigma a[i] \leq \max32 - \min32 \tag{14}
\]

Relations (11) and (12) are straightforward from the definition of \( \maxSigma \) and \( \minSigma \): because \( \min \) and \( \max \) are linear functions, \( x \) can be subtracted to each sum and moved out from the \( \min \) and \( \max \). Relations (13) and (14) are obtained by subtractions on (11) and (12).

**Lemma 4 (Saturation and extrema)** Let us denote by \( S_m = \bigoplus_{i=0}^{m-1}(0, a[i]) \):

If \( S_m \) max-saturates and does not min-saturate, then

\[
S_m = \max32 \iff \sum_{i=0}^{m-1} a[i] = \maxSigma a[i] \tag{15}
\]
If $S_m$ min-saturates and does not max-saturate, then

$$S_m = \min 32 \iff \sum_{i=0}^{m-1} a[i] = \MIN_{i=0}^{m-1} a[i]$$

Consider that $S_m = \max 32$ does not saturate on min. Suppose, by contradiction, that there exists $j < m$ such that $\sum_{i=0}^{j-1} a[i] > \sum_{i=0}^{m-1} a[i]$. Then, we have $\sum_{i=j}^{m-1} a[i] < 0$. Eventually because $\bigoplus_{i=0}^{m-1} (a[i])$ does not saturate on min,

$$\bigoplus_{i=j}^{j-1} \bigoplus_{i=0}^{j-1} (a[i], a[i]) \leq \bigoplus_{i=0}^{j-1} (a[i]) + \sum_{i=j}^{m-1} a[i]$$

$$< \bigoplus_{i=0}^{j-1} (a[i])$$

$$< \max 32$$

Reciprocally, suppose $\sum_{i=0}^{m-1} a[i] = \MAX_{i=0}^{m-1} a[i]$ and consider the largest index $0 \leq j \leq m$ such that $S_j = \max 32$. We have $S_m = S_j + \sum_{i=j}^{m-1} a[i]$. Then, because $\sum_{i=j}^{m-1} a[i] \geq 0$, $S_m = \max 32$. \qed

Theorem 3 (Exact value of S)

$S$ does not saturate $\implies$ \quad $S(x) = x + \sum_{i=x}^{n-1} a[i]$ \quad (15)

$S$ max-saturates does not min-saturate $\implies$ \quad $S(x) = \sum_{i=x}^{n-1} a[i] - \left( \MAX_{i=x}^{n-1} a[i] - \max 32 \right)$ \quad (16)

$S$ min-saturates does not max-saturate $\implies$ \quad $S(x) = \sum_{i=x}^{n-1} a[i] - \left( \MIN_{i=x}^{n-1} a[i] - \min 32 \right)$ \quad (17)

$S$ min-max-saturates $\implies$ \quad $S(x) = \bigoplus_{i=x}^{n-1} (a[i]) \forall x \in [\min 32, \max 32]$ \quad (18)

$\diamond$ To prove Relation (16) let us prove by induction on $m \geq j$ that if $S_m$ saturates on max, not on min:

$$\left( S_m = \bigoplus_{i=0}^{m-1} (a[i]) \right) = \sum_{i=0}^{m-1} a[i] - \left( \MAX_{i=0}^{m-1} a[i] - \max 32 \right)$$

(19)

where $j$ is the smallest index such that $\sum_{i=0}^{j-1} a[i] > \max 32$.

Because $S_j = \max 32$ and $\sum_{i=0}^{j-1} a[i] = \MAX_{i=0}^{j-1} a[i]$, and the induction hypothesis is true for $m = j$. Let us consider $m$ such that (19) is true. There are two possibilities:

1. $S_{m+1} = \max 32$, and the result directly extends from Lemma 4.

2. $S_{m+1} = S_m + a[m] < \max 32$, and thanks to Lemma 4, $\sum_{i=0}^{m} a[i] \neq \MAX_{i=0}^{m} a[i]$, so $\MAX_{i=0}^{m} a[i] = \MAX_{i=0}^{m-1} a[i]$. The result follows from the relation on $m$.

$\diamond$ Let us prove Relation (18). Suppose without loss of generality that $x \leq x'$. Because

$x \leq y \implies \operatorname{saturate}(x + a) \leq \operatorname{saturate}(y + a)$

We have,

$$\forall m, \bigoplus_{i=x}^{m-1} (a[i]) \leq \bigoplus_{i=x}^{m-1} (a[i]) \leq \max 32$$

10
Now, since $S$ saturates on $\max$, if we denote by $j$ an index where it saturates, we have
\[
\max 32 = \bigoplus_{i = \frac{n}{2}}^{j - 1} (x, a[i]) = \bigoplus_{i = \frac{n}{2}}^{j - 1} (x', a[i])
\]
\[
\blacksquare
\]

**Corollary 3 (Comparison on case 1)** If $S$ does not saturate, then:
\[
S(x) = \text{saturate} \left( x + \bigoplus_{i = \frac{n}{2}}^{n-1} (a[i]) \right) \iff \bigoplus_{i = \frac{n}{2}}^{n-1} (a[i]) \text{ does not saturate}
\]

The reciprocal is clearly true.
Suppose by contradiction that $S' = \bigoplus_{i = \frac{n}{2}}^{n-1} (a[i])$ saturates. Since $S$ does not saturates, from relations (13) and (14), we can prove by contradiction that $S'$ does not saturate both on $\max$ and $\min$. Hence, consider without loss of generality that $S'$ saturates on $\max$, not on $\min$. From Relation (16) we get
\[
\text{saturate} \left( x + \bigoplus_{i = \frac{n}{2}}^{n-1} (a[i]) \right) = \text{saturate} \left( x + \sum_{i = \frac{n}{2}}^{n-1} a[i] - \left( \max_{i = \frac{n}{2}}^{n-1} a[i] \right) \right)
\]
\[
= \text{saturate} \left( S - \left( \max_{i = \frac{n}{2}}^{n-1} a[i] \right) \right)
\]
\[
= S - \left( \max_{i = \frac{n}{2}}^{n-1} a[i] \right)
\]
\[
\neq S
\]
\[
\blacksquare
\]
As a consequence of Theorem 3 and Corollary 3, in case 1 on $[\frac{n}{2}, n]$, $S_1$ is better than $S_2$.

**Theorem 4 (Comparison on case 2)** Suppose $x \neq 0$, $S$ max-saturates, and $S$ does not min-saturate. Then:
\[
S(x) = \text{saturate} \left( x + \sum_{i = \frac{n}{2}}^{n-1} a[i] \right) \iff \max_{i = \frac{n}{2}}^{n-1} a[i] = \sum_{i = \frac{n}{2}}^{n-1} a[i] \tag{20}
\]
\[
S(x) = \text{saturate} \left( x + \bigoplus_{i = \frac{n}{2}}^{n-1} (a[i]) \right) \iff \max_{i = \frac{n}{2}}^{n-1} a[i] = \sum_{i = \frac{n}{2}}^{n-1} a[i] \wedge x > 0 \tag{21}
\]

Which is true in particular when $a[i] \geq 0 \forall i$.

\* Let us start with the proof of (20). From (16) we have
\[
S(x) = x + \sum_{i = \frac{n}{2}}^{n-1} a[i] - \left( x + \max_{i = \frac{n}{2}}^{n-1} a[i] - \max 32 \right)
\]
Also from (11) we get that
\[
x + \max_{i = \frac{n}{2}}^{n-1} a[i] - \max 32 > 0
\]
Consequently,
\[
S(x) \neq x + \sum_{i = \frac{n}{2}}^{n-1} a[i]
\]
11
and
\[ S(x) = \text{saturate} \left( x + \sum_{i=0}^{n-1} a[i] \right) \Leftrightarrow S(x) = \text{max32} \]

Eventually, we apply Lemma 4.

\( \diamond \) To prove (21) we need to consider two cases:

1. \( \oplus_{i=0}^{n-1}(a[i]) \) saturates on \( \text{max} \). By applying Theorem 3 both on \( \oplus_{i=0}^{n-1}(x, a[i]) \) and \( \oplus_{i=0}^{n-1}(0, a[i]) \), we get \( S(x) = \oplus_{i=0}^{n-1}(0, a[i]) \). Now, \( x \neq 0 \) and \( x + \oplus_{i=0}^{n-1}(a[i]) \neq S(x) \). So, \( \text{saturate} \left( x + \oplus_{i=0}^{n-1}(a[i]) \right) = S(x) \) implies \( S(x) = \text{max32} \), and the equivalence relation follows.

2. \( \oplus_{i=0}^{n-1}(a[i]) \) does not saturate on \( \text{max} \). First, because \( S \) saturates on \( \text{max} \), necessary \( x > 0 \). Moreover, \( x + \oplus_{i=0}^{n-1}(a[i]) \geq x + \sum_{i=0}^{n-1} a[i] > \oplus_{i=0}^{n-1}(x, a[i]) = S(x) \). Hence, \( \text{saturate} \left( x + \oplus_{i=0}^{n-1}(a[i]) \right) = S(x) \Leftrightarrow S(x) = \text{max32} \).

As a consequence of Theorem 4, in case 2 on \( \left[ \frac{1}{2}, n - 1 \right] \), \( S_1 \) is better than \( S_2 \).

**Corollary 4 (Comparison on case 3)** If \( x \neq 0 \) and \( S \) min-max-saturates:

\[
S(x) = \text{saturate} \left( x + \sum_{i=0}^{n-1} a[i] \right) \quad \Leftrightarrow \quad \{ \begin{array}{l}
\text{MAX} \sum_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i] \\
\text{MIN} \sum_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i]
\end{array}
\]

\[
S(x) = \text{saturate} \left( x + \sum_{i=0}^{n-1} (a[i]) \right) \quad \Leftrightarrow \quad \{ \begin{array}{l}
\text{MAX} \sum_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i] \land x > 0 \\
\text{MIN} \sum_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i] \land x < 0
\end{array}
\]

Thus in case 3 on \( \left[ \frac{1}{2}, n - 1 \right] \), \( S_1 \) is better than \( S_2 \).

### 3.3 Classification of the Approximate Algorithms

On the light of previous remarks and theorems, we compared the four different algorithms \( (S_1, S_2, S_3, S_4) \) and summarized their characteristics: \( S_4 \) is the worst approximation because whenever one of its sub-reductions or the original sum \( \oplus_{i=0}^{n-1}(s, a[i]) \) saturates, the result is not “bit-exact”.

The main drawback of \( S_1 \) is that it requires the target processor to operate on numbers larger than 32-bit, for instance 40-bit on most fixed-point DSPs. In that case, the loop iteration count has to be no larger than \( 2^n \), else strip-mining of the reduction loop is required.

When the target processor is limited to 32-bit arithmetic, or suffers significant performance degradations when operating at higher precision, the \( S_2 \) or \( S_4 \) approximate techniques should be used to parallelize the saturated reductions.
<table>
<thead>
<tr>
<th>Condition for Correctness</th>
<th>Properties</th>
</tr>
</thead>
<tbody>
<tr>
<td>( S_1 )</td>
<td>Needs 40-bit operations. Better than ( S_2 ) and ( S_3 ).</td>
</tr>
</tbody>
</table>

\[\begin{align*}
\circ & \text{ case 1 on } \left[ \frac{p}{2}, n \right] \\
& \text{ or } \max_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i] \\
& \text{ or } \min_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i] 
\end{align*}\]

<table>
<thead>
<tr>
<th>Condition for Correctness</th>
<th>Properties</th>
</tr>
</thead>
<tbody>
<tr>
<td>( S_2 )</td>
<td>Only 32-bit operations. Better than ( S_4 ).</td>
</tr>
</tbody>
</table>

\[\begin{align*}
\circ & \text{ case 1 on } \left[ \frac{p}{2}, n \right] \land \text{ no saturation of } \bigoplus_{i=0}^{n-1}(a[i]) \\
& \text{ or } \max_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i] \land \bigoplus_{i=0}^{n-1}(s, a[i]) > 0 \\
& \text{ or } \min_{i=0}^{n-1} a[i] = \sum_{i=0}^{n-1} a[i] \land \bigoplus_{i=0}^{n-1}(s, a[i]) < 0 
\end{align*}\]

<table>
<thead>
<tr>
<th>Condition for Correctness</th>
<th>Properties</th>
</tr>
</thead>
<tbody>
<tr>
<td>( S_3 )</td>
<td>Spatial locality. Needs 40-bit operations. Better than ( S_4 ).</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Condition for Correctness</th>
<th>Properties</th>
</tr>
</thead>
</table>

\[\begin{align*}
\circ & \text{ case 1 on } [0, \frac{p}{2}] \land \text{ same conditions as } S_1. \\
& \text{ and neither } \bigoplus_{i=0}^{n-1}(s, a[2i]) \text{ nor } \bigoplus_{i=0}^{n-1}(a[2i+1]) \\
& \text{ saturates} 
\end{align*}\]

Also, so as to illustrate the non-correctness of several algorithms, we provide counter-examples in Table 1. In this table, we use the previously introduced notations, in addition to:

\[
S_5 = \text{sat}(s + \bigoplus_{i=1}^{n}(a[n-i]))
\]

\[
S_6 = \text{sat}(\bigoplus_{i=0}^{n-1}(s, a[i]) + \bigoplus_{i=1}^{n}(a[n-i]))
\]

\[
S_7 = \text{sat}\left(\bigoplus_{i=0}^{n-1}(a[i])\right)
\]

|  |  |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|
| a | \( S \) | \( S_1 \) | \( S_2 \) | \( S_3 \) | \( S_4 \) | \( S_5 \) | \( S_6 \) | \( S_7 \) |
| -1 | 7 | 5 | -2 | -4 | 1 | 10 | 7 | 2 | -2 | 13 | 15 | 15 | 15 | 14 | 15 |
| 10 | 7 | -8 | 1 | -12 | -1 | 13 | 4 | -2 | -7 | 3 | 3 | 2 | 5 | 5 | 5 | 3 | 6 |
| 6 | 6 | 5 | -2 | -10 | -1 | -5 | 13 | 4 | -1 | 13 | 13 | 13 | 15 | 14 | 14 | 12 | 10 |
| 6 | 7 | 4 | -2 | 3 | -5 | -4 | 2 | 3 | -1 | 10 | 10 | 10 | 13 | 13 | 13 | 10 | -5 |
| -2 | 9 | -13 | 3 | -6 | -2 | 5 | 7 | 1 | 8 | 10 | 10 | 6 | 10 | 5 | 4 | 4 | 15 |
| 1 | 2 | -1 | 3 | 4 | -3 | 1 | -2 | 5 | 3 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 4 |

Table 1: Counter examples: for the sake of simplicity, the examples are scaled to min32 = -16, max32 = 15, \( n = 10 \), and \( s = 0 \).

## 4 Conclusions

This paper addresses the problem of improving the performance of the saturated reductions on fixed-point Digital Signal Processors (DSPs), under the requirement to implement “bit-exact” variants of the reference telecommunications algorithms such as the ETSI EFR-5.1.0, the ETSI AMR-NB, the ITU G.723.1, and the ITU G.729. This problem is motivated by the need to exploit the instruction-level parallelism available on
the new generation of DSP-MCUs, in particular the Texas Instruments C6200 and C6400 series [8], the StarCore SC140 [6], and the STMicroelectronics ST120 [7].

On the ETSI ans the ITU reference implementations, several saturated reductions loops can be parallelized by applying unroll-and-jam, that is, unrolling of the outer loop into the inner loop so as to create more parallel work. When unroll-and-jam is not applicable, the arithmetic properties of the saturation operator allow to compute two saturated reduction steps per cycle, at the expense of four multiply-accumulate operations per cycle. Our main technique requires one 40-bit integer accumulation, and three 32-bit saturated accumulations, per cycle. Then we show how to replace this 40-bit integer accumulation by a 32-bit integer accumulation.

When the “bit-exact” requirement can be relaxed, more efficient but approximate techniques can be used to parallelize the saturated reductions. Based on further arithmetic properties of the saturation operator, we compare the approximate techniques to each other. In particular, the commonly used technique $S_4$ that computes two interleaved saturated sub-reductions achieves the worst approximation. We find that the most precise approximate technique $S_4$ computes the first half of the reduction in order using 32-bit saturation, and the second half using 40-bit integer arithmetic.

References


