Paper link: https://arxiv.org/abs/2111.00396
Homepage: https://github.com/state-spaces/s4
This is a well-written paper. I believe it can serve as a good example of scientific writing, and its structure is worth analysing when it comes to writing a paper. Even though many details are quite hard to understand and require further reading of related materials, the essence of the paper is emphasized well enough to grasp.
The authors propose an efficient sequence modeling method called the Structured State Space (S4) model, which is inspired by the state space model. The abstract captures the essence of the paper and is quoted directly here,
A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \(x^{'}(t) = Ax(t) + Bu(t)\), \(y(t) = Cx(t) + Du(t)\), and showed that for appropriate choices of the state matrix \(A\), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space (S4) sequence model based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning A with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel.
The annotated S4 is a good reference. https://srush.github.io/annotated-s4/
A relevant paper from the same group is Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers (the LSSL paper).
State space models are used in many places; for example, they have been extensively explored in control theory. More reading on SSMs is needed to understand them better.
The SSM is defined as follows,
\begin{equation} \label{eq:ssm} \begin{split} x^{'}(t)=\mathbf{A}x(t)+\mathbf{B}u(t) \\ y(t)=\mathbf{C}x(t)+\mathbf{D}u(t) \end{split} \end{equation}
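To build intuition for Eq. \(\eqref{eq:ssm}\), here is a minimal sketch that simulates a toy continuous-time SSM with a forward-Euler integrator. The random matrices and the Euler scheme are my own illustration; the paper instead uses the HiPPO initialization for \(\mathbf{A}\) and a bilinear discretization.

```python
import numpy as np

# Toy continuous-time SSM: x'(t) = A x(t) + B u(t), y(t) = C x(t) + D u(t).
# Matrices are random placeholders (NOT the HiPPO initialization), and the
# integrator is plain forward Euler, chosen only for illustration.
rng = np.random.default_rng(0)
N = 4                                              # state dimension
A = 0.1 * rng.standard_normal((N, N)) - np.eye(N)  # eigenvalues near -1: stable
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = rng.standard_normal((1, 1))

dt, steps = 1e-3, 1000
x = np.zeros((N, 1))
ys = []
for k in range(steps):
    u = np.sin(2 * np.pi * k * dt)       # arbitrary scalar input signal
    x = x + dt * (A @ x + B * u)         # forward-Euler update of the state
    ys.append((C @ x + D * u).item())
```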
As stated in Section 2.1,
SSMs are broadly used in many scientific disciplines and related to latent state models such as Hidden Markov Models (HMM). Our goal is to simply use the SSM as a black-box representation in a deep sequence model, where A, B, C, D are parameters learned by gradient descent.
Interesting questions to ask are how the SSM relates to HMMs and how the parameters are trained with gradient descent, but these are not quite relevant to this paper.
The HiPPO matrix
Instead of randomly initializing \(\mathbf{A}\) in Eq. \(\eqref{eq:ssm}\), it is specified as the HiPPO matrix described in the paper,
\begin{equation} \label{eq:hippo} \mathbf{A}_{nk} = - \begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k. \end{cases} \end{equation}
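As a sketch, the HiPPO matrix can be constructed directly in a few lines; this mirrors the construction in the annotated S4, and the name `make_hippo` is my own.

```python
import numpy as np

def make_hippo(N):
    """Build the N x N HiPPO-LegS matrix: A_{nk} = -sqrt((2n+1)(2k+1))
    for n > k, -(n+1) for n = k, and 0 for n < k."""
    n = np.arange(N)
    A = np.sqrt((2 * n[:, None] + 1) * (2 * n[None, :] + 1))
    A = np.tril(A, -1) + np.diag(n + 1.0)  # strict lower triangle + diagonal
    return -A

A = make_hippo(4)
# A is lower triangular; e.g. A[1, 0] = -sqrt(3) and A[2, 2] = -3.
```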
Why this matrix matters requires more reading of the HiPPO paper, but the performance gain it brings is striking, as stated in this paper:
For example, the LSSL (the Linear State Space Layer paper mentioned above, I added) found that simply modifying an SSM from a random matrix \(\mathbf{A}\) to equation (2) improved its performance on the sequential MNIST benchmark from 60% to 98%.
Converting a continuous-time signal to a discrete-time signal is typical in digital signal processing; there is usually a sampling rate \(F_s\) and an associated step size \(\Delta\). There is a large literature on converting a continuous-time SSM to a discrete-time SSM. This paper gives a rough idea and shows the equations after discretization; I assume the connection between continuous-time and discrete-time SSMs, along with the trade-offs, limitations, and properties, is discussed in previous literature, such as this one
The Eq. \(\eqref{eq:ssm}\) becomes,
\begin{equation} \label{eq:discrete-time-ssm} \begin{split} &x_{k}= \bar{\mathbf{A}} x_{k-1} + \bar{\mathbf{B}}u_{k} \quad \bar{\mathbf{A}}=(\mathbf{I} - \Delta /2 \cdot \mathbf{A})^{-1}(\mathbf{I} + \Delta /2 \cdot \mathbf{A}) \\ &y_{k}= \bar{\mathbf{C}} x_{k} \quad \quad \bar{\mathbf{B}}=(\mathbf{I} - \Delta /2 \cdot \mathbf{A})^{-1} \Delta \mathbf{B} \quad \bar{\mathbf{C}}=\mathbf{C}. \end{split} \end{equation}
Note that \(\mathbf{D}u(t)\) is omitted for simplicity.
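The bilinear discretization in Eq. \(\eqref{eq:discrete-time-ssm}\) translates directly into code. This is a sketch; the function name `discretize` and the example matrices are mine.

```python
import numpy as np

def discretize(A, B, C, step):
    """Bilinear (Tustin) discretization of a continuous SSM (A, B, C)."""
    I = np.eye(A.shape[0])
    BL = np.linalg.inv(I - (step / 2.0) * A)   # (I - Delta/2 * A)^{-1}
    Abar = BL @ (I + (step / 2.0) * A)
    Bbar = BL @ (step * B)
    return Abar, Bbar, C                       # C is unchanged

# Tiny 2-state SSM, matrices chosen arbitrarily for illustration
A = np.array([[-1.0, 0.0], [1.0, -2.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[0.5, 0.5]])
Abar, Bbar, Cbar = discretize(A, B, C, step=0.1)
```

As a sanity check, when the step size goes to zero, \(\bar{\mathbf{A}}\) approaches the identity, i.e. the discrete state barely changes per step.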
the state equation is now a recurrence in \(x_{k}\), allowing the discrete SSM to be computed like an RNN. Concretely, \(x_{k} \in \mathbb{R}^{N}\) can be viewed as a hidden state with transition matrix \(\bar{\mathbf{A}}\).
The contribution of the paper lies in how the authors efficiently compute the convolutional representation of the SSM. Unrolling the recurrence in Eq. \(\eqref{eq:discrete-time-ssm}\) gives
\begin{equation} \label{eq:conv-kernel} y = \bar{\mathbf{K}} \ast u, \quad \bar{\mathbf{K}} = (\bar{\mathbf{C}}\bar{\mathbf{B}}, \bar{\mathbf{C}}\bar{\mathbf{A}}\bar{\mathbf{B}}, \dots, \bar{\mathbf{C}}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}), \end{equation}
where \(\bar{\mathbf{K}}\) is referred to as the SSM convolution kernel, but it is non-trivial to calculate.
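As a sketch of the equivalence between the recurrent and convolutional views, the code below materializes the kernel naively via repeated matrix powers (exactly the cost the paper's structured parameterization avoids) and checks that convolving with it matches the recurrence. All names and the example matrices are mine.

```python
import numpy as np

def ssm_kernel(Abar, Bbar, Cbar, L):
    """Naively materialize Kbar = (C B, C A B, ..., C A^{L-1} B)."""
    x, K = Bbar, []
    for _ in range(L):
        K.append((Cbar @ x).item())
        x = Abar @ x
    return np.array(K)

def ssm_recurrent(Abar, Bbar, Cbar, u):
    """Run the discrete SSM as an RNN: x_k = A x_{k-1} + B u_k, y_k = C x_k."""
    x, ys = np.zeros((Abar.shape[0], 1)), []
    for uk in u:
        x = Abar @ x + Bbar * uk
        ys.append((Cbar @ x).item())
    return np.array(ys)

# Arbitrary small discrete SSM for illustration
Abar = np.array([[0.9, 0.0], [0.1, 0.8]])
Bbar = np.array([[1.0], [0.5]])
Cbar = np.array([[1.0, -1.0]])

u = np.random.default_rng(0).standard_normal(16)
K = ssm_kernel(Abar, Bbar, Cbar, len(u))
y_conv = np.convolve(u, K)[: len(u)]   # causal convolution of u with Kbar
y_rec = ssm_recurrent(Abar, Bbar, Cbar, u)
# y_conv and y_rec agree elementwise: the two views compute the same output.
```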
As can be seen in Eq. (4-5) in the paper, the computation requires repeated multiplication by \(\bar{\mathbf{A}}\), which motivates the authors to seek a structured, canonical form for \(\bar{\mathbf{A}}\). However, naively converting it to a diagonal matrix is still not practical. Instead, the authors made the observation that the HiPPO matrix can be decomposed as the sum of a normal and a low-rank matrix, and then applied the spectral theorem of linear algebra. The reasons will need more reading.
From a rough reading, the solution combines converting \(\bar{\mathbf{A}}\) into a particular form using a unitary \(\mathbf{V}\), a diagonal \(\mathbf{\Lambda}\), and a low-rank factorization \(\mathbf{P}\), \(\mathbf{Q}\), and uses a Cauchy kernel to handle the repeated multiplication by \(\bar{\mathbf{A}}\).
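A quick numerical check of the normal-plus-low-rank claim, assuming the rank-1 term \(P_n = \sqrt{n + 1/2}\) that appears in the paper's appendix and in the annotated S4; the HiPPO construction is inlined from equation (2) of the paper.

```python
import numpy as np

N = 8
n = np.arange(N)
# HiPPO-LegS matrix from Eq. (2) of the paper
A = -(np.tril(np.sqrt((2 * n[:, None] + 1) * (2 * n[None, :] + 1)), -1)
      + np.diag(n + 1.0))
# Rank-1 correction (assumed here): P_n = sqrt(n + 1/2)
P = np.sqrt(n + 0.5)
S = A + np.outer(P, P)        # so A = S - P P^T, i.e. normal + low-rank
# S equals -1/2 I plus a skew-symmetric matrix, hence S + S^T = -I
# and S commutes with its transpose, which is the definition of normal.
print(np.allclose(S + S.T, -np.eye(N)))   # True
print(np.allclose(S @ S.T, S.T @ S))      # True
```

Because \(S\) is normal, it can be unitarily diagonalized in a stable way, which is what makes the diagonal-plus-low-rank parameterization workable.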
The paper presents four theorems that summarize its main results.