Unfolding a recursive/recurrent computation into a computational graph (CG) yields a graph with a repetitive structure (e.g., a chain of events).

Unfolding such a graph results in parameters being shared across a deep network structure.

Classical form of a dynamic system:

s^t = f(s^{t-1}; w),

where s^t is the system state and w denotes the parameters.
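A minimal sketch of what unfolding this recurrence means (the transition f here is a hypothetical scalar decay, chosen only for illustration): applying the same f with the same w for a fixed number of steps turns the recurrence into a chain.

```python
# Hypothetical transition function f(s, w); any f would do.
def f(s, w):
    return w * s  # simple decay toward zero

def unfold(s0, w, steps):
    """Unroll s^t = f(s^{t-1}; w) for `steps` time steps."""
    s = s0
    for _ in range(steps):  # each step applies the SAME f with the SAME w
        s = f(s, w)
    return s

# s^0 = 1.0 -> 0.5 -> 0.25 -> 0.125
print(unfold(s0=1.0, w=0.5, steps=3))
```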

A dynamical system driven by an external signal x^t:

s^t = f(s^{t-1}, x^t; w).

Essentially any function can be viewed as an FNN; any function involving recurrence can be viewed as an RNN.

In an RNN, the hidden units of the network represent the state, so rewriting the above gives:

h^t = f(h^{t-1}, x^t; w).

Typical RNNs add extra architectural features (e.g., output layers that read info from the state h to make predictions).
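A minimal NumPy sketch of this update, assuming a conventional tanh cell (the parameter names W, U, b, V, c and the tanh nonlinearity are common choices, not specified above): the same parameters w = (W, U, b, V, c) are reused at every time step, and an output layer reads the state h.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input, n_output = 4, 3, 2

# Shared parameters: the SAME arrays are used at every time step.
W = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # state-to-state
U = rng.standard_normal((n_hidden, n_input)) * 0.1   # input-to-state
b = np.zeros(n_hidden)
V = rng.standard_normal((n_output, n_hidden)) * 0.1  # state-to-output
c = np.zeros(n_output)

def step(h_prev, x):
    # h^t = f(h^{t-1}, x^t; w), here with a tanh cell
    return np.tanh(W @ h_prev + U @ x + b)

def run(xs):
    h = np.zeros(n_hidden)            # h^0
    outputs = []
    for x in xs:                      # works for ANY sequence length
        h = step(h, x)
        outputs.append(V @ h + c)     # output layer reads the state h
    return h, outputs

h_T, ys = run([rng.standard_normal(n_input) for _ in range(5)])
```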

In a predictive RNN:

- h^t is used as a kind of *lossy summary* of the task-relevant aspects of the past sequence of inputs up to time t. This summary is necessarily lossy, since it maps an arbitrary-length sequence S = [x^t, x^{t-1}, x^{t-2}, …, x^{2}, x^{1}] to a fixed-length vector h^t.
- depending on the training criterion, h^t might selectively keep some aspects of the past sequence with more precision than others.
- e.g., in RNN-based statistical language modeling (predict the next word given the previous words), storing only enough info to predict the rest of the sentence is sufficient; there is no need to store all the info in the input sequence up to time t.
- the most demanding case is when we ask h^t to be rich enough to approximately recover the input sequence, as in autoencoder frameworks.

Let’s define a function g^t:

h^t = g^t (S) = g^t (x^t, x^{t-1}, x^{t-2}, …, x^{2}, x^{1}) = f(h^{t-1}, x^t; w)

The function g^t takes the whole past sequence S as input and produces the current state h^t.
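In other words, g^t is just f composed with itself t times. A minimal sketch (with a hypothetical linear transition f, for illustration only) showing that applying f one more step to g's result equals calling g on the extended sequence:

```python
def f(h_prev, x, w):
    # hypothetical transition: h^t = w * h^{t-1} + x^t
    return w * h_prev + x

def g(xs, w, h0=0.0):
    """g^t: consume the whole past sequence [x^1, ..., x^t], return h^t."""
    h = h0
    for x in xs:        # iterate f over the sequence
        h = f(h, x, w)
    return h

xs = [1.0, 2.0, 3.0]
# g^{t+1}(S, x^{t+1}) == f(g^t(S), x^{t+1}; w)
assert f(g(xs, w=0.5), 4.0, 0.5) == g(xs + [4.0], w=0.5)
```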

The unfolding process introduces two major advantages:

- Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another, rather than in terms of a variable-length history of states.
- It is possible to use the *same* transition function f with the *same* parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^t for all possible time steps. Learning a single shared model allows generalization to sequence lengths that did not appear in the training set, and enables the model to be estimated with far fewer training examples than would be required without parameter sharing.

In other words, it is possible to:

- learn just a single model f, which:
    - allows generalization to unseen sequence lengths
    - can be estimated with far fewer training samples
- avoid learning a separate model g^t for every possible time step

Comparison between the recurrent graph and the unrolled graph:

- the recurrent graph is succinct
- the unfolded graph is an explicit description; it shows info flowing forward in time (computing outputs and losses) and backward in time (computing gradients)

[sundeepblue] Does parameter sharing mean we just use the same w to compute every h^t? Yes: the same function f with the same parameters w is applied at every time step.

In essence, two weapons in this section:

- **graph-unrolling**
- **parameter-sharing**