\textit{Mersenne Twister Generators} (MT) are based on a word-wise matrix recurrence, rather than the bit-wise recurrence of LFSRs. The first implementation of this family of generators used a generalized version of the feedback shift register called the \emph{Generalised Feedback Shift Register} (GFSR). It generates its output from a state vector (an array of words) following the recurrence $x_{k+N} = x_{k+M} \oplus x_{k}$, where $M$ is the middle word, $1 < M < N$. However, GFSRs are known not to reach the maximum possible period. To resolve this, a twisted matrix recurrence for the GFSR, the "TGFSR", was proposed to increase the period. It uses the feedback recurrence $x_{k+N} = x_{k+M} \oplus (x_{k}^{u} \mid x_{k+1}^{l})A$, for $k = 1, 2, 3, \dots$, where $x_{k}^{u}$ denotes the most significant bits of $x_{k}$, $x_{k+1}^{l}$ the least significant bits of $x_{k+1}$, and $A$ is the twist matrix. To make the TGFSR outputs more uniform, each word is tempered by a bitwise multiplication with a binary matrix $T$, $y = x_{k}T$, which improves the statistical properties by raising the dimension of equidistribution of the output. With the same goal of a long-period PRNG, the Mersenne Twister (MT) was proposed as a special case of the TGFSR. Two main implementations of MT can be found: MT11213, with a period of $2^{11213}-1$, and MT19937, with a period of $2^{19937}-1$.
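For concreteness, the sketch below shows a plain software version of the MT19937 recurrence and tempering (state size $N = 624$, middle word $M = 397$, twist matrix $A$ encoded by the constant \texttt{0x9908b0df}); it follows the reference algorithm, while the structure and variable names are our own.

\begin{verbatim}
#include <stdint.h>

#define N 624
#define M 397
#define MATRIX_A   0x9908b0dfUL  /* last row of the twist matrix A      */
#define UPPER_MASK 0x80000000UL  /* most significant bit of x_k         */
#define LOWER_MASK 0x7fffffffUL  /* 31 least significant bits of x_k+1  */

static uint32_t mt[N];
static int mti = N;              /* mt_seed() must be called first */

/* seed the state vector, as in the reference implementation */
void mt_seed(uint32_t s) {
    mt[0] = s;
    for (mti = 1; mti < N; mti++)
        mt[mti] = 1812433253UL * (mt[mti-1] ^ (mt[mti-1] >> 30))
                  + (uint32_t)mti;
}

uint32_t mt_next(void) {
    uint32_t y;
    if (mti >= N) {              /* regenerate the whole state block */
        for (int k = 0; k < N; k++) {
            /* twisted recurrence: x_{k+N} = x_{k+M} ^ (x_k^u | x_{k+1}^l) A */
            y = (mt[k] & UPPER_MASK) | (mt[(k+1) % N] & LOWER_MASK);
            mt[k] = mt[(k+M) % N] ^ (y >> 1) ^ ((y & 1) ? MATRIX_A : 0);
        }
        mti = 0;
    }
    y = mt[mti++];
    /* tempering y = x_k T, realised as shifts, masks and XORs */
    y ^= (y >> 11);
    y ^= (y << 7)  & 0x9d2c5680UL;
    y ^= (y << 15) & 0xefc60000UL;
    y ^= (y >> 18);
    return y;
}
\end{verbatim}

Note that the twisted term $(x_{k}^{u} \mid x_{k+1}^{l})A$ reduces to a one-bit shift and a conditional XOR with a constant, and the tempering to four shift-and-mask stages, which is what makes the recurrence attractive for hardware mapping.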
We can start with some implementations of the TGFSR described in \cite{banks2008fpga}. Here, the authors compare two FPGA implementation approaches. The first version implements the two Mersenne Twister PRNGs "MT19937" and "MT11213" using \emph{Generalised Feedback Shift Registers} (GFSRs), without any hardware optimization. Then, two optimized versions, "Ran" and "Ranq1", are proposed, based only on multiplications of fixed-precision integers (with overflow). The comparison shows that the two generators of the second version achieve a much better area optimization than the first version, which uses block RAMs for its multiplications.
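As a rough illustration of this multiply-with-overflow style, the minimal sketch below updates a 64-bit state with shifts and XORs and outputs its product with a fixed odd constant, letting the multiplication wrap modulo $2^{64}$; the constants are illustrative only and need not match those of the "Ran"/"Ranq1" designs evaluated in the cited work.

\begin{verbatim}
#include <stdint.h>

/* Illustrative generator in the multiply-with-overflow style: the update
 * uses only XORs, shifts and one 64-bit multiplication that wraps modulo
 * 2^64, so a hardware mapping needs a multiplier and a register, no RAM. */
typedef struct { uint64_t v; } xsm_state;

static void xsm_seed(xsm_state *s, uint64_t seed) {
    s->v = seed ^ 0x9e3779b97f4a7c15ULL;  /* avoid the all-zero state */
    if (s->v == 0) s->v = 1;
}

static uint64_t xsm_next(xsm_state *s) {
    uint64_t v = s->v;
    v ^= v >> 21;                         /* xorshift state update    */
    v ^= v << 35;
    v ^= v >> 4;
    s->v = v;
    return v * 2685821657736338717ULL;    /* overflow = intended mod 2^64 */
}
\end{verbatim}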
As for the Mersenne Twister itself, the authors of \cite{wu2013hardware} propose a 3-stage pipelined version of MT19937. It uses the dual-port BRAM memories of the FPGA to store the long-period state and to perform the multiplication operation; however, it takes three cycles for every generated sample. On the other hand, in \cite{li2012efficient}, the authors present an MT implementation with just three reads for every write into the RAM structure, which yields a single sample per cycle. Another pipelined MT19937 implementation is given in \cite{chandrasekaran2008high}, using a single block RAM to store the state vector. The same approach is taken in \cite{tian2009mersenne} for "MT2203", but using multi-ported blocks of RAM to reduce hardware resources instead of multiple (or single) RAMs with a limited number of ports. In \cite{dalal2008fast}, the authors propose two parallel versions of "MT19937" that reduce the I/O ports and RAM memory banks compared with the two earlier parallel MT designs. The first version is the \emph{Interleaved Parallelization} (IP), which generates an N-bit output from each of the "P" memory banks separately, over a period of $T = (P/N_{\mathrm{bits}}) \cdot (N_{\mathrm{IO}}/\mathrm{clock})$. The second version is the \emph{Chunked Parallelization} (CP), which is claimed to use fewer block RAMs than the IP version; it is based on the idea of using the output bits of each RAM bank as the far-recurrence input of the following RAM bank.
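The memory-traffic constraint behind these designs can be made explicit: with the state vector kept in a word-addressed RAM, every MT19937 output needs three reads and one write at fixed offsets from the running index. The hypothetical helper \texttt{mt\_addresses} below only models this address pattern, which is what the pipelined and multi-port schemes above have to schedule.

\begin{verbatim}
#include <stdint.h>

#define N 624   /* MT19937 state words */
#define M 397   /* middle-word offset  */

/* Per-output memory transactions when the state lives in a RAM bank:
 * three read addresses and one write address, all derived from the
 * running index i by fixed offsets. */
typedef struct { uint32_t rd[3]; uint32_t wr; } mt_access;

static mt_access mt_addresses(uint32_t i) {
    mt_access a;
    a.rd[0] = i % N;          /* x_i     : upper bits                */
    a.rd[1] = (i + 1) % N;    /* x_{i+1} : lower bits                */
    a.rd[2] = (i + M) % N;    /* x_{i+M} : far recurrence term       */
    a.wr    = i % N;          /* x_{i+N} overwrites the slot of x_i  */
    return a;
}
\end{verbatim}

With a single-port RAM these four transactions force several cycles per sample, whereas dual-port or multi-port storage allows them to be overlapped, which is exactly the trade-off explored in the works above.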
Finally, a more recent work is presented in \cite{echeverria2013high}. Here, the authors compare a straightforward implementation of MT19937 that uses RAM memory banks as the storage element with a new alternative to RAMs called a \emph{circular buffer}. The new solution is based on the fixed relationship between the indexes of the words: at each step of the recurrence, $x_{t}$ is replaced by $x_{t+k}$ in the working area, so the linear recurrence and the buffer of registers can be treated as a circular buffer. The linear recurrence is then carried out by some combinational logic placed between the input and the output of the buffer. Hence, the architecture is simplified, as no logic for the table indexes is needed.
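A behavioural sketch of this idea, assuming the buffer is modelled as a software FIFO of $N$ registers, is given below: the recurrence always taps the same three positions of the buffer, so no index arithmetic is required. In hardware the shift would be the register chain (or a rotating pointer), not an actual data copy.

\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define N 624
#define M 397
#define MATRIX_A   0x9908b0dfUL
#define UPPER_MASK 0x80000000UL
#define LOWER_MASK 0x7fffffffUL

/* Circular-buffer view of MT19937: the state is a FIFO of N words, the new
 * word is formed from three fixed taps (positions 0, 1 and M), and the
 * buffer shifts by one word per cycle. */
typedef struct { uint32_t reg[N]; } mt_fifo;

static uint32_t fifo_step(mt_fifo *f) {
    /* fixed taps: x_t, x_{t+1}, x_{t+M} always sit at the same positions */
    uint32_t y  = (f->reg[0] & UPPER_MASK) | (f->reg[1] & LOWER_MASK);
    uint32_t xn = f->reg[M] ^ (y >> 1) ^ ((y & 1) ? MATRIX_A : 0);

    /* shift the whole buffer and push the new word at the tail */
    memmove(&f->reg[0], &f->reg[1], (N - 1) * sizeof(uint32_t));
    f->reg[N - 1] = xn;

    /* tempering of the freshly produced word */
    xn ^= (xn >> 11);
    xn ^= (xn << 7)  & 0x9d2c5680UL;
    xn ^= (xn << 15) & 0xefc60000UL;
    xn ^= (xn >> 18);
    return xn;
}
\end{verbatim}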