Dec 12 2025

Thinking Inside the Box: The Implicit Chain Transformer for Efficient State Tracking

Motivation

Large Language Model (LLM) decoders have demonstrated remarkable capabilities in open-ended generation, reasoning, and human-computer interaction. However, the standard autoregressive formulation suffers from a representational bottleneck: to generate the next token, the model must implicitly re-derive the underlying semantic context by attending to the entire history. This statelessness renders standard Transformers surprisingly brittle on tasks necessitating the maintenance of a running state—such as calculating the sum of a list of numbers modulo X or performing graph traversal.

In this work, we introduce the Implicit Chain Transformer (ICT), a novel architecture designed to bridge this gap. By propagating a learnable "intent" latent vector forward across time steps, our method enables the model to explicitly update and contextualize a running state, rather than solely relying on the re-derivation of attention patterns over the history. Early evaluations demonstrate that the ICT achieves strong accuracy on state-intensive toy tasks - identified as challenging for Transformers by prior work - without incurring the inference latency costs associated with Chain-of-Thought (CoT) prompting. This work represents a foundational step towards our broader objective: enabling Transformers to perform efficient, robust reasoning entirely within latent space.

The Challenge: The Stateless Bottleneck

Standard Transformers suffer from a fundamental representational bottleneck: they are stateless. To generate token t, the model must re-derive the current context by attending to all previous tokens. It cannot simply "remember" the current state (e.g., "sum is 5"); it must implicitly recalculate it at every step. This makes them brittle on tasks requiring deep sequential dependency, leading to hallucinations as context grows.

Task Setup: Modulo Arithmetic & Graph Traversal

We target two tasks where "approximate" attention can face challenges, as noted in prior work [1]:

  • Sum Modulo X: Requires maintaining a precise running total; a single "carry" error corrupts the entire future sequence. In principle, this is a parallel reduction task in which each token could recompute the answer from scratch instead of building on the previous partial result. However, as the sequence length increases, this per-token recomputation becomes too expensive for the model.
  • Graph Traversal: Requires tracking a path through a network. The model must respect connectivity, which creates strict sequential dependence across tokens and sets a much higher bar for the quality of state tracking and propagation. (A minimal data-generation sketch for both tasks follows this list.)
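To make the two tasks concrete, here is a minimal data-generation sketch. The exact task formats, vocabularies, and sequence lengths used in our experiments are not reproduced here; the values below are illustrative assumptions.

```python
import random

def sum_modulo_task(length=64, modulus=37, max_val=9, seed=0):
    """Sum Modulo X: at each position the target is the running sum of the
    inputs so far, reduced modulo `modulus`. A single error corrupts every
    later target."""
    rng = random.Random(seed)
    xs = [rng.randint(0, max_val) for _ in range(length)]
    running, targets = 0, []
    for x in xs:
        running = (running + x) % modulus
        targets.append(running)
    return xs, targets

def graph_traversal_task(num_nodes=16, length=64, seed=0):
    """Permutation-style graph traversal: follow a random permutation (a graph
    where every node has exactly one outgoing edge) from a start node. Each
    target depends strictly on the previous state."""
    rng = random.Random(seed)
    perm = list(range(num_nodes))
    rng.shuffle(perm)                  # edge: node i -> perm[i]
    node = rng.randrange(num_nodes)
    path = [node]
    for _ in range(length):
        node = perm[node]              # the next node is determined by connectivity
        path.append(node)
    return perm, path
```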

The Proposal: Implicit Chain Transformer

We introduce the Implicit Chain Transformer (ICT). Unlike standard models that restrict information flow to vertical (layer-to-layer) movement, the ICT propagates a learnable "Intent Vector" (z_t) horizontally across time steps. This vector acts as a compressed working memory: the model "writes" the current logical state into z_t and passes it to step t+1, maintaining continuity without re-processing the entire history. We investigate two distinct strategies for this state propagation:

1. Autoregressive Intent Propagation (Dense)

In this formulation, the state is updated continuously. For every single token generation step, we capture the latent vector from the final layer of token t and inject it into the early layers of token t+1. This mirrors standard autoregressive decoding but adds a persistent "memory stream" alongside the token stream, allowing the model to micro-manage state at the word level.
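As a rough illustration of the dense variant at decode time, consider the sketch below. The interface is assumed: `model` returns its final-layer hidden states alongside logits and accepts an optional `intent` that is fused after the first decoder layer, and `gen_mlp` projects a hidden state into the intent vector. This is a simplified sketch (greedy decoding, no KV caching), not the exact implementation.

```python
import torch

@torch.no_grad()
def ict_dense_decode(model, gen_mlp, input_ids, max_new_tokens):
    """ICT-Dense decoding sketch: the final-layer latent written at step t is
    projected into an intent vector z_t and injected at step t+1."""
    z = None  # no intent is available before the first generated token
    for _ in range(max_new_tokens):
        hidden, logits = model(input_ids, intent=z)       # hidden: (B, T, D)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        z = gen_mlp(hidden[:, -1])                        # "write" the running state
    return input_ids
```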

2. Periodic Intent Propagation (Sparse)

Here, we decouple state updates from token generation. We inject special <THINK> tokens at regular intervals in the input stream (a minimal decoding sketch follows the list below).

  • Mechanism: We restrict intent propagation to occur only at these <THINK> boundaries.
  • Intuition: This forces the model to treat these tokens as "semantic checkpoints", aggregating the preceding context into a coherent summary before moving on. By focusing state updates only where the model is explicitly developing high-level concepts, we align the intent propagation architecture with the logical structure of the data rather than arbitrary token boundaries. This approach also offers significant advantages for inference optimization, which we will detail in a future post.
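The sketch below illustrates the sparse variant under the same assumed interface as the dense sketch above. The <THINK> interval, the token id, and emitting the checkpoint token during decoding are illustrative assumptions; during training, the <THINK> tokens are injected into the input stream at fixed intervals.

```python
import torch

@torch.no_grad()
def ict_sparse_decode(model, gen_mlp, input_ids, max_new_tokens,
                      think_token_id, think_interval=16):
    """ICT-Sparse decoding sketch: the intent vector is refreshed only at
    periodic <THINK> checkpoints; between checkpoints the last intent is reused."""
    z = None
    steps_since_think = 0
    for _ in range(max_new_tokens):
        hidden, logits = model(input_ids, intent=z)
        if steps_since_think == think_interval:
            # Semantic checkpoint: append <THINK> and compress the preceding
            # block into a fresh intent vector.
            next_token = torch.full_like(input_ids[:, :1], think_token_id)
            z = gen_mlp(hidden[:, -1])
            steps_since_think = 0
        else:
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
            steps_since_think += 1
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```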

Training with "Iterative Latent Feedback"

The primary challenge with recurrence is that it typically breaks the parallel training efficiency that makes Transformers scalable. To enable intent propagation without reverting to the slow, sequential training of RNNs, we introduce a multi-pass approximation:

  1. Parallel Pass: We process the entire sequence in parallel (standard Transformer mode) to generate initial latent representations.
  2. Feedback Injection: We take the latent vector from the final layer of the first pass, project it via an MLP, and fuse it as the "intent" input for the early layers.
  3. Refinement Pass: We perform a second forward pass with this fused context to compute the final loss.

While this introduces a fixed computational overhead (additional forward passes), the cost is constant with respect to sequence length. This preserves the O(1) sequential-step complexity of parallel Transformer training and avoids the O(N) sequential cost of Backpropagation Through Time (BPTT) in RNNs.
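The three steps above can be summarized in the sketch below. The module interfaces follow the same assumptions as the earlier sketches; the stop-gradient on the fed-back latent and the one-position shift (so that step t+1 receives the intent written at step t) are illustrative design choices rather than settled details (gradient flow is revisited under Future Work).

```python
import torch
import torch.nn.functional as F

def iterative_latent_feedback_step(model, gen_mlp, input_ids, targets):
    """One training step of the two-pass Iterative Latent Feedback scheme."""
    # 1. Parallel pass: standard teacher-forced forward over the full sequence.
    hidden, _ = model(input_ids)                        # hidden: (B, T, D)

    # 2. Feedback injection: project final-layer latents into intent vectors.
    #    Detaching here is one option for stabilizing training (assumption).
    intent = gen_mlp(hidden.detach())                   # (B, T, D)
    # Shift right so that position t+1 receives the intent written at position t;
    # the first position gets a zero intent.
    intent = torch.cat([torch.zeros_like(intent[:, :1]), intent[:, :-1]], dim=1)

    # 3. Refinement pass: fuse the intent into the early layers, compute the loss.
    _, logits = model(input_ids, intent=intent)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return loss
```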

Early Evaluation

We evaluate on the above tasks using a standard GPT-2 style decoder-only model as the baseline; for these evaluations we use 3-layer and 8-layer decoder configurations.

Implicit Chain Transformer Variations: We layer our modifications on the baseline model as follows:

1. Intent Vector: A new, dedicated MLP network, called the GenMLP, reads the output from an intermediate layer (the second-to-last layer in the 8-layer decoder model) or the final layer (in the 3-layer decoder model) of the LLM.

2. Intent Propagation: The Intent Vector is fused with the output of the first decoder layer using a dedicated Fuse-Intent network.

3. GenMLP network: This network processes the intermediate output to generate the Intent Vector. Its architecture is as follows:

  1. num_hidden_layers=2
  2. expansion_factor=4
  3. activation=gelu
  4. norm_type=layernorm
  5. use_residual=True

4. Fuse-Intent network: This network performs the intent propagation by fusing the first decoder layer's output with the Intent Vector.
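To make these components concrete, here is a minimal PyTorch sketch. The GenMLP follows the hyperparameters listed above; the Fuse-Intent design shown (concatenate, project, add residually, then normalize) is an assumption for illustration, since its exact architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class GenMLP(nn.Module):
    """Reads an intermediate/final decoder hidden state and emits the Intent
    Vector (2 hidden layers, 4x expansion, GELU, LayerNorm, residual)."""
    def __init__(self, d_model, expansion_factor=4, num_hidden_layers=2):
        super().__init__()
        blocks = []
        for _ in range(num_hidden_layers):
            blocks += [nn.Linear(d_model, expansion_factor * d_model),
                       nn.GELU(),
                       nn.Linear(expansion_factor * d_model, d_model)]
        self.mlp = nn.Sequential(*blocks)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h):
        return self.norm(h + self.mlp(h))     # residual connection + layernorm

class FuseIntent(nn.Module):
    """Fuses the first decoder layer's output with the Intent Vector
    (hypothetical design: concatenate, project, residual add, normalize)."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_layer1, intent):
        fused = self.proj(torch.cat([h_layer1, intent], dim=-1))
        return self.norm(h_layer1 + fused)
```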

Early Evaluation Findings

We observe that ICT consistently and significantly outperforms the baseline across all tasks. We plot accuracy as a function of token position (or block index), effectively measuring the model's ability to maintain state as the computation depth increases.

  • Baseline Limitations: The baseline model (pink line) reveals how standard Transformers fail in different ways depending on task complexity.
    • For Modulo 37 and Permutation Prediction, we see a catastrophic collapse: the model maintains perfect accuracy for a fixed window (likely fitting within its effective attention span) before plummeting to near-zero accuracy almost instantly.
    • For Modulo 97, we observe a smooth, progressive degradation. As the difficulty of the larger modulus accumulates, the model's ability to attend to the full history erodes steadily rather than breaking suddenly. This warrants further investigation to understand the cause for this gradual decline.
  • Modulo Addition: Both the ICT-Dense and ICT-Sparse variants maintain high accuracy significantly longer than the baseline. This validates the core hypothesis: propagating a latent vector prevents the need for state recomputation that plagues standard Transformers.
  • Graph Traversal (Permutation): For the permutation task, the distinction between the two ICT variants becomes clearer. While both outperform the baseline, the ICT-Dense (green) model maintains stability for a larger number of blocks compared to the ICT-Sparse (purple) model.
    • Hypothesis: We attribute this gap to the challenge of compression. Since our prototype is a relatively shallow network (three layers), the Sparse model likely struggles to compress the full block representation into the periodic "think" tokens effectively. The Dense model, by updating its state at every step, avoids this information bottleneck and sustains accuracy for longer sequences.

Future Work

While these early results are promising, they represent only the first step toward robust latent reasoning. We plan to extend this work along several key directions:

  • Inference Formulation and Optimization: The current evaluation focuses on accuracy, but the architectural benefits of the ICT - particularly the sparse formulation - extend significantly to speed. We are currently sketching out a detailed analysis of the inference mechanics and latency trade-offs, which we will share in a follow-up post.
  • Scaling the Sparse Formulation: We believe the ICT-Sparse approach offers the best balance between representational richness and computational cost. However, its current performance gap on the permutation task suggests we need further ablations to understand its scaling behavior, specifically how it behaves with deeper networks and across a more varied set of reasoning tasks.
  • Training Dynamics & Stability: A critical open question remains the stability of long-range dependency training. We plan to rigorously investigate methods to encourage richer latent representations, including comparing full Backpropagation Through Time (BPTT) against stop-gradient techniques (like those used in MuZero or standard recurrent detachments) to balance gradient stability with multi-step iteration of state representation.