Releasing the First Open 1B+ Language Model with Hyperconnections

Today we are releasing Goedel-mHC-1B — the first open 1B+ pretrained language model using multi-stream Hyperconnections (mHC). The weights are available on HuggingFace under the Apache 2.0 license, along with a standard transformer baseline trained under identical conditions. Both models saw the same data and the same compute; architecture is the only variable.

What are Hyperconnections?

Every transformer passes information through a single residual stream. Each layer reads from it, transforms what it reads, and writes the result back. This is the “residual connection” you see in every architecture diagram — the arrow that skips around each layer and adds the output back in. It works, but it is an information bottleneck. Every layer must share this one channel, and information from early layers can get diluted or overwritten by later ones.
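The single-stream pattern is easy to see in code. Below is a minimal NumPy sketch of a residual stream passing through a stack of toy sublayers; the sizes, the `tanh` stand-in sublayer, and the omission of normalization are all illustrative simplifications, not the model's actual layers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative hidden size (the real model uses 2,048)

def sublayer(x, W):
    # Stand-in for an attention or FFN sublayer: read the stream, transform it.
    return np.tanh(x @ W)

Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]

x = rng.standard_normal(d)      # the single residual stream
for W in Ws:
    x = x + sublayer(x, W)      # each layer writes back by *adding* to the stream
```

Every layer reads from and writes to the same `x`, which is exactly the shared-channel bottleneck described above.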

Hyperconnections, introduced by Zhu et al. (2024), replace this single stream with multiple parallel streams. Think of it as giving the model several independent channels to carry information through the network, rather than forcing everything through one pipe. Between layers, the streams interact through learned mixing matrices; the model figures out how to route information between channels during training.

Our implementation uses the “manifold-constrained” variant (mHC) from Wenfeng et al., 2024, with 4 parallel streams. Each stream carries the full hidden dimension (2,048), so the inter-block representation is 8,192-dimensional — the model has four times the bandwidth between layers to carry information through the network. The mixing matrices are constrained to be doubly-stochastic via Sinkhorn-Knopp iterations on a small 4x4 logit matrix, which keeps training stable by ensuring that information is redistributed across streams rather than concentrated or destroyed. A learned pre-mixing vector combines all four streams into a single input for each sublayer (attention or FFN), and a learned post-mixing vector distributes the sublayer output back across all streams.
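A minimal NumPy sketch of this mixing step, assuming a plain Sinkhorn-Knopp projection (exponentiate the logits, then alternately normalize rows and columns). The iteration count and the uniform pre-mixing vector are illustrative assumptions, not the trained values:

```python
import numpy as np

def sinkhorn(logits, n_iters=20):
    """Project a square logit matrix toward the doubly-stochastic set
    by alternately normalizing rows and columns of exp(logits)."""
    M = np.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

rng = np.random.default_rng(0)
n_streams, d = 4, 2048
H = rng.standard_normal((n_streams, d))   # 4 parallel streams, 8,192 dims total

# Inter-stream mixing with a doubly-stochastic 4x4 matrix.
mix = sinkhorn(rng.standard_normal((n_streams, n_streams)))
H_mixed = mix @ H

# Pre-mixing: collapse the streams into one d-dim sublayer input
# (uniform weights here purely for illustration).
pre = np.full(n_streams, 1.0 / n_streams)
sub_in = pre @ H_mixed
```

The doubly-stochastic constraint is what "redistributed rather than concentrated or destroyed" means concretely: every row and every column of the mixing matrix sums to 1, so mixing reshuffles mass between streams without amplifying or losing it.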

A key property of this design: at initialization, mHC exactly recovers standard pre-norm residual connections. The model starts as a normal transformer and gradually learns to use the extra streams during training. In the worst case, training simply ignores the extra capacity, so the architecture starts from a known-good configuration rather than a risky departure from it. This gave us confidence it was worth validating at the 1B scale with open weights.

What we built

Goedel-mHC-1B is not just Hyperconnections bolted onto a standard transformer. The full architecture combines four innovations drawn from recent papers and the NanoGPT speedrun community:

  • Gated GQA (inspired by Qwen): Grouped-query attention with a learned sigmoid output gate. The gate eliminates attention-sink tokens (first-position tokens that absorb disproportionate attention weight) and prevents the bf16 loss spikes that plague standard attention at scale.
  • ReLU-squared (from the NanoGPT speedrun lineage): The feed-forward network uses relu(x)^2 instead of SwiGLU. Squared ReLU produces sparser activations, is simpler, and is easier for the compiler to fuse.
  • mHC with 4 streams: Multi-stream residual connections with Sinkhorn-constrained mixing, as described above. The inter-block hidden state is 8,192-dimensional (4 streams of 2,048).
  • NorMuon optimizer: Muon for all 2D weight matrices, Adam for 1D parameters and embeddings, with a trapezoidal learning-rate schedule. Muon orthogonalizes each weight-matrix update via Newton-Schulz iterations, which significantly accelerates training of large matrices.
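Two of these pieces are simple enough to sketch directly. Below is a hedged NumPy illustration of the sigmoid output gate and the relu(x)^2 feed-forward; the gate parameterization (computed here from the sublayer input) and all the sizes are assumptions for illustration, not the model's actual wiring:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 16, 64  # tiny illustrative sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.standard_normal(d)  # sublayer input

# Gated attention output (sketch): a learned sigmoid gate scales the
# attention output elementwise before it is added to the residual stream.
attn_out = rng.standard_normal(d)           # stand-in for the attention output
W_gate = rng.standard_normal((d, d)) * 0.1
gated = sigmoid(x @ W_gate) * attn_out      # gate values lie in (0, 1)

# ReLU-squared FFN: relu(x)^2 in place of SwiGLU's gated activation.
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1
h = np.maximum(x @ W1, 0.0) ** 2            # negative pre-activations -> exactly 0
ffn_out = h @ W2
```

Note the sparsity claim is visible here: every negative pre-activation maps to exactly zero, and squaring keeps the activation smooth at the origin.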

The model has 1,009M parameters, 24 layers, and a hidden dimension of 2,048. It uses RoPE positional encoding, QK-norm for training stability, weight tying between the embedding and output projection, and the Liger fused cross-entropy kernel to avoid ever materializing the full logit tensor during training. The entire model compiles cleanly with torch.compile in max-autotune mode — getting mHC’s Sinkhorn iterations and multi-stream mixing to play nicely with the Triton compiler was nontrivial, but the result is zero graph breaks.

Results

We trained both Goedel-mHC-1B and a standard transformer baseline on 20B tokens of FineWeb-Edu, under identical conditions: 8x NVIDIA H200 SXM GPUs, same data order, same token budget. The baseline uses standard GQA (no output gate), SwiGLU FFN, pre-norm residual connections, and AdamW with a cosine schedule — a conventional, well-tuned architecture at 1,185M parameters.

| Benchmark        | Goedel-mHC-1B (1.01B) | Baseline (1.19B) |
|------------------|-----------------------|------------------|
| BPB (wikitext-2) | 1.087                 | 1.130            |
| HellaSwag        | 39.7%                 | 36.2%            |
| ARC-Easy         | 57.8%                 | 52.8%            |
| ARC-Challenge    | 24.3%                 | 23.9%            |
| WinoGrande       | 54.9%                 | 53.1%            |

3.8% lower BPB, wins on every downstream benchmark, with 15% fewer parameters.
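The headline numbers follow from simple arithmetic (assuming the BPB improvement is measured relative to the baseline):

```python
# Relative BPB improvement: (baseline - mHC) / baseline.
mhc_bpb, base_bpb = 1.087, 1.130
bpb_gain = (base_bpb - mhc_bpb) / base_bpb   # ~0.038 -> the "3.8%" figure

# Parameter reduction: (baseline - mHC) / baseline, in millions.
param_cut = (1185 - 1009) / 1185             # ~0.149 -> "15% fewer parameters"

print(f"BPB: {bpb_gain:.1%}, params: {param_cut:.1%}")
```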

Both models are trained on 20B tokens — well short of the 1-4T that modern 1B models see. At the same data and compute budget, the mHC stack gets better results with fewer parameters.

Cost

Total R&D cost was under $1,000, including failed runs. One researcher, rented H200s on Vast.ai, open-source tooling, and Claude Code as a development partner for infrastructure, debugging, and cloud orchestration.

What’s next

We are currently running Goedel-mHC-1B on 100B additional tokens of FineWeb-HQ as a continued pretraining phase, pushing toward BPB below 1.0. A full code release, technical writeup, and comprehensive downstream evaluation are coming when that run finishes. The training codebase — including the mHC implementation, Gated GQA, the registry system for combinatorial architecture exploration, and all the Vast.ai provisioning scripts — will be released alongside it.