Translations
English translations of selected posts from Chinese-language AI/ML blogs. Google Translate works in a pinch but frequently mangles LaTeX rendering. Modern language models do better, so it can be nice to have cleaner translations of technical material.
Zhihu (知乎)
Zhihu (知乎, "Do you know?" in classical Chinese) is China's largest Q&A platform. Like Quora, but with a stronger culture of long-form technical writing.
-
Feb 27, 2026
Your DeepSeek mHC Might Not Need the "m"
Replacing the learned doubly stochastic H_res in DeepSeek's manifold Hyper-Connections with a plain identity matrix yields better results, while eliminating Sinkhorn-Knopp iterations entirely.
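For context, Sinkhorn-Knopp is the classical procedure for projecting a positive matrix onto the set of doubly stochastic matrices by alternately normalizing its rows and columns. A minimal NumPy sketch (function name and iteration count are illustrative; this is the textbook algorithm, not DeepSeek's implementation):

```python
import numpy as np

def sinkhorn_knopp(M, n_iters=50):
    """Alternately normalize rows and columns of a positive matrix
    until it is approximately doubly stochastic (all row and column
    sums equal to 1)."""
    M = np.asarray(M, dtype=float)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # make rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make columns sum to 1
    return M

rng = np.random.default_rng(0)
H = sinkhorn_knopp(rng.random((4, 4)) + 0.1)
```

The post's claim is that skipping these iterations and fixing H_res to the identity (which is trivially doubly stochastic) works at least as well.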
Scientific Spaces (科学空间) — Jianlin Su
Scientific Spaces (科学空间) is a Chinese-language AI/ML blog by Jianlin Su (苏剑林). Su is a prolific researcher and writer, and his blog is well-regarded in the Chinese ML community. Original content licensed under CC BY-NC-SA 4.0; translations shared under the same license.
-
Mar 1, 2026
Beyond MuP: 3. Special Cases, Special Treatment
Embedding layers, LM Heads, and RMS Norm parameters each need their own stability analysis. Starting from three stability metrics, we derive the right initialization and steepest descent optimizer for each — and explain why Muon doesn't apply to all matrix parameters.
-
Feb 14, 2026
Beyond MuP: 2. Linear Layers and Steepest Descent
Applying the three stability conditions to linear layers recovers MuP initialization and the Muon optimizer from first principles, via spectral norm analysis and steepest descent.
-
Nov 18, 2025
Muon Optimizer Guide: Quick Start and Key Details
A practical guide to switching from Adam to Muon, covering the four variants, dimension ordering pitfalls, and hyperparameter conversion rules.
-
Oct 20, 2025
Beyond MuP: 1. Three Characteristics of Good Models
What does it mean for a model to be "good"? Three stability conditions — forward, dependency, and update — form the foundation for understanding MuP, Muon, and principled model optimization.
-
Oct 4, 2025
Why Does Linear Attention Need Short Conv?
Short convolutions on K in linear attention transform the TTT training objective from trivial self-prediction to next-token prediction, enabling meaningful memorization of the KV cache.
-
Jun 19, 2025
A Brief History of Linear Attention: From Imitation and Innovation to Feeding Back
Tracing linear attention from its origins as an approximation of softmax attention, through forget gates and test-time training, to DeltaNet and its recent feedback into softmax attention via DeltaFormer and PaTH.
-
Mar 22, 2021
Transformer Upgrade Path: 2. Rotary Position Embedding, the Best of Both Worlds
Deriving Rotary Position Embedding (RoPE) from first principles: an absolute encoding that achieves relative position awareness through complex-number rotation, with long-range decay and compatibility with linear attention.
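The central identity of that derivation, in complex form (a standard statement of the RoPE property, not necessarily the post's exact notation): rotating a query and key by angles proportional to their positions makes their inner product depend only on the relative position $m - n$,

```latex
f(q, m) = q\, e^{i m \theta}, \qquad
\langle f(q, m),\, f(k, n) \rangle
  = \operatorname{Re}\!\left[ f(q, m)\, \overline{f(k, n)} \right]
  = \operatorname{Re}\!\left[ q \bar{k}\, e^{i (m - n) \theta} \right],
```

so an absolute (per-position) encoding yields a relative-position-aware attention score.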
-
Feb 2, 2021
Transformer Position Encodings That Rack Researchers' Brains
A survey of position encoding schemes for Transformers — trainable, sinusoidal, recurrent, multiplicative, relative (classic, XLNet, T5, DeBERTa), CNN-based, complex-valued, and a fusion approach.
Contains the first public sketch of what would become Rotary Position Embedding (RoPE).