The Frequency Bias of Stochastic Gradient Descent (SGD) and How Adam Fixes It

    BG = "#fafaf8"
    DARK = "#1a1a1a"
    # Color ramp: blue for common tokens, red for rare
    TOKEN_COLORS = ["#1a5276", "#2471a3", "#5dade2", "#e67e22", "#c0392b", "#7d2a2a"]

    steps = np.arange(N_STEPS)
    fig = plt.figure(figsize=(16, 11), facecolor=BG)
    fig.suptitle(
        "SGD vs. Adam on Rare Tokens — Frequency Bias and Variance Normalization",
        fontsize=14, fontweight="bold", color=DARK, y=0.99
    )
    gs = gridspec.GridSpec(2, 3, figure=fig,…
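To make the effect named in the figure title concrete, here is a minimal sketch that is not part of the original script. It assumes a toy setup: two scalar parameters, one for a token seen every 2 steps and one for a token seen every 200 steps, each receiving a gradient of 1 when its token appears and 0 otherwise. The values of `N_STEPS`, `PERIODS`, `LR`, and the Adam constants are illustrative assumptions, and `total_displacement` is a hypothetical helper. The sketch compares how far plain SGD and a textbook Adam update move each parameter in total.

```python
import numpy as np

# Assumed toy values; the original script's settings are not shown.
N_STEPS = 2000
PERIODS = np.array([2, 200])  # common token every 2 steps, rare every 200
LR = 0.01
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8


def total_displacement():
    """Return the total distance each parameter moves under SGD and Adam."""
    sgd_move = np.zeros(2)
    adam_move = np.zeros(2)
    m = np.zeros(2)  # Adam first-moment estimate
    v = np.zeros(2)  # Adam second-moment estimate
    for t in range(1, N_STEPS + 1):
        # Gradient is 1 on steps where the token appears, 0 otherwise.
        grad = ((t - 1) % PERIODS == 0).astype(float)

        # SGD: the step is proportional to the raw gradient, so total
        # movement is proportional to how often the token appears.
        sgd_move += LR * grad

        # Adam: bias-corrected moment estimates, then a step normalized
        # by the root of the second moment.
        m = BETA1 * m + (1 - BETA1) * grad
        v = BETA2 * v + (1 - BETA2) * grad**2
        m_hat = m / (1 - BETA1**t)
        v_hat = v / (1 - BETA2**t)
        adam_move += LR * m_hat / (np.sqrt(v_hat) + EPS)
    return sgd_move, adam_move


sgd_move, adam_move = total_displacement()
sgd_ratio = sgd_move[0] / sgd_move[1]
adam_ratio = adam_move[0] / adam_move[1]
print(f"SGD  common/rare movement ratio: {sgd_ratio:.1f}")
print(f"Adam common/rare movement ratio: {adam_ratio:.1f}")
```

In this toy, the SGD ratio equals the frequency ratio of the two tokens, while Adam's normalization by the second-moment estimate narrows the gap without removing it entirely. A rough heuristic is that for a token appearing with frequency p, the first moment scales like p and the root of the second moment like sqrt(p), so the normalized step scales like sqrt(p) rather than p.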
