What's Tokenization Drift and How to Fix It?
```python
import numpy as np
import matplotlib.pyplot as plt

# `pairs` and `tokenizer` are assumed to be defined earlier in the article.
words = [p[1] for p in pairs]
ids_ws = [tokenizer.encode(" " + w, add_special_tokens=False)[0] for w in words]
ids_nws = [tokenizer.encode(w, add_special_tokens=False)[0] for w in words]
delta = [abs(a - b) for a, b in zip(ids_ws, ids_nws)]

x = np.arange(len(words))
width = 0.35
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.patch.set_facecolor("#FAFAF8")
# Left: side-by-side token…
```
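To see the effect the plot above measures without loading a real tokenizer, here is a minimal, self-contained sketch. The vocabulary and `encode` helper are hypothetical stand-ins, mimicking GPT-2-style byte-level BPE, where `" word"` (with a leading space) and `"word"` are distinct tokens with different IDs:

```python
# Hypothetical toy vocabulary: leading-space variants get their own IDs,
# as in byte-level BPE tokenizers where " hello" and "hello" differ.
vocab = {"hello": 101, " hello": 202, "world": 103, " world": 204}

def encode(text):
    # Toy one-entry lookup standing in for tokenizer.encode (illustration only).
    return [vocab[text]]

words = ["hello", "world"]
ids_ws = [encode(" " + w)[0] for w in words]   # IDs with a leading space
ids_nws = [encode(w)[0] for w in words]        # IDs without
delta = [abs(a - b) for a, b in zip(ids_ws, ids_nws)]
print(delta)  # nonzero deltas expose the whitespace-dependent drift
```

Any nonzero entry in `delta` means the same surface word maps to different token IDs depending on whether it follows whitespace, which is exactly the drift the bar chart visualizes.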
