Constructing a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken


import matplotlib.pyplot as plt

# 2x2 overview of the sampled metadata: languages, extensions, path depth, repos.
fig, ax = plt.subplots(2, 2, figsize=(14, 9))

# Top-left: most frequent languages (reversed so the largest bar is on top).
lang_counts.head(12).iloc[::-1].plot.barh(ax=ax[0, 0], color="#76b900")
ax[0, 0].set_title("Top 12 languages (sample)"); ax[0, 0].set_xlabel("files")

# Top-right: most frequent file extensions.
df["ext"].value_counts().head(12).iloc[::-1].plot.barh(ax=ax[0, 1], color="#5b8def")
ax[0, 1].set_title("Top 12 file extensions (sample)"); ax[0, 1].set_xlabel("files")

# Bottom-left: directory depth, with everything deeper than 12 clipped into the last bin.
df["depth"].clip(upper=12).plot.hist(bins=range(0, 14), ax=ax[1, 0], color="#f4a261", edgecolor="white")
ax[1, 0].set_title("Directory nesting depth"); ax[1, 0].set_xlabel("'/' count in path")

# Bottom-right: repositories contributing the most files.
(df["repo"].value_counts().head(10).iloc[::-1]
    .plot.barh(ax=ax[1, 1], color="#9b5de5"))
ax[1, 1].set_title("Most common repos (sample)"); ax[1, 1].set_xlabel("files")

plt.tight_layout(); plt.show()
…
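The plotting excerpt relies on `df["ext"]`, `df["depth"]`, `df["repo"]`, and `lang_counts`, none of which are defined in the portion shown here. The following is a minimal sketch of how such columns could be derived, assuming the metadata exposes `repo`, `path`, and `language` fields. Those column names, the sample rows, and the helper `add_path_features` are hypothetical and may not match the actual dataset schema or the original tutorial's code.

```python
import posixpath

import pandas as pd

# Hypothetical sample rows; the real metadata schema may differ.
sample = pd.DataFrame({
    "repo": ["org/a", "org/a", "org/b"],
    "path": ["src/main.py", "README.md", "lib/core/util.rs"],
    "language": ["Python", "Markdown", "Rust"],
})


def add_path_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with `ext` and `depth` columns derived from `path`."""
    out = frame.copy()
    # File extension, lower-cased; an empty string when the file has none.
    out["ext"] = out["path"].map(lambda p: posixpath.splitext(p)[1].lower())
    # Nesting depth measured as the number of '/' separators in the path.
    out["depth"] = out["path"].str.count("/")
    return out


df = add_path_features(sample)
# Per-language file counts, sorted from most to least frequent.
lang_counts = df["language"].value_counts()
```

Measuring depth by counting `/` matches the x-axis label in the histogram, so a file at the repository root gets depth 0.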
