Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Zarai Taraqiati Bank Limited ZTBL Job Announcement 2026 2026 Job Advertisement Pakistan

    March 9, 2026

    Origins Reborn [HOT!!!] FRESH Manifestation Offer

    March 9, 2026

    Punjab CM Maryam Nawaz unveils major plan to boost small industries

    March 9, 2026
    Facebook X (Twitter) Instagram
    Monday, March 9
    Trending
    • Zarai Taraqiati Bank Limited ZTBL Job Announcement 2026 2026 Job Advertisement Pakistan
    • Origins Reborn [HOT!!!] FRESH Manifestation Offer
    • Punjab CM Maryam Nawaz unveils major plan to boost small industries
    • Double super-over drama to Allen’s record ton: T20 World Cup moments
    • Oil costs soar over 25% as Iran warfare rattles international markets
    • Rajab Butt announces end of marriage With Emaan Fatima
    • Cardano Founder Says Pentad Faces $40 Million Shortfall
    • How to Beat Test Subject in Slay The Spire 2
    • Positions Vacant at Women University 2026 Job Advertisement Pakistan
    • Most Canadians want former prince Andrew removed from line to throne: poll – National
    Facebook X (Twitter) Instagram Pinterest Vimeo
    The News92The News92
    • Home
    • World
    • National
    • Sports
    • Crypto
    • Travel
    • Lifestyle
    • Jobs
    • Insurance
    • Gaming
    • AI & Tech
    • Health & Fitness
    The News92The News92
    Home - AI & Tech - A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation
    AI & Tech

    A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation

    Naveed AhmadBy Naveed AhmadMarch 9, 2026No Comments5 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    Share
    Facebook Twitter LinkedIn Pinterest Email


    In this tutorial, we build a complete pipeline for single-cell RNA sequencing analysis using Scanpy. We start by installing the required libraries and loading the PBMC 3k dataset, then perform quality control, filtering, and normalization to prepare the data for downstream analysis. We then identify highly variable genes, perform PCA for dimensionality reduction, and construct a neighborhood graph to generate UMAP embeddings and Leiden clusters. Through marker gene discovery and visualization, we explore how clusters correspond to biological cell populations and implement a simple rule-based annotation strategy to infer cell types.

    Copy CodeCopiedUse a different Browser
    import sys
    import subprocess
    import importlib
    
    
    def pip_install(*packages):
       subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])
    
    
    required = [
       "scanpy",
       "anndata",
       "leidenalg",
       "igraph",
       "harmonypy",
       "seaborn"
    ]
    pip_install(*required)
    
    
    import os
    import warnings
    warnings.filterwarnings("ignore")
    
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import scanpy as sc
    import anndata as ad
    
    
    sc.settings.verbosity = 2
    sc.settings.set_figure_params(dpi=110, facecolor="white", frameon=False)
    np.random.seed(42)
    
    
    print("Scanpy version:", sc.__version__)
    
    
    adata = sc.datasets.pbmc3k()
    adata.var_names_make_unique()
    
    
    print("\nInitial AnnData:")
    print(adata)
    
    
    adata.layers["counts"] = adata.X.copy()
    
    
    adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
    
    
    print("\nQC summary:")
    display(
       adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe().T
    )
    

    We install all required dependencies and import the core scientific computing libraries needed for the analysis. We configure Scanpy settings, initialize the environment, and load the PBMC 3k single-cell RNA-seq dataset. We then compute quality-control metrics, including mitochondrial gene percentage, total counts, and the number of detected genes, for each cell.

    Copy CodeCopiedUse a different Browser
    fig, axs = plt.subplots(1, 3, figsize=(15, 4))
    sc.pl.violin(adata, ["n_genes_by_counts"], jitter=0.4, ax=axs[0], show=False)
    sc.pl.violin(adata, ["total_counts"], jitter=0.4, ax=axs[1], show=False)
    sc.pl.violin(adata, ["pct_counts_mt"], jitter=0.4, ax=axs[2], show=False)
    plt.tight_layout()
    plt.show()
    
    
    sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", color="pct_counts_mt")
    
    
    adata = adata[adata.obs["n_genes_by_counts"] >= 200].copy()
    adata = adata[adata.obs["n_genes_by_counts"] <= 5000].copy()
    adata = adata[adata.obs["pct_counts_mt"] < 10].copy()
    sc.pp.filter_genes(adata, min_cells=3)
    
    
    print("\nAfter filtering:")
    print(adata)
    
    
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    adata.raw = adata.copy()
    
    
    sc.pp.highly_variable_genes(
       adata,
       flavor="seurat",
       min_mean=0.0125,
       max_mean=3,
       min_disp=0.5
    )
    
    
    print("\nHighly variable genes selected:", int(adata.var["highly_variable"].sum()))
    
    
    sc.pl.highly_variable_genes(adata)
    
    
    adata = adata[:, adata.var["highly_variable"]].copy()

    We visualize quality control metrics using plots to check the distribution of gene counts and mitochondrial content. We apply filtering steps to remove low-quality cells and genes that do not meet basic expression thresholds. We then normalize the data, apply a log transformation, and identify highly variable genes for downstream analysis.

    Copy CodeCopiedUse a different Browser
    sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
    sc.pp.scale(adata, max_value=10)
    
    
    sc.tl.pca(adata, svd_solver="arpack")
    sc.pl.pca_variance_ratio(adata, log=True)
    sc.pl.pca(adata, color=None)
    
    
    sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30, metric="euclidean")
    sc.tl.umap(adata, min_dist=0.35, spread=1.0)
    sc.tl.leiden(adata, resolution=0.6, key_added="leiden")
    
    
    print("\nCluster counts:")
    display(adata.obs["leiden"].value_counts().sort_index().rename("cells_per_cluster").to_frame())
    
    
    sc.pl.umap(adata, color=["leiden"], legend_loc="on data", title="PBMC 3k - Leiden clusters")
    
    
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
    sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
    
    
    marker_table = sc.get.rank_genes_groups_df(adata, group=None)
    print("\nTop marker rows:")
    display(marker_table.head(20))

    We regress out technical confounders and scale the dataset to prepare it for dimensionality reduction. We perform principal component analysis to capture the dataset’s major variance structure. We then construct the neighborhood graph, compute UMAP embeddings, perform Leiden clustering, and identify marker genes for each cluster.

    Copy CodeCopiedUse a different Browser
    top_markers_per_cluster = (
       marker_table.groupby("group")
       .head(10)
       .loc[:, ["group", "names", "logfoldchanges", "pvals_adj"]]
       .reset_index(drop=True)
    )
    print("\nTop 10 markers per cluster:")
    display(top_markers_per_cluster)
    
    
    candidate_markers = [
       "IL7R", "LTB", "MALAT1", "CCR7",
       "NKG7", "GNLY", "PRF1",
       "MS4A1", "CD79A", "CD79B",
       "LYZ", "S100A8", "FCER1A", "CST3",
       "PPBP", "FCGR3A", "LGALS3", "CTSS",
       "CD3D", "TRBC1", "TRAC"
    ]
    candidate_markers = [g for g in candidate_markers if g in adata.var_names]
    
    
    if candidate_markers:
       sc.pl.dotplot(
           adata,
           var_names=candidate_markers,
           groupby="leiden",
           standard_scale="var",
           dendrogram=True
       )
       sc.pl.matrixplot(
           adata,
           var_names=candidate_markers,
           groupby="leiden",
           standard_scale="var",
           dendrogram=True
       )
    
    
    cluster_marker_reference = {
       "T_cells": ["IL7R", "LTB", "CCR7", "CD3D", "TRBC1", "TRAC"],
       "NK_cells": ["NKG7", "GNLY", "PRF1"],
       "B_cells": ["MS4A1", "CD79A", "CD79B"],
       "Monocytes": ["LYZ", "FCGR3A", "LGALS3", "CTSS", "S100A8", "CST3"],
       "Dendritic_cells": ["FCER1A", "CST3"],
       "Platelets": ["PPBP"]
    }
    

    We examine the most significant marker genes detected for each cluster and summarize the top markers. We visualize gene expression patterns across clusters using dot plots and matrix plots for known immune cell markers. We also define a reference mapping of marker genes associated with major immune cell types for later annotation.

    Copy CodeCopiedUse a different Browser
    available_reference = {
       celltype: [g for g in genes if g in adata.var_names]
       for celltype, genes in cluster_marker_reference.items()
    }
    available_reference = {k: v for k, v in available_reference.items() if len(v) > 0}
    
    
    for celltype, genes in available_reference.items():
       sc.tl.score_genes(adata, gene_list=genes, score_name=f"{celltype}_score", use_raw=False)
    
    
    score_cols = [f"{ct}_score" for ct in available_reference.keys()]
    cluster_scores = adata.obs.groupby("leiden")[score_cols].mean()
    display(cluster_scores)
    
    
    cluster_to_celltype = {}
    for cluster in cluster_scores.index:
       best = cluster_scores.loc[cluster].idxmax().replace("_score", "")
       cluster_to_celltype[cluster] = best
    
    
    adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).astype("category")
    
    
    print("\nCluster to cell-type mapping:")
    display(pd.DataFrame.from_dict(cluster_to_celltype, orient="index", columns=["assigned_cell_type"]))
    
    
    sc.pl.umap(
       adata,
       color=["leiden", "cell_type"],
       legend_loc="on data",
       wspace=0.45
    )
    
    
    sc.tl.rank_genes_groups(adata, groupby="cell_type", method="wilcoxon")
    sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)
    
    
    celltype_markers = sc.get.rank_genes_groups_df(adata, group=None)
    print("\nTop markers by annotated cell type:")
    display(
       celltype_markers.groupby("group").head(8)[["group", "names", "logfoldchanges", "pvals_adj"]]
    )
    
    
    cluster_prop = (
       adata.obs["cell_type"]
       .value_counts(normalize=True)
       .mul(100)
       .round(2)
       .rename("percent")
       .to_frame()
    )
    print("\nCell-type proportions (%):")
    display(cluster_prop)
    
    
    plt.figure(figsize=(7, 4))
    cluster_prop["percent"].sort_values().plot(kind="barh")
    plt.xlabel("Percent of cells")
    plt.ylabel("Cell type")
    plt.title("Estimated cell-type composition")
    plt.tight_layout()
    plt.show()
    
    
    output_dir = "scanpy_pbmc3k_outputs"
    os.makedirs(output_dir, exist_ok=True)
    
    
    adata.write(os.path.join(output_dir, "pbmc3k_scanpy_advanced.h5ad"))
    marker_table.to_csv(os.path.join(output_dir, "cluster_markers.csv"), index=False)
    celltype_markers.to_csv(os.path.join(output_dir, "celltype_markers.csv"), index=False)
    cluster_scores.to_csv(os.path.join(output_dir, "cluster_score_matrix.csv"))
    
    
    print(f"\nSaved outputs to: {output_dir}")
    print("Files:")
    for f in sorted(os.listdir(output_dir)):
       print(" -", f)
    
    
    summary = {
       "n_cells_final": int(adata.n_obs),
       "n_genes_final": int(adata.n_vars),
       "n_clusters": int(adata.obs["leiden"].nunique()),
       "clusters": sorted(adata.obs["leiden"].unique().tolist()),
       "cell_types": sorted(adata.obs["cell_type"].unique().tolist()),
    }
    print("\nAnalysis summary:")
    for k, v in summary.items():
       print(f"{k}: {v}")

    We score each cell using known marker gene sets and assign probable cell types to clusters based on expression patterns. We visualize the annotated cell types on the UMAP embedding and perform differential gene expression analysis across the predicted cell populations. Also, we compute cell-type proportions, generate summary visualizations, and save the processed dataset and analysis outputs for further research.

    In conclusion, we developed a full end-to-end workflow for analyzing single-cell transcriptomic data using Scanpy. We performed preprocessing, clustering, marker-gene analysis, and cell-type annotation, and visualized the data structure using UMAP and gene expression plots. By saving the processed AnnData object and analysis outputs, we created a reusable dataset for further biological interpretation and advanced modeling. This workflow demonstrates how Scanpy enables scalable, reproducible single-cell analysis through a structured, modular Python pipeline.


    Check out the Full Codes here. Also, feel free to follow us on Twitter and don’t forget to join our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

    The post A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation appeared first on MarkTechPost.



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleHong Kong, China shares tumble as Iran battle knocks Asian markets
    Next Article Tennis-Gauff retires from Indian Wells match as a consequence of arm ache
    Naveed Ahmad
    • Website
    • Tumblr

    Related Posts

    AI & Tech

    Ring’s Jamie Siminoff has been trying to calm privacy fears since the Super Bowl, but his answers may not help

    March 9, 2026
    AI & Tech

    Andrej Karpathy Open-Sources ‘Autoresearch’: A 630-Line Python Tool Letting AI Agents Run Autonomous ML Experiments on Single GPUs

    March 9, 2026
    AI & Tech

    Palmer Luckey’s retro gaming startup ModRetro reportedly seeks funding at $1B valuation

    March 9, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Demo
    Top Posts

    Zarai Taraqiati Bank Limited ZTBL Job Announcement 2026 2026 Job Advertisement Pakistan

    March 9, 20260 Views

    Origins Reborn [HOT!!!] FRESH Manifestation Offer

    March 9, 20260 Views

    Punjab CM Maryam Nawaz unveils major plan to boost small industries

    March 9, 20260 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram
    Latest Reviews

    Subscribe to Updates

    Get the latest tech news from FooBar about tech, design and biz.

    Demo
    Most Popular

    Zarai Taraqiati Bank Limited ZTBL Job Announcement 2026 2026 Job Advertisement Pakistan

    March 9, 20260 Views

    Origins Reborn [HOT!!!] FRESH Manifestation Offer

    March 9, 20260 Views

    Punjab CM Maryam Nawaz unveils major plan to boost small industries

    March 9, 20260 Views
    Our Picks

    Zarai Taraqiati Bank Limited ZTBL Job Announcement 2026 2026 Job Advertisement Pakistan

    March 9, 2026

    Origins Reborn [HOT!!!] FRESH Manifestation Offer

    March 9, 2026

    Punjab CM Maryam Nawaz unveils major plan to boost small industries

    March 9, 2026

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    Facebook X (Twitter) Instagram Pinterest
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions
    • Advertise
    • Disclaimer
    © 2026 TheNews92.com. All Rights Reserved. Unauthorized reproduction or redistribution of content is strictly prohibited.

    Type above and press Enter to search. Press Esc to cancel.