
    How to Align Large Language Models with Human Preferences Using Direct Preference Optimization, QLoRA, and UltraFeedback

    By Naveed Ahmad · February 13, 2026


    In this tutorial, we implement an end-to-end Direct Preference Optimization (DPO) workflow to align a large language model with human preferences without using a reward model. We combine TRL’s DPOTrainer with QLoRA and PEFT to make preference-based alignment feasible on a single Colab GPU. We train directly on the UltraFeedback binarized dataset, where each prompt has a chosen and a rejected response, allowing us to shape model behavior and style rather than just factual recall.
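    For reference, the objective we optimize below is the standard DPO preference loss, where \pi_\theta is the model being trained, \pi_{\mathrm{ref}} is the frozen reference model, \sigma is the logistic sigmoid, and \beta is the BETA hyperparameter defined in the next cell:

    \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

    Here y_w is the chosen response and y_l is the rejected one; widening the gap between their log-probability ratios is what pushes the model toward the preferred style.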

    import os
    import math
    import random
    import torch
    
    
    !pip -q install -U "transformers>=4.45.0" "datasets>=2.19.0" "accelerate>=0.33.0" "trl>=0.27.0" "peft>=0.12.0" "bitsandbytes>=0.43.0" "sentencepiece" "evaluate"
    
    
    SEED = 42
    random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    
    
    MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2-0.5B-Instruct")
    DATASET_NAME = "HuggingFaceH4/ultrafeedback_binarized"
    OUTPUT_DIR = "dpo_ultrafeedback_qlora"
    
    
    MAX_TRAIN_SAMPLES = 8000
    MAX_EVAL_SAMPLES  = 200
    MAX_PROMPT_LEN = 512
    MAX_COMPLETION_LEN = 256
    
    
    BETA = 0.1
    LR = 2e-4
    EPOCHS = 1
    PER_DEVICE_BS = 2
    GRAD_ACCUM = 8
    
    
    LOGGING_STEPS = 10
    SAVE_STEPS = 200
    
    
    machine = "cuda" if torch.cuda.is_available() else "cpu"
    print("Machine:", machine, "GPU:", torch.cuda.get_device_name(0) if machine == "cuda" else "None")

    We set up the execution environment and install all required libraries for DPO, PEFT, and quantized training. We define all global hyperparameters, dataset limits, and optimization settings in a single place. We also seed the random number generators and confirm GPU availability to ensure reproducible runs.
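    As a quick sanity check on these settings (a sketch using only the variables defined above), the number of preference pairs the optimizer actually sees per update is the per-device batch size times the gradient accumulation steps:

    # Effective preference-pair batch size per optimizer step
    effective_bs = PER_DEVICE_BS * GRAD_ACCUM
    print("Effective batch size:", effective_bs)  # 2 * 8 = 16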

    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    
    
    bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_use_double_quant=True,
       bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
    )
    
    
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
    if tokenizer.pad_token is None:
       tokenizer.pad_token = tokenizer.eos_token
    
    
    model = AutoModelForCausalLM.from_pretrained(
       MODEL_NAME,
       quantization_config=bnb_config,
       torch_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
       device_map="auto",
    )
    model.config.use_cache = False

    We load the tokenizer and the base language model using 4-bit quantization to minimize memory usage. We configure bitsandbytes to enable efficient QLoRA-style computation on Colab GPUs. We prepare the model for training by disabling cache usage to avoid incompatibilities during backpropagation.
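    As an optional check that is not part of the original notebook, we can confirm how small the 4-bit model actually is by querying the built-in memory-footprint helper on the quantized model loaded above:

    # Optional: report the quantized base model's memory footprint and compute dtype
    print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
    print("Compute dtype:", bnb_config.bnb_4bit_compute_dtype)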

    from peft import LoraConfig, get_peft_model
    
    
    lora_config = LoraConfig(
       r=16,
       lora_alpha=32,
       lora_dropout=0.05,
       bias="none",
       task_type="CAUSAL_LM",
       target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    )
    
    
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    
    
    model.gradient_checkpointing_enable()

    We attach LoRA adapters to the model’s attention and feed-forward projection layers. We restrict training to a small set of parameters to make fine-tuning efficient and stable. We enable gradient checkpointing to further reduce GPU memory consumption during training.
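    If you want to see the numbers behind print_trainable_parameters(), a minimal sketch (assuming the PEFT-wrapped model from the cell above) looks like this:

    # Count trainable vs. total parameters of the PEFT-wrapped model
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")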

    from datasets import load_dataset
    
    
    ds = load_dataset(DATASET_NAME)
    
    
    train_split = "train_prefs" if "train_prefs" in ds else ("practice" if "practice" in ds else listing(ds.keys())[0])
    test_split  = "test_prefs" if "test_prefs" in ds else ("check" if "check" in ds else None)
    
    
    train_raw = ds[train_split]
    test_raw = ds[test_split] if test_split is not None else None
    
    
    print("Splits:", ds.keys())
    print("Utilizing practice break up:", train_split, "measurement:", len(train_raw))
    if test_raw is just not None:
       print("Utilizing check break up:", test_split, "measurement:", len(test_raw))
    
    
    def _extract_last_user_and_assistant(messages):
       last_user_idx = None
       last_asst_idx = None
       for i, m in enumerate(messages):
           if m.get("role") == "user":
               last_user_idx = i
           if m.get("role") == "assistant":
               last_asst_idx = i


       if last_user_idx is None or last_asst_idx is None:
           return None, None


       prompt_messages = messages[: last_user_idx + 1]
       assistant_text = messages[last_asst_idx].get("content", "")
       return prompt_messages, assistant_text
    
    
    def format_example(ex):
       chosen_msgs = ex["chosen"]
       rejected_msgs = ex["rejected"]
    
    
       prompt_msgs_c, chosen_text = _extract_last_user_and_assistant(chosen_msgs)
       prompt_msgs_r, rejected_text = _extract_last_user_and_assistant(rejected_msgs)
    
    
       if prompt_msgs_c is None or prompt_msgs_r is None:
           return {"prompt": None, "chosen": None, "rejected": None}
    
    
       prompt_text = tokenizer.apply_chat_template(
           prompt_msgs_c, tokenize=False, add_generation_prompt=True
       )
    
    
       return {
           "prompt": prompt_text,
           "chosen": chosen_text.strip(),
           "rejected": rejected_text.strip(),
       }
    
    
    train_raw = train_raw.shuffle(seed=SEED)
    train_raw = train_raw.select(range(min(MAX_TRAIN_SAMPLES, len(train_raw))))
    
    
    train_ds = train_raw.map(format_example, remove_columns=train_raw.column_names)
    train_ds = train_ds.filter(lambda x: x["prompt"] is not None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)
    
    
    if test_raw is not None:
       test_raw = test_raw.shuffle(seed=SEED)
       test_raw = test_raw.select(range(min(MAX_EVAL_SAMPLES, len(test_raw))))
       eval_ds = test_raw.map(format_example, remove_columns=test_raw.column_names)
       eval_ds = eval_ds.filter(lambda x: x["prompt"] is not None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)
    else:
       eval_ds = None
    
    
    print("Prepare examples:", len(train_ds), "Eval examples:", len(eval_ds) if eval_ds is just not None else 0)
    print(train_ds[0])

    We load the UltraFeedback binarized dataset and dynamically select the appropriate train and test splits. We extract the prompt, chosen, and rejected responses from multi-turn conversations and format them using the model’s chat template. We shuffle, filter, and subsample the data to create clean and efficient training and evaluation datasets.
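    To sanity-check that MAX_PROMPT_LEN and MAX_COMPLETION_LEN are generous enough for this dataset, an optional sketch (not in the original code) tokenizes a small sample and compares observed lengths against the configured limits:

    # Optional: compare token lengths on a small sample against the truncation limits
    sample = train_ds.select(range(min(500, len(train_ds))))
    prompt_lens = [len(tokenizer(x["prompt"]).input_ids) for x in sample]
    chosen_lens = [len(tokenizer(x["chosen"]).input_ids) for x in sample]
    print("Longest prompt (tokens):", max(prompt_lens), "| limit:", MAX_PROMPT_LEN)
    print("Longest chosen response (tokens):", max(chosen_lens), "| limit:", MAX_COMPLETION_LEN)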

    from trl import DPOTrainer, DPOConfig
    
    
    use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
    use_fp16 = torch.cuda.is_available() and not use_bf16
    
    
    training_args = DPOConfig(
       output_dir=OUTPUT_DIR,
       beta=BETA,
       per_device_train_batch_size=PER_DEVICE_BS,
       gradient_accumulation_steps=GRAD_ACCUM,
       num_train_epochs=EPOCHS,
       learning_rate=LR,
       lr_scheduler_type="cosine",
       warmup_ratio=0.05,
       logging_steps=LOGGING_STEPS,
       save_steps=SAVE_STEPS,
       save_total_limit=2,
       bf16=use_bf16,
       fp16=use_fp16,
       optim="paged_adamw_8bit",
       max_length=MAX_PROMPT_LEN + MAX_COMPLETION_LEN,
       max_prompt_length=MAX_PROMPT_LEN,
       report_to="none",
    )
    
    
    trainer = DPOTrainer(
       model=model,
       args=training_args,
       processing_class=tokenizer,
       train_dataset=train_ds,
       eval_dataset=eval_ds,
    )
    
    
    trainer.train()
    
    
    trainer.save_model(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    
    
    print("Saved to:", OUTPUT_DIR)

    We configure the DPO training objective with carefully chosen optimization and scheduling parameters. We initialize the DPOTrainer to directly optimize preference pairs without a reward model. We train the LoRA adapters and save the aligned model artifacts for later inference.
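    Once training finishes, TRL also records DPO-specific metrics such as the reward margin between chosen and rejected responses and how often the chosen response wins. A small sketch for skimming them (the key names assume a recent TRL release):

    # Optional: skim DPO-specific metrics recorded in the trainer's log history
    for entry in trainer.state.log_history:
       if "rewards/margins" in entry and "rewards/accuracies" in entry:
           print(f"step {entry.get('step')}: "
                 f"margin={entry['rewards/margins']:.3f}, "
                 f"accuracy={entry['rewards/accuracies']:.3f}")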

    from peft import PeftModel
    from transformers import pipeline
    
    
    def generate_text(model_for_gen, prompt, max_new_tokens=180):
       model_for_gen.eval()
       inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_PROMPT_LEN).to(model_for_gen.device)
       with torch.no_grad():
           out = model_for_gen.generate(
               **inputs,
               max_new_tokens=max_new_tokens,
               do_sample=True,
               temperature=0.7,
               top_p=0.95,
               pad_token_id=tokenizer.eos_token_id,
           )
       return tokenizer.decode(out[0], skip_special_tokens=True)
    
    
    base_model = AutoModelForCausalLM.from_pretrained(
       MODEL_NAME,
       quantization_config=bnb_config,
       torch_dtype=torch.bfloat16 if use_bf16 else torch.float16,
       device_map="auto",
    )
    base_model.config.use_cache = True
    
    
    dpo_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
    dpo_model.config.use_cache = True
    
    
    sample_pool = eval_ds if eval_ds is not None and len(eval_ds) > 0 else train_ds
    samples = [sample_pool[i] for i in random.sample(range(len(sample_pool)), k=min(3, len(sample_pool)))]
    
    
    for i, ex in enumerate(samples, 1):
       immediate = ex["prompt"]
       print("n" + "="*90)
       print(f"Pattern #{i}")
       print("- Immediate:n", immediate)
    
    
       base_out = generate_text(base_model, prompt)
       dpo_out  = generate_text(dpo_model, prompt)
    
    
       print("n- Base mannequin output:n", base_out)
       print("n- DPO (LoRA) output:n", dpo_out)
    
    
    print("nDone.")

    We reload the base model and attach the trained DPO LoRA adapters for inference. We generate responses from both the original and the aligned model using the same prompts for comparison. We qualitatively evaluate how preference optimization changes model behavior by inspecting the outputs side by side.
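    If you later want a standalone checkpoint that does not require PEFT at inference time, one common follow-up (an assumption, not part of the original workflow) is to reload the base model unquantized in fp16 and merge the LoRA adapters into it:

    # Optional deployment sketch: merge the trained LoRA adapters into an fp16 copy of the base model
    merge_base = AutoModelForCausalLM.from_pretrained(
       MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
    )
    merged = PeftModel.from_pretrained(merge_base, OUTPUT_DIR).merge_and_unload()
    merged.save_pretrained(OUTPUT_DIR + "_merged")
    tokenizer.save_pretrained(OUTPUT_DIR + "_merged")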

    In conclusion, we demonstrated how DPO provides a stable and efficient alternative to RLHF by directly optimizing preference pairs with a simple, well-defined objective. We showed that parameter-efficient fine-tuning with LoRA and 4-bit quantization enables practical experimentation even under tight compute constraints. We qualitatively validated alignment by comparing generations before and after DPO training, confirming that the model learns to prefer higher-quality responses while remaining lightweight and deployable.


    Check out the FULL CODES here.




