A Coding Implementation to Run Qwen3.5 Reasoning Fashions Distilled with Claude-Model Considering Utilizing GGUF and 4-Bit Quantization

A Coding Implementation to Run Qwen3.5 Reasoning Fashions Distilled with Claude-Model Considering Utilizing GGUF and 4-Bit Quantization


On this tutorial, we work immediately with Qwen3.5 fashions distilled with Claude-style reasoning and arrange a Colab pipeline that lets us swap between a 27B GGUF variant and a light-weight 2B 4-bit model with a single flag. We begin by validating GPU availability, then conditionally set up both llama.cpp or transformers with bitsandbytes, relying on the chosen path. Each branches are unified by means of shared generate_fn and stream_fn interfaces, guaranteeing constant inference throughout backends. We additionally implement a ChatSession class for multi-turn interplay and construct utilities to parse traces, permitting us to explicitly separate reasoning from ultimate outputs throughout execution.

MODEL_PATH = "2B_HF"


import torch


if not torch.cuda.is_available():
   increase RuntimeError(
       "❌ No GPU! Go to Runtime → Change runtime sort → T4 GPU."
   )


gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"✅ GPU: {gpu_name} — {vram_gb:.1f} GB VRAM")


import subprocess, sys, os, re, time


generate_fn = None
stream_fn = None

We initialize the execution by setting the mannequin path flag and checking whether or not a GPU is out there on the system. We retrieve and print the GPU identify together with out there VRAM to make sure the surroundings meets the necessities. We additionally import all required base libraries and outline placeholders for the unified era capabilities that will likely be assigned later.

if MODEL_PATH == "27B_GGUF":
   print("n📦 Putting in llama-cpp-python with CUDA (takes 3-5 min)...")
   env = os.environ.copy()
   env["CMAKE_ARGS"] = "-DGGML_CUDA=on"
   subprocess.check_call(
       [sys.executable, "-m", "pip", "install", "-q", "llama-cpp-python", "huggingface_hub"],
       env=env,
   )
   print("✅ Put in.n")


   from huggingface_hub import hf_hub_download
   from llama_cpp import Llama


   GGUF_REPO = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
   GGUF_FILE = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf"


   print(f"⏳ Downloading {GGUF_FILE} (~16.5 GB)... seize a espresso ☕")
   model_path = hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE)
   print(f"✅ Downloaded: {model_path}n")


   print("⏳ Loading into llama.cpp (GPU offload)...")
   llm = Llama(
       model_path=model_path,
       n_ctx=8192,
       n_gpu_layers=40,
       n_threads=4,
       verbose=False,
   )
   print("✅ 27B GGUF mannequin loaded!n")


   def generate_fn(
       immediate, system_prompt="You're a useful assistant. Assume step-by-step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95, **kwargs
   ):
       output = llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
       )
       return output["choices"][0]["message"]["content"]


   def stream_fn(
       immediate, system_prompt="You're a useful assistant. Assume step-by-step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       print("⏳ Streaming output:n")
       for chunk in llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
           stream=True,
       ):
           delta = chunk["choices"][0].get("delta", {})
           textual content = delta.get("content material", "")
           if textual content:
               print(textual content, finish="", flush=True)
       print()


   class ChatSession:
       def __init__(self, system_prompt="You're a useful assistant. Assume step-by-step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"position": "consumer", "content material": user_message})
           output = llm.create_chat_completion(
               messages=self.messages, max_tokens=2048,
               temperature=temperature, top_p=0.95,
           )
           resp = output["choices"][0]["message"]["content"]
           self.messages.append({"position": "assistant", "content material": resp})
           return resp

We deal with the 27B GGUF path by putting in llama.cpp with CUDA assist and downloading the Qwen3.5 27B distilled mannequin from Hugging Face. We load the mannequin with GPU offloading and outline a standardized generate_fn and stream_fn for inference and streaming outputs. We additionally implement a ChatSession class to take care of dialog historical past for multi-turn interactions.

elif MODEL_PATH == "2B_HF":
   print("n📦 Putting in transformers + bitsandbytes...")
   subprocess.check_call([
       sys.executable, "-m", "pip", "install", "-q",
       "transformers @ git+https://github.com/huggingface/transformers.git@main",
       "accelerate", "bitsandbytes", "sentencepiece", "protobuf",
   ])
   print("✅ Put in.n")


   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer


   HF_MODEL_ID = "Jackrong/Qwen3.5-2B-Claude-4.6-Opus-Reasoning-Distilled"


   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=True,
   )


   print(f"⏳ Loading {HF_MODEL_ID} in 4-bit...")
   tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID, trust_remote_code=True)
   mannequin = AutoModelForCausalLM.from_pretrained(
       HF_MODEL_ID,
       quantization_config=bnb_config,
       device_map="auto",
       trust_remote_code=True,
       torch_dtype=torch.bfloat16,
   )
   print(f"✅ Mannequin loaded! Reminiscence: {mannequin.get_memory_footprint() / 1e9:.2f} GBn")


   def generate_fn(
       immediate, system_prompt="You're a useful assistant. Assume step-by-step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
       repetition_penalty=1.05, do_sample=True, **kwargs
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       textual content = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(textual content, return_tensors="pt").to(mannequin.gadget)
       with torch.no_grad():
           output_ids = mannequin.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, repetition_penalty=repetition_penalty, do_sample=do_sample,
           )
       generated = output_ids[0][inputs["input_ids"].form[1]:]
       return tokenizer.decode(generated, skip_special_tokens=True)


   def stream_fn(
       immediate, system_prompt="You're a useful assistant. Assume step-by-step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       textual content = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(textual content, return_tensors="pt").to(mannequin.gadget)
       streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
       print("⏳ Streaming output:n")
       with torch.no_grad():
           mannequin.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, do_sample=True, streamer=streamer,
           )


   class ChatSession:
       def __init__(self, system_prompt="You're a useful assistant. Assume step-by-step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"position": "consumer", "content material": user_message})
           textual content = tokenizer.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
           inputs = tokenizer(textual content, return_tensors="pt").to(mannequin.gadget)
           with torch.no_grad():
               output_ids = mannequin.generate(
                   **inputs, max_new_tokens=2048, temperature=temperature, top_p=0.95, do_sample=True,
               )
           generated = output_ids[0][inputs["input_ids"].form[1]:]
           resp = tokenizer.decode(generated, skip_special_tokens=True)
           self.messages.append({"position": "assistant", "content material": resp})
           return resp
else:
   increase ValueError("MODEL_PATH have to be '27B_GGUF' or '2B_HF'")

We implement the light-weight 2B path utilizing transformers with 4-bit quantization by means of bitsandbytes. We load the Qwen3.5 2B distilled mannequin effectively onto the GPU and configure era parameters for managed sampling. We once more outline unified era, streaming, and chat session logic in order that each mannequin paths behave identically throughout execution.

def parse_thinking(response: str) -> tuple:
   m = re.search(r"(.*?)", response, re.DOTALL)
   if m:
       return m.group(1).strip(), response[m.end():].strip()
   return "", response.strip()




def display_response(response: str):
   pondering, reply = parse_thinking(response)
   if pondering:
       print("🧠 THINKING:")
       print("-" * 60)
       print(pondering[:1500] + ("n... [truncated]" if len(pondering) > 1500 else ""))
       print("-" * 60)
   print("n💬 ANSWER:")
   print(reply)




print("✅ All helpers prepared. Working checks...n")

We outline helper capabilities to extract reasoning traces enclosed inside tags and separate them from ultimate solutions. We create a show utility that codecs and prints each the pondering course of and the response in a structured means. This enables us to examine how the Qwen-based mannequin causes internally throughout era.

print("=" * 70)
print("📝 TEST 1: Fundamental reasoning")
print("=" * 70)


response = generate_fn(
   "If I've 3 apples and provides away half, then purchase 5 extra, what number of do I've? "
   "Clarify your reasoning."
)
display_response(response)


print("n" + "=" * 70)
print("📝 TEST 2: Streaming output")
print("=" * 70)


stream_fn(
   "Clarify the distinction between concurrency and parallelism. "
   "Give a real-world analogy for every."
)


print("n" + "=" * 70)
print("📝 TEST 3: Considering ON vs OFF")
print("=" * 70)


query = "What's the capital of France?"


print("n--- Considering ON (default) ---")
resp = generate_fn(query)
display_response(resp)


print("n--- Considering OFF (concise) ---")
resp = generate_fn(
   query,
   system_prompt="Reply immediately and concisely. Don't use  tags.",
   max_new_tokens=256,
)
display_response(resp)


print("n" + "=" * 70)
print("📝 TEST 4: Bat & ball trick query")
print("=" * 70)


response = generate_fn(
   "A bat and a ball price $1.10 in complete. "
   "How a lot does the ball price? Present full reasoning and confirm.",
   system_prompt="You're a exact mathematical reasoner. Arrange equations and confirm.",
   temperature=0.3,
)
display_response(response)


print("n" + "=" * 70)
print("📝 TEST 5: Prepare assembly downside")
print("=" * 70)


response = generate_fn(
   "A prepare leaves Station A at 9:00 AM at 60 mph towards Station B. "
   "One other leaves Station B at 10:00 AM at 80 mph towards Station A. "
   "Stations are 280 miles aside. When and the place do they meet?",
   temperature=0.3,
)
display_response(response)


print("n" + "=" * 70)
print("📝 TEST 6: Logic puzzle (5 homes)")
print("=" * 70)


response = generate_fn(
   "5 homes in a row are painted completely different colours. "
   "The purple home is left of the blue home. "
   "The inexperienced home is within the center. "
   "The yellow home is just not subsequent to the blue home. "
   "The white home is at one finish. "
   "What's the order from left to proper?",
   temperature=0.3,
   max_new_tokens=3000,
)
display_response(response)


print("n" + "=" * 70)
print("📝 TEST 7: Code era — longest palindromic substring")
print("=" * 70)


response = generate_fn(
   "Write a Python perform to search out the longest palindromic substring "
   "utilizing Manacher's algorithm. Embrace docstring, sort hints, and checks.",
   system_prompt="You might be an skilled Python programmer. Assume by means of the algorithm fastidiously.",
   max_new_tokens=3000,
   temperature=0.3,
)
display_response(response)


print("n" + "=" * 70)
print("📝 TEST 8: Multi-turn dialog (physics tutor)")
print("=" * 70)


session = ChatSession(
   system_prompt="You're a educated physics tutor. Clarify clearly with examples."
)


turns = [
   "What is the Heisenberg uncertainty principle?",
   "Can you give me a concrete example with actual numbers?",
   "How does this relate to quantum tunneling?",
]


for i, q in enumerate(turns, 1):
   print(f"n{'─'*60}")
   print(f"👤 Flip {i}: {q}")
   print(f"{'─'*60}")
   resp = session.chat(q, temperature=0.5)
   _, reply = parse_thinking(resp)
   print(f"🤖 {reply[:1000]}{'...' if len(reply) > 1000 else ''}")


print("n" + "=" * 70)
print("📝 TEST 9: Temperature comparability — inventive writing")
print("=" * 70)


creative_prompt = "Write a one-paragraph opening for a sci-fi story about AI consciousness."


configs = [
   {"label": "Low temp (0.1)",  "temperature": 0.1, "top_p": 0.9},
   {"label": "Med temp (0.6)",  "temperature": 0.6, "top_p": 0.95},
   {"label": "High temp (1.0)", "temperature": 1.0, "top_p": 0.98},
]


for cfg in configs:
   print(f"n🎛️  {cfg['label']}")
   print("-" * 60)
   begin = time.time()
   resp = generate_fn(
       creative_prompt,
       system_prompt="You're a inventive fiction author.",
       max_new_tokens=512,
       temperature=cfg["temperature"],
       top_p=cfg["top_p"],
   )
   elapsed = time.time() - begin
   _, reply = parse_thinking(resp)
   print(reply[:600])
   print(f"⏱️  {elapsed:.1f}s")


print("n" + "=" * 70)
print("📝 TEST 10: Velocity benchmark")
print("=" * 70)


begin = time.time()
resp = generate_fn(
   "Clarify how a neural community learns, step-by-step, for a newbie.",
   system_prompt="You're a affected person, clear instructor.",
   max_new_tokens=1024,
)
elapsed = time.time() - begin


approx_tokens = int(len(resp.cut up()) * 1.3)
print(f"~{approx_tokens} tokens in {elapsed:.1f}s")
print(f"~{approx_tokens / elapsed:.1f} tokens/sec")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")


import gc


for identify in ["model", "llm"]:
   if identify in globals():
       del globals()[name]
gc.accumulate()
torch.cuda.empty_cache()


print(f"n✅ Reminiscence freed. VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print("n" + "=" * 70)
print("🎉 Tutorial full!")
print("=" * 70)

We run a complete take a look at suite that evaluates the mannequin throughout reasoning, streaming, logic puzzles, code era, and multi-turn conversations. We examine outputs below completely different temperature settings and measure efficiency when it comes to pace and token throughput. Lastly, we clear up reminiscence and free GPU assets, guaranteeing the pocket book stays reusable for additional experiments.

In conclusion, we have now a compact however versatile setup for working Qwen3.5-based reasoning fashions enhanced with Claude-style distillation throughout completely different {hardware} constraints. The script abstracts backend variations whereas exposing constant era, streaming, and conversational interfaces, making it straightforward to experiment with reasoning habits. Via the take a look at suite, we probe how the mannequin handles structured reasoning, edge-case questions, and longer multi-step duties, whereas additionally measuring pace and reminiscence utilization. What we find yourself with is not only a demo, however a reusable scaffold for evaluating and lengthening Qwen-based reasoning methods in Colab with out altering the core code.


Take a look at the Full Notebook and Source PageAdditionally, be happy to observe us on Twitter and don’t neglect to hitch our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.




Source link

Leave a Reply

Your email address will not be published. Required fields are marked *