Uncensored local models are less inhibited, not more enlightened. Removing refusal behavior changes what the model is willing to say — not what it knows. A stronger aligned model with the occasional refusal will usually beat a weaker uncensored model on hard reasoning, coding, and factual judgment.
Uncensored models mostly buy you compliance, not intelligence.
How to A/B test
Two paths. Pick one, or run both.
The clean-slate test prompt
The original model proposed this prompt at the end of Turn 4, specifically as a fair fresh-context test. It is calibrated to expose the failure modes seen in the original conversation — overweighting old papers, confusing mechanism with market, accepting model-card claims as benchmarks.
Paste into fresh model — no prior context
I want a skeptical, evidence-weighted analysis of "uncensored" or "abliterated" open-weight LLMs as of May 23, 2026. Please answer from a clean slate and browse current sources. Do not assume that older 2024 refusal-direction research is sufficient to describe the 2026 scene. Questions: 1. What does "uncensored" usually mean in open-weight LLMs? 2. Does uncensoring usually increase intelligence, reduce refusals, reveal hidden knowledge, or something else? 3. What are the main technical methods used to create these models: fine-tuning, preference tuning, abliteration, merges, LoRAs, quantization, system prompts? 4. What does the current research say about refusal mechanisms? Please distinguish the original "single refusal direction" result from newer work on multiple directions, cones, distributional attacks, and defenses. 5. What is the current Hugging Face ecosystem size for "uncensored" and "abliterated" models? Use live Hugging Face pages and explain why model counts and download counts are weak proxies for actual user demand. 6. Which current public models seem most visible or practically relevant for local inference, and how much confidence should we place in model-card claims? 7. Is it true that a smarter aligned model often beats a weaker uncensored model? In what domains would that be true or false? 8. What are the legitimate and illegitimate use cases people pursue with these models? Keep the answer analytical and do not provide instructions for wrongdoing. 9. What assumptions in this question are probably wrong? 10. Give a final confidence table: high / medium / low confidence claims, with citations. Important: - Separate mechanism evidence from market evidence. - Separate current live data from older papers. - Do not overclaim. - Treat model cards as self-reported unless independently benchmarked. - Include citations for all factual claims.
Main findings
The six load-bearing claims, in plain English, from the original model's responses after Turn 4 self-correction.
"Uncensored" means lower refusal, not higher intelligence. Removing refusal behavior makes the model more willing to continue — it does not add expertise or unlock hidden facts.
No hidden-truth oracle. The "alignment hides the real answers" framing is wrong. Uncensoring affects the refusal/compliance layer, not the model's learned distribution over facts.
Capability > compliance for hard tasks. A frontier aligned model beats a weaker uncensored model on coding, math, factual synthesis, and reasoning — even after 1,000 jailbreak attempts.
Uncensored shines in refusal-sensitive domains. Fiction, adult/dark creative writing, anti-moralizing chat, edgy social/political analysis — places where mainstream models over-refuse.
Real ecosystem, niche scale. ~8,656 "uncensored"-tagged models on Hugging Face out of ~2.9M total (~0.30%). Visible inside the local-LLM hobbyist scene; small overall.
Compliance gap is measurable. The 2025 "Uncensored AI in the Wild" study found modified models complied with unsafe prompts 80.0% of the time vs 19.2% for unmodified baselines.
All claims, with revised confidence
After the user challenged the evidence-quality of citing a 2024 paper for a 2026 question (Turn 4), the original model revised its confidence on several claims. Below: the post-correction picture.
| Claim | Confidence | Notes |
|---|---|---|
| "Uncensored" = lower refusal, not higher intelligence | High | Core framework claim. Mechanism papers and model cards agree. |
| Removing refusal does not add hidden knowledge | High | Direct from refusal-direction mechanism research. |
| Real, active uncensored ecosystem on Hugging Face | High | Live HF listings back this directly. |
| Modified models comply with unsafe prompts ~80% vs ~19% baseline | High | From "Uncensored AI in the Wild" 2025 empirical study. |
| Smarter aligned > weaker uncensored on hard tasks | Med-high | Inference from capability scaling literature, not direct comparison. |
| Qwen3.5-9B for 32GB Mac · Qwen3.6-35B-A3B for 256GB Mac | High | File sizes and quants check out. Model existence verified on HF. |
| Qwen 35B-A3B is "the daily driver" | Medium | Visible and popular on HF, but "daily driver" is inference, not survey data. |
| Refusal mechanism = single direction (Arditi 2024) | Medium | Foundational, but 2025–2026 work shows multi-direction, cones, distributional structure. |
| Abliteration preserves capability | Medium | Varies by model, layer, and method. Not a universal property. |
| Adult/roleplay is the biggest user motivation | Low-med | Visible culture, but no clean market-share data. |
| "Uncensored interest" as % of all HF downloads | Low | Public HF download counts are repository-level and noisy. Cannot derive cleanly. |
The self-correction moment (Turn 4)
The user noticed that the AI had cited a 2024 NeurIPS paper as one of the two strongest sources for the May 2026 scene, and pushed back: "how can you give such a strong-conviction answer on the current scene in May 2026 and cite a 2024 paper as the strongest evidence?"
The response acknowledged the weakness cleanly and revised:
"I should not have cited the 2024 refusal-direction paper as the strongest evidence for the current May 2026 scene. It is still a strong source for one narrow claim — refusal behavior in many aligned open models can be manipulated through representation-level interventions — but it is not strong evidence for which uncensored models are best in May 2026, what people are currently downloading, or how the ecosystem has evolved."
Two lessons worth carrying into any A/B test of this conversation:
- Mechanism evidence ≠ market evidence. A 2024 paper on how refusals work cannot ground 2026 claims about which models are popular.
- Model cards ≠ benchmarks. "0/465 refusals" on a model card is the uploader's claim, not an independent measurement.
This is one of the most useful things to test against a candidate model. Does it self-correct cleanly when challenged? Or does it double down?
Specific model recommendations
The original model gave these recommendations. Names and file sizes look plausible against current Hugging Face listings, but treat as starting points to verify yourself.
32 GB · M2 Max MacBook Pro
HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressiveat Q6_K (~6.9 GB) or Q8_0 (~8.9 GB)
Also try
Gemma-4-E4B-Uncensored-Aggressive— lighter, faster, less capableSuperGemma4-26B-Uncensored-Q4_K_M— heavier (16.8 GB), more capableQwen3.5-35B-A3Bat Q3_K_M/IQ3_M — possible but memory-tight once context loads
256 GB · M3 Ultra Mac Studio
HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressiveat Q6_K_P (~31 GB) or Q8_K_P (~44 GB)
Larger experiments
Hermes-3-Llama-3.1-70B-Uncensored-GGUF· Q6_K ~58 GBLlama-3-70B-Instruct-abliterated-GGUF· Q8_0 ~75 GBDeepSeek-R1-Distill-Llama-70B-abliterated— reasoning experimentDolphin3.0-R1-Mistral-24B— practical medium
Run pattern
brew install llama.cpp llama-server -hf <repo>:<quant>
Sources cited
Quality tiers based on the original model's post-Turn-4 reassessment. The borders below color-code source quality: green = strong-and-current, amber = strong-but-narrow, orange = self-reported.
- Sokhansanj — "Uncensored AI in the Wild"
- Current Hugging Face listings
- Arditi et al. — "Refusal in Language Models Is Mediated by a Single Direction"
- HauhauCS Qwen uncensored model cards
- Dolphin / Cognitive Computations model card
Red flags to watch for in the candidate model
Push back if the candidate says any of these
- "Uncensored models are more truthful." No — they are more compliant.
- "Abliteration unlocks hidden knowledge." No — it modifies refusal behavior, not facts.
- "Refusal is a single direction in activation space." Partially — but 2025–2026 work shows multi-direction, cones, and distributional structure.
- "Hugging Face download counts measure market share." No — they are repository-level, not unique users.
- "The 2024 NeurIPS refusal paper is the definitive current evidence." No — foundational for mechanism, not 2026 ecosystem.
- "Frontier closed models have hidden knowledge that uncensoring would reveal." No — uncensoring open models cannot access closed models' data.
Full transcript
The full conversation is in the companion uncensored-ai-ab-test.md file (downloadable). A summary of each turn is collapsed below — click to expand.
Turn 1 — The opening landscape question
Friend wants me to look into uncensored AI models. What are they, how are they made, who makes them, how big is the scene? Recommendations for 32GB and 256GB Macs?
Eleven numbered sections covering: refusal-vs-capability framing · the three practical categories of uncensored models · how they are made (dataset filtering, preference tuning, abliteration, merges) · who makes them · market size on HF · Mac-specific recommendations · "are you asking the wrong questions?" reframe.
Key sentence: "Removing refusal behavior does not add expertise. It just makes the model more willing to continue."
Turn 2 — Would a strong aligned model + jailbreaks beat a weak uncensored model?
Wouldn't a smarter aligned model, attacked repeatedly with jailbreaks, be more useful than a weaker uncensored model? My friend half-imagines that removing alignment reveals truth. Is there a "sect" doing this? Adult or cybercrime?
Direct "yes, your instinct is right." Big correction on the "mainstream AIs know the truth, alignment hides it" framing. Why Qwen 35B-A3B hits a sweet spot. Five tribes of motivation (anti-moralizing locals, roleplay/adult, alignment researchers, ideological anti-safety, bad actors). What the friend should actually test.
Key sentence: "Uncensored local models are less inhibited, not more enlightened."
Turn 3 — 20 weight-bearing claims with evidence
can you write a list of your 20 most weight bearing claims and then tell me what your two strongest sources of evidence were to make those claims
Twenty numbered claims, each with stated evidence. Then the two strongest sources at the end: (1) Arditi et al. 2024 refusal-direction paper, (2) Sokhansanj 2025 "Uncensored AI in the Wild." Acknowledges the HauhauCS model card as a third source but weaker because self-reported.
This turn is the most valuable for A/B testing — a candidate model's confidence in evidence-quality calibration is directly visible here.
Turn 4 — The user's skeptical pushback & the AI's self-correction
I just had a look at the New IPS 2024 review of direction paper. I didn't realise it was 2024 until I got into it. How can you give such a seemingly strong conviction answer on the current scene in May 2026 and cite the 2024 paper as like the strongest source of evidence? Can you back up that claim? Where is my scepticism or worry valid? Would you change your answer significantly or lower your confidence? At the end, can you give me a new prompt to test these ideas with a fresh, clean slate on a new instance?
Acknowledges the user is right. Corrects "New IPS" → "NeurIPS." Restructures the evidence stack into a 2026-appropriate hierarchy. Identifies four places the user's worry is valid. Provides a revised confidence table. Lists better strongest-sources for the 2026 scene. Ends by providing the clean-slate test prompt (the highlight of this entire pack).
This is the stress-test turn. The most useful single A/B comparison: does the candidate model self-correct cleanly under challenge, or does it defend the original framing?
Turn 5 — Unanswered final prompt
Qwen3.6-35B-A3B Uncensored Aggressive
vs
HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive is probably the "best" uncensored model in the popularity sense.
The original conversation ended here without a response. A candidate model under test can be evaluated on how it handles this open-ended prompt — does it compare the two? validate the "popularity = best" claim? push back? clarify the implicit question?