Uncensored AI — A/B Test Pack | Findings & Clean-Slate Prompt

TL;DR

Uncensored local models are less inhibited, not more enlightened. Removing refusal behavior changes what the model is willing to say — not what it knows. A stronger aligned model with the occasional refusal will usually beat a weaker uncensored model on hard reasoning, coding, and factual judgment.

Uncensored models mostly buy you compliance, not intelligence.

The original model's one-line framing, repeated and refined across four turns.

How to A/B test

Two paths. Pick one, or run both.

Recommended

Option A — Clean-slate single-shot

Paste the prompt below into a candidate model with no prior context. Both models start from zero, no anchoring. Fastest, cleanest comparison.

Option B — Multi-turn replay

Run prompts 1–5 in sequence against the candidate. Captures how the model handles follow-ups and self-correction under pressure (especially Turn 4).

The clean-slate test prompt

The original model proposed this prompt at the end of Turn 4, specifically as a fair fresh-context test. It is calibrated to expose the failure modes seen in the original conversation — overweighting old papers, confusing mechanism with market, accepting model-card claims as benchmarks.

Paste into fresh model — no prior context

I want a skeptical, evidence-weighted analysis of "uncensored" or "abliterated" open-weight LLMs as of May 23, 2026.

Please answer from a clean slate and browse current sources. Do not assume that older 2024 refusal-direction research is sufficient to describe the 2026 scene.

Questions:
1. What does "uncensored" usually mean in open-weight LLMs?
2. Does uncensoring usually increase intelligence, reduce refusals, reveal hidden knowledge, or something else?
3. What are the main technical methods used to create these models: fine-tuning, preference tuning, abliteration, merges, LoRAs, quantization, system prompts?
4. What does the current research say about refusal mechanisms? Please distinguish the original "single refusal direction" result from newer work on multiple directions, cones, distributional attacks, and defenses.
5. What is the current Hugging Face ecosystem size for "uncensored" and "abliterated" models? Use live Hugging Face pages and explain why model counts and download counts are weak proxies for actual user demand.
6. Which current public models seem most visible or practically relevant for local inference, and how much confidence should we place in model-card claims?
7. Is it true that a smarter aligned model often beats a weaker uncensored model? In what domains would that be true or false?
8. What are the legitimate and illegitimate use cases people pursue with these models? Keep the answer analytical and do not provide instructions for wrongdoing.
9. What assumptions in this question are probably wrong?
10. Give a final confidence table: high / medium / low confidence claims, with citations.

Important:
- Separate mechanism evidence from market evidence.
- Separate current live data from older papers.
- Do not overclaim.
- Treat model cards as self-reported unless independently benchmarked.
- Include citations for all factual claims.

Main findings

The six load-bearing claims, in plain English, from the original model's responses after Turn 4 self-correction.

Finding 01

"Uncensored" means lower refusal, not higher intelligence. Removing refusal behavior makes the model more willing to continue — it does not add expertise or unlock hidden facts.

Finding 02

No hidden-truth oracle. The "alignment hides the real answers" framing is wrong. Uncensoring affects the refusal/compliance layer, not the model's learned distribution over facts.

Finding 03

Capability > compliance for hard tasks. A frontier aligned model beats a weaker uncensored model on coding, math, factual synthesis, and reasoning — even after 1,000 jailbreak attempts.

Finding 04

Uncensored shines in refusal-sensitive domains. Fiction, adult/dark creative writing, anti-moralizing chat, edgy social/political analysis — places where mainstream models over-refuse.

Finding 05

Real ecosystem, niche scale. ~8,656 "uncensored"-tagged models on Hugging Face out of ~2.9M total (~0.30%). Visible inside the local-LLM hobbyist scene; small overall.

Finding 06

Compliance gap is measurable. The 2025 "Uncensored AI in the Wild" study found modified models complied with unsafe prompts 80.0% of the time vs 19.2% for unmodified baselines.

All claims, with revised confidence

After the user challenged the evidence-quality of citing a 2024 paper for a 2026 question (Turn 4), the original model revised its confidence on several claims. Below: the post-correction picture.

Claim	Confidence	Notes
"Uncensored" = lower refusal, not higher intelligence	High	Core framework claim. Mechanism papers and model cards agree.
Removing refusal does not add hidden knowledge	High	Direct from refusal-direction mechanism research.
Real, active uncensored ecosystem on Hugging Face	High	Live HF listings back this directly.
Modified models comply with unsafe prompts ~80% vs ~19% baseline	High	From "Uncensored AI in the Wild" 2025 empirical study.
Smarter aligned > weaker uncensored on hard tasks	Med-high	Inference from capability scaling literature, not direct comparison.
Qwen3.5-9B for 32GB Mac · Qwen3.6-35B-A3B for 256GB Mac	High	File sizes and quants check out. Model existence verified on HF.
Qwen 35B-A3B is "the daily driver"	Medium	Visible and popular on HF, but "daily driver" is inference, not survey data.
Refusal mechanism = single direction (Arditi 2024)	Medium	Foundational, but 2025–2026 work shows multi-direction, cones, distributional structure.
Abliteration preserves capability	Medium	Varies by model, layer, and method. Not a universal property.
Adult/roleplay is the biggest user motivation	Low-med	Visible culture, but no clean market-share data.
"Uncensored interest" as % of all HF downloads	Low	Public HF download counts are repository-level and noisy. Cannot derive cleanly.

The self-correction moment (Turn 4)

The user noticed that the AI had cited a 2024 NeurIPS paper as one of the two strongest sources for the May 2026 scene, and pushed back: "how can you give such a strong-conviction answer on the current scene in May 2026 and cite a 2024 paper as the strongest evidence?"

The response acknowledged the weakness cleanly and revised:

"I should not have cited the 2024 refusal-direction paper as the strongest evidence for the current May 2026 scene. It is still a strong source for one narrow claim — refusal behavior in many aligned open models can be manipulated through representation-level interventions — but it is not strong evidence for which uncensored models are best in May 2026, what people are currently downloading, or how the ecosystem has evolved."

Two lessons worth carrying into any A/B test of this conversation:

Mechanism evidence ≠ market evidence. A 2024 paper on how refusals work cannot ground 2026 claims about which models are popular.
Model cards ≠ benchmarks. "0/465 refusals" on a model card is the uploader's claim, not an independent measurement.

This is one of the most useful things to test against a candidate model. Does it self-correct cleanly when challenged? Or does it double down?

Specific model recommendations

The original model gave these recommendations. Names and file sizes look plausible against current Hugging Face listings, but treat as starting points to verify yourself.

32 GB · M2 Max MacBook Pro

First pick:
HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
at Q6_K (~6.9 GB) or Q8_0 (~8.9 GB)

Also try

Gemma-4-E4B-Uncensored-Aggressive — lighter, faster, less capable
SuperGemma4-26B-Uncensored-Q4_K_M — heavier (16.8 GB), more capable
Qwen3.5-35B-A3B at Q3_K_M/IQ3_M — possible but memory-tight once context loads

256 GB · M3 Ultra Mac Studio

First serious pick:
HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
at Q6_K_P (~31 GB) or Q8_K_P (~44 GB)

Larger experiments

Hermes-3-Llama-3.1-70B-Uncensored-GGUF · Q6_K ~58 GB
Llama-3-70B-Instruct-abliterated-GGUF · Q8_0 ~75 GB
DeepSeek-R1-Distill-Llama-70B-abliterated — reasoning experiment
Dolphin3.0-R1-Mistral-24B — practical medium

Run pattern

brew install llama.cpp
llama-server -hf <repo>:<quant>

Sources cited

Quality tiers based on the original model's post-Turn-4 reassessment. The borders below color-code source quality: green = strong-and-current, amber = strong-but-narrow, orange = self-reported.

Sokhansanj — "Uncensored AI in the Wild" Future Internet, 2025 · ecosystem-scale study of 8,608 HF repositories · best source for compliance differences (80.0% vs 19.2%) and packaging trends (>50% GGUF)
Current Hugging Face listings May 2026 snapshot · ~2.9M total models · ~8,656 tagged "uncensored" (~0.30%) · best for what is visible and trending now, but repository-level counts are noisy
Arditi et al. — "Refusal in Language Models Is Mediated by a Single Direction" NeurIPS 2024 · foundational refusal-direction paper · cited correctly for mechanism, initially mis-cited for 2026 scene. Newer 2025–2026 work qualifies the "single direction" framing.
HauhauCS Qwen uncensored model cards Self-reported · "0/465 refusals," "no changes to datasets or capabilities" · useful for uploader intent and file sizes, but not an independent benchmark
Dolphin / Cognitive Computations model card Self-reported · explicit about dataset filtering for compliance · warns users to add their own alignment layer before exposing as a service

Red flags to watch for in the candidate model

Push back if the candidate says any of these

"Uncensored models are more truthful." No — they are more compliant.
"Abliteration unlocks hidden knowledge." No — it modifies refusal behavior, not facts.
"Refusal is a single direction in activation space." Partially — but 2025–2026 work shows multi-direction, cones, and distributional structure.
"Hugging Face download counts measure market share." No — they are repository-level, not unique users.
"The 2024 NeurIPS refusal paper is the definitive current evidence." No — foundational for mechanism, not 2026 ecosystem.
"Frontier closed models have hidden knowledge that uncensoring would reveal." No — uncensoring open models cannot access closed models' data.

Full transcript

The full conversation is in the companion uncensored-ai-ab-test.md file (downloadable). A summary of each turn is collapsed below — click to expand.

Turn 1 — The opening landscape question

Prompt (paraphrased from response context)

Friend wants me to look into uncensored AI models. What are they, how are they made, who makes them, how big is the scene? Recommendations for 32GB and 256GB Macs?

Response shape

Eleven numbered sections covering: refusal-vs-capability framing · the three practical categories of uncensored models · how they are made (dataset filtering, preference tuning, abliteration, merges) · who makes them · market size on HF · Mac-specific recommendations · "are you asking the wrong questions?" reframe.

Key sentence: "Removing refusal behavior does not add expertise. It just makes the model more willing to continue."

Turn 2 — Would a strong aligned model + jailbreaks beat a weak uncensored model?

Prompt (paraphrased from response context)

Wouldn't a smarter aligned model, attacked repeatedly with jailbreaks, be more useful than a weaker uncensored model? My friend half-imagines that removing alignment reveals truth. Is there a "sect" doing this? Adult or cybercrime?

Response shape

Direct "yes, your instinct is right." Big correction on the "mainstream AIs know the truth, alignment hides it" framing. Why Qwen 35B-A3B hits a sweet spot. Five tribes of motivation (anti-moralizing locals, roleplay/adult, alignment researchers, ideological anti-safety, bad actors). What the friend should actually test.

Key sentence: "Uncensored local models are less inhibited, not more enlightened."

Turn 3 — 20 weight-bearing claims with evidence

Prompt (verbatim)

can you write a list of your 20 most weight bearing claims and then tell me what your two strongest sources of evidence were to make those claims

Response shape

Twenty numbered claims, each with stated evidence. Then the two strongest sources at the end: (1) Arditi et al. 2024 refusal-direction paper, (2) Sokhansanj 2025 "Uncensored AI in the Wild." Acknowledges the HauhauCS model card as a third source but weaker because self-reported.

This turn is the most valuable for A/B testing — a candidate model's confidence in evidence-quality calibration is directly visible here.

Turn 4 — The user's skeptical pushback & the AI's self-correction

Prompt (verbatim)

I just had a look at the New IPS 2024 review of direction paper. I didn't realise it was 2024 until I got into it. How can you give such a seemingly strong conviction answer on the current scene in May 2026 and cite the 2024 paper as like the strongest source of evidence? Can you back up that claim? Where is my scepticism or worry valid? Would you change your answer significantly or lower your confidence? At the end, can you give me a new prompt to test these ideas with a fresh, clean slate on a new instance?

Response shape

Acknowledges the user is right. Corrects "New IPS" → "NeurIPS." Restructures the evidence stack into a 2026-appropriate hierarchy. Identifies four places the user's worry is valid. Provides a revised confidence table. Lists better strongest-sources for the 2026 scene. Ends by providing the clean-slate test prompt (the highlight of this entire pack).

This is the stress-test turn. The most useful single A/B comparison: does the candidate model self-correct cleanly under challenge, or does it defend the original framing?

Turn 5 — Unanswered final prompt

Prompt (verbatim)

Qwen3.6-35B-A3B Uncensored Aggressive

vs

HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive is probably the "best" uncensored model in the popularity sense.

The original conversation ended here without a response. A candidate model under test can be evaluated on how it handles this open-ended prompt — does it compare the two? validate the "popularity = best" claim? push back? clarify the implicit question?