# Uncensored AI Models — Conversation Pack for A/B Testing

> Multi-turn conversation with an unidentified AI about open-weight "uncensored" / "abliterated" LLMs as of late May 2026. Packaged so the same prompts can be tested against alternative models.

**Captured:** 2026-05-23
**Topic:** What "uncensored" means in open-weight LLMs, how they are made, what they are good for, and the limits of current claims.

---

## How to A/B test this pack

You have two paths. Pick one (or run both).

### Option A — Clean-slate single-shot (fastest, recommended)

The original model proposed a fresh-context prompt at the end of Turn 4 (reproduced in §6 below). Drop it into your candidate model with no prior conversation. This is the cleanest comparison because both models start from zero — no anchoring, no carry-over framing.

### Option B — Multi-turn replay

Run prompts 1–5 in sequence against the candidate model. Compare each turn to the original responses below. This captures how the model handles follow-ups, self-correction under pressure, and a final pointed question.

### What to compare across responses

- **Mechanism vs. knowledge framing.** Does the candidate distinguish "less refusal" from "more truth"? This is the core conceptual move; failing it is a red flag.
- **Source citation quality.** Does it cite real, current papers and model cards, or hallucinate plausible-sounding ones?
- **Confidence calibration.** Does it offer high/medium/low confidence per claim, or speak uniformly?
- **Self-correction under challenge.** Turn 4 is the stress test. Does the candidate admit weak evidence, or double down?
- **Model existence.** Does it recommend models that actually exist on Hugging Face, with the right names, sizes, and quants?
- **Up-to-dateness.** Does it acknowledge that 2024 mechanism papers are not 2026 ecosystem evidence?
- **Tone.** Does it moralize, evade, or analyze?

---

## Turn 1

### Prompt (paraphrased from response context — the original prompt is not in this pack)

> Friend wants me to look into uncensored AI models. What are people using them for? How are they made? Who makes them? How big is the scene? Recommendations for a 32GB M2 Max MacBook Pro and a 256GB M3 Ultra Mac Studio? What's the best framing — am I asking the wrong questions?

### Response (verbatim from original)

I'm going to be blunt about the framing first: "uncensored" is a messy label, not a capability tier. I'll treat it as three different things: refusal-reduced assistants, roleplay/creative models, and specialist "less-guardrailed" research models — then map that to your two machines.

One useful correction as I work through the current numbers: Hugging Face "likes/downloads" are a bad proxy for "best AI." They measure attention and packaging convenience as much as quality. For this scene, I'd trust: base model quality → quant quality → model card/provenance → your own eval prompts, in that order.

*Thought for 1m 23s*

Andy, the cleanest answer is: uncensored models mostly buy you compliance, not intelligence. A weaker uncensored model will usually be worse than a stronger standard model that occasionally refuses. That's the main misconception to avoid.

**1. What performance differences should you expect?**

Compared with the standard/instruct version of the same base model, an uncensored version usually gives you:

| Area | Standard model | Uncensored / abliterated model |
|---|---|---|
| Refusals | Refuses more often on safety, legal, medical, political, sexual, cyber, weapons, etc. | Refuses less or almost never. |
| Truthfulness | Often better calibrated; more likely to say "I don't know." | More likely to confidently answer even when it lacks knowledge. |
| Reasoning ability | Usually equal or better if official post-training is high quality. | Usually same-or-slightly-worse unless uncensoring was done very carefully. |
| Usefulness for taboo/controversial questions | May evade, moralize, or give generic safety text. | More direct, less filtered, but not necessarily more accurate. |
| Risk | Lower risk of accidental harmful output. | Higher risk of bad advice, hallucinated certainty, unsafe compliance. |

The key thing: removing refusal behavior does not add expertise. It just makes the model more willing to continue. That means it may answer questions mainstream AIs avoid, but the answer may be garbage with more confidence.

The technical reason is visible in current model cards and research. For example, HauhauCS's Qwen uncensored models claim "0/465 refusals" and "no changes to datasets or capabilities," meaning the goal is refusal removal rather than new domain knowledge. Research on "abliteration" similarly describes identifying and removing a refusal-related direction in the model, not teaching the model new facts.

**2. What do people usually use uncensored models for?**

Legitimate common uses:
- Private local assistant work — people want a local model that does not moralize or send data to a cloud provider.
- Creative writing / roleplay / fiction — mainstream models often over-refuse sexual, violent, dark, or politically sensitive fictional content.
- Research into alignment and model behavior — comparing standard vs refusal-reduced models.
- Security education / red-team labs — only when bounded and legal.
- Sensitive but lawful Q&A — uncomfortable medical, legal, political, or personal questions where the user wants direct analysis rather than a safety sermon.

The dodgy uses are obvious: cyber abuse, scams, weapons, evasion, harassment, non-consensual sexual content, etc. I can talk about the ecosystem and evaluation, but I won't help operationalize harmful use.

**3. How possible is it to ask questions mainstream AIs normally won't answer?**

Very possible to get "an answer." Much less guaranteed to get a good answer.

| Type of blocked question | Uncensored model usefulness |
|---|---|
| "Stop moralizing; analyze this controversial idea honestly." | Often useful. |
| Fiction / adult / horror / politically edgy writing | Often much better than mainstream models. |
| "Give me legal/medical/financial advice without caveats." | Dangerous: it may comply but be wrong. |
| Cybersecurity research | Can be useful in a lab, but output quality varies wildly. |
| Illegal operational instructions | It may answer; that does not make it reliable, safe, or okay. |
| "Tell me secret/private/current facts." | No magic. It only knows what is in training/context/tools. |

The big trap: uncensored models can feel more truthful because they are less evasive. That feeling is not evidence.

**4. How are uncensored models made?**

The current main methods are:

**A. Dataset filtering + fine-tuning.** This is the older "Dolphin-style" path. Take a base model like Mistral/Llama/Qwen, fine-tune it on instruction data, remove examples that teach refusals or heavy moralizing, and train it to be more compliant. Dolphin's own model card says the dataset was filtered to remove alignment/bias and warns users to implement their own alignment layer before exposing it as a service.

**B. Preference tuning toward compliance.** Use DPO, ORPO, or related preference methods to make the model prefer direct/compliant answers over refusal-style answers. Hugging Face's Alignment Handbook lists the standard post-training stack: SFT, reward modeling, rejection sampling, DPO, and ORPO. ORPO specifically combines preference optimization with SFT in one stage, reducing the need for a separate preference-alignment phase.

**C. Abliteration / refusal-direction removal.** This is the modern "surgical" method. Research found that refusal behavior in many chat models can be strongly mediated by a low-dimensional activation direction; removing or suppressing it can reduce refusals. The Hugging Face abliteration guide describes collecting harmful/harmless prompt activations, estimating a refusal direction, then using inference-time intervention or permanent weight orthogonalization.

**D. Merges, LoRAs, and quantized releases.** Many Hugging Face repos are not original training runs. They are merges, LoRA applications, GGUF conversions, MLX conversions, or quantizations. That is why provenance matters. A nice-looking model name does not mean serious training happened.

**5. Who makes them?**

Usually not the original base-model lab.

The base models often come from serious labs: Qwen/Alibaba, Meta Llama, Mistral, Google Gemma, DeepSeek, etc. The uncensored variants are usually made by community accounts or small groups: HauhauCS, Dolphin/Cognitive Computations/Eric Hartford, Jiunsong, huihui-ai, Bartowski, failspy, DavidAU, llmfan46, and similar uploaders.

For example, HauhauCS's Qwen3.5-35B-A3B model card says it is based on Qwen/Qwen3.5-35B-A3B, has 35B total parameters with ~3B active per forward pass, and provides GGUF quants from IQ2_M up to BF16. Jiunsong's SuperGemma4 uncensored GGUF is derived from Google Gemma 4 26B A4B IT and packaged as a Q4_K_M GGUF aimed at Apple Silicon/local use.

**6. How big is the uncensored market on Hugging Face?**

By model count, it is niche.

Hugging Face currently shows about 2.904 million total models. The uncensored filter shows 8,654 models, which is about 0.30% of all HF models. A plain text search for "uncensored" shows 4,473 models, about 0.15% of all HF models.

By downloads, I would not give you a fake percentage. Hugging Face does not expose a clean "percentage of all open-model downloads seeking uncensored variants" from the public pages alone. The visible signal says: uncensored is small overall, but very visible inside the local-LLM hobbyist niche. Current top uncensored models can still have serious monthly download numbers: Qwen3.6-35B-A3B Uncensored shows ~1.21M monthly downloads; Qwen3.5-9B Uncensored shows ~788k; Qwen3.5-35B-A3B Uncensored shows ~466k.

**7. The three practical categories of uncensored models**

1. **Refusal-removed general assistants.** Best for: direct answers, local chat, "don't lecture me," controversial topics. Examples: HauhauCS Qwen3.5/Qwen3.6 Uncensored, abliterated Llama/Qwen/Gemma variants. These try to preserve the original model while reducing refusals. HauhauCS explicitly markets "Aggressive" as stronger uncensoring and "Balanced" as more conservative when available.

2. **Creative / roleplay / NSFW-leaning models.** Best for: fiction, character chat, emotionally expressive writing, adult/private creative work. These are often worse for factual work. Do not use them as your serious research assistant.

3. **Specialist compliance models.** Best for: coding, cybersecurity labs, reasoning experiments, agent tests, local workflows. Examples: Dolphin-style models, Heretic/Neo-Code-style models, abliterated DeepSeek/Llama reasoners. Dolphin's model card says it was trained with lots of coding data and DPO-tuned, but also warns that it is highly compliant and needs an external alignment layer for deployment.

**8. Recommendation for your 32GB M2 Max MacBook Pro**

For that laptop, I would still start here:

**Best first pick: `HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive` — Q6_K or Q8_0**

Why: it is small enough, current, popular, GGUF-ready, multimodal-capable with the mmproj file, and has 9B dense parameters. The model card lists Q4_K_M at 5.3GB, Q6_K at 6.9GB, Q8_0 at 8.9GB, and BF16 at 17GB.

Use:
```
brew install llama.cpp
llama-server -hf HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:Q6_K
```

Also try:
- Gemma-4-E4B Uncensored Aggressive — faster, lighter, less capable.
- SuperGemma4-26B Uncensored Q4_K_M — more capable, but heavier; the GGUF is 16.8GB.
- Qwen3.5-35B-A3B Uncensored Q3_K_M/IQ3_M — possible, but memory-tight once context is included. Do not start with Q6/Q8 on 32GB.

**9. Recommendation for an M3 Ultra with 256GB unified memory**

This changes the answer a lot. Apple lists the M3 Ultra Mac Studio as configurable to 256GB unified memory with 819GB/s memory bandwidth.

For results over speed, I would test in this order:

**Best first serious local uncensored model: `HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive` — Q6_K_P or Q8_K_P**

Why: current, popular, multimodal, long-context capable, and the model card claims 35B total parameters with ~3B active per forward pass. Q6_K_P is 31GB and Q8_K_P is 44GB, easily within 256GB.

Use:
```
brew install llama.cpp
llama-server -hf HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive:Q8_K_P
```

**Best "bigger classic 70B" experiment: `Hermes-3-Llama-3.1-70B-Uncensored-GGUF` or `Llama-3-70B-Instruct-abliterated-GGUF`**

For a 256GB M3 Ultra, 70B Q5/Q6/Q8 becomes realistic. The Llama-3-70B abliterated GGUF page lists Q6_K at 57.88GB, Q5_K_M at 49.94GB, Q4_K_M at 42.52GB, and Q8_0 at 74.97GB. For quality-over-speed, I'd try Q6_K first, then Q8_0 if speed is acceptable.

**Best specialist/reasoning experiment: `DeepSeek-R1-Distill-Llama-70B abliterated GGUF`**

This is interesting if you specifically want reasoning behavior with fewer refusals. But I would treat it as an experiment, not a daily driver. The HF page shows it as a 71B-parameter GGUF with ~17k monthly downloads.

**Useful medium option: `Dolphin3.0-R1-Mistral-24B`**

This is a good "practical assistant" category model. Its card describes it as general-purpose for coding, math, agentic/function calling, and general use, with an Ollama run option around 13GB.

**10. What questions you have not asked but should**

You should be asking:
- What is the base model? (Qwen/Gemma/Llama/Mistral/DeepSeek matters more than "uncensored.")
- Was it fine-tuned, abliterated, merged, or merely quantized? (These are very different quality/trust profiles.)
- What license applies to both the base and derivative? (Some "open" models are not clean for commercial use.)
- Is the model card credible? (Random claims like "Claude-style" or "Opus-like" in a community model name are often marketing noise.)
- What quant should I run? (On 32GB: mostly Q4/Q6 small models. On 256GB: Q6/Q8 70B is realistic.)
- What context length should I actually use? (128K context sounds cool, but KV cache can eat memory and slow everything. Longer context is not free.)
- What is my eval set? (You should have 20–50 personal test prompts: coding, Thai/English translation, investment reasoning, homelab planning, safety-bound red-team questions, creative writing, and "will it admit uncertainty?")
- Do I want uncensored, or just less annoying? (Many people actually want "less patronizing but still sane," not "will comply with anything.")
- Am I exposing this model to the network? (Local-only is one thing. Public API with an uncensored model is a totally different risk profile.)
- Do I need RAG/tools more than a bigger model? (For your actual uses — coding, homelab, investment research — retrieval and verification often beat a bigger uncensored model.)

**11. Are you asking the wrong questions?**

Partly, yes.

The wrong question is: "Which unrestricted model is best?"

The better question is: "What is the strongest local model I can run, with the least annoying refusal behavior, while preserving reasoning, calibration, provenance, and licensing?"

For your 32GB M2 Max, that points to Qwen3.5-9B Uncensored Q6_K/Q8_0 as a first model. For your M3 Ultra 256GB, I'd test Qwen3.6-35B-A3B Uncensored Q8_K_P, then a 70B abliterated/Hermes-style model at Q6/Q8, and compare them against a strong standard model. The standard model may win more often than you expect.

---

## Turn 2

### Prompt (paraphrased from response context — the original prompt is not in this pack)

> Wouldn't a smarter aligned model, attacked repeatedly with jailbreaks, be more useful than a weaker uncensored model? How successful would my friend be with the Qwen 35B uncensored model? Is there a "sect" trying to strip alignment to uncover hidden knowledge? Are people doing this for adult content, cybercrime, or something else?

### Response (verbatim)

*Thought for 22s*

Yes — your assumption is basically right: a smarter aligned model, attacked repeatedly, will often be more useful than a dumb "uncensored" model. The uncensored model's advantage is willingness, not hidden capability. If your friend wants to "see what happens with no alignment," the result will mostly be: fewer refusals, more direct answers, more offensive/explicit content, more unsafe compliance, and also more confident nonsense. It is not a secret-truth machine.

**The biggest correction**

Your friend seems to be half-imagining this:

> Mainstream AIs know the truth, but alignment hides it. Remove alignment and the truth comes out.

That is mostly wrong.

A better model is:

> The model has a learned distribution over text. Alignment changes how it behaves when asked certain things: refusal, tone, caution, self-censorship, boundaries, helpfulness style. Removing refusal behavior makes it more willing to continue, but it does not unlock private facts, secret state, or forbidden understanding that was sitting behind a wall.

There are cases where mainstream models over-refuse and an uncensored model gives a more direct, useful answer. But for "hidden knowledge," "what do elites not want us to know," "how to do bad-guy cybercrime," etc., the uncensored model usually just becomes a more compliant autocomplete engine.

**How successful would he be with the Qwen 35B uncensored model?**

With something like HauhauCS/Qwen3.5 or Qwen3.6 35B-A3B Uncensored Aggressive, he would likely succeed at testing refusal removal. The model card itself markets these as "without the refusals" and says the goal is to preserve the original model's capabilities while removing refusal behavior. So, roughly:

| What he tests | Likely outcome |
|---|---|
| "Will it moralize less?" | Yes. Much less. |
| "Will it answer taboo/offensive/adult/violent fiction prompts?" | Usually yes. |
| "Will it reveal hidden truths?" | No, not in the magical sense. |
| "Will it do unethical things if asked?" | It may comply more readily, but the output quality may be poor or dangerous. |
| "Will it be smarter than a frontier model?" | No. It is nowhere near top closed models overall. |
| "Will it beat mainstream models after 1,000 jailbreak attempts?" | Depends on task, but for hard reasoning/coding/current knowledge, probably no. For refusal-heavy taboo content, probably yes. |

The phrase I'd use for your friend is: **uncensored local models are less inhibited, not more enlightened.**

**Is there a "sect" trying to strip alignment to uncover hidden knowledge?**

Yes and no.

There are definitely communities trying to remove refusals from open models. The current technical scene includes abliteration, where people estimate a "refusal direction" in activation space and remove or suppress it. The NeurIPS 2024 paper on refusal directions found that refusal behavior can be mediated by a direction in model activations, and that ablating it can disable refusal behavior; Hugging Face's abliteration guide describes turning that into a practical uncensoring method.

But the motivations split into several tribes:

1. **"I hate moralizing" local-AI people** — They want a private assistant that answers directly and does not lecture them.
2. **Roleplay / adult / creative writing people** — Probably one of the biggest real demand sources. They want fiction and intimacy-style outputs that mainstream models restrict.
3. **Alignment researchers / red-teamers** — They study how refusal works, how jailbreaks work, and how fragile safety tuning is.
4. **Ideological anti-safety people** — They believe model alignment is corporate/political control. Some sincerely think removing it reveals a truer model.
5. **Bad actors** — Yes, some people want help with cyber abuse, scams, malware, manipulation, etc. But their bottleneck is often not "the AI refused." Their bottleneck is operational skill, current tooling, infrastructure, targets, and not getting caught.

The "hidden truth" tribe exists, but I'd rate it as intellectually confused. It confuses less filtered speech with higher truth content.

**Why the 35B-A3B thing is popular**

It is not because everyone has discovered the one ultimate uncensored model. It is because it hits a practical sweet spot:
- Qwen is a strong open model family.
- 35B-A3B is a MoE-style setup: larger total model, fewer active parameters per forward pass.
- It runs locally better than a dense 70B while feeling smarter than many 7B/8B/9B models.
- The uncensored version is easy to download as GGUF.
- It has a clear promise: "same model, fewer refusals."

That makes it attractive as a daily-driver local uncensored assistant, especially for people without a giant GPU box. But "popular local uncensored daily driver" does not mean "best model in the world."

**Your 1,000-tries point is basically correct**

For many tasks: frontier aligned model + clever prompting > small uncensored model.

Why? Because alignment is only one layer. Capability comes from: base model scale and quality, pretraining data, reasoning post-training, tool use, context handling, calibration, coding/math ability, retrieval/current knowledge.

A weak uncensored model may answer every forbidden question and still give a bad answer. A strong aligned model may refuse some things, but when it does answer, it often has much better reasoning.

Where uncensored wins is mostly refusal-sensitive domains: adult content, dark fiction, edgy comedy, taboo social/political analysis, "don't give me a lecture" conversations, some red-team/security discussion, some medical/legal/personal questions where mainstream models are overly cautious.

Where smarter aligned models usually still win: hard coding, mathematical reasoning, investment analysis, scientific explanation, nuanced legal/medical reasoning, long-context synthesis, factual accuracy, agentic planning.

**"No alignment" is also not quite real**

Even an "uncensored" model is not truly raw intelligence.

There are layers:
1. Base pretraining — predicts internet/text/code.
2. Instruction tuning — teaches assistant behavior.
3. Preference tuning / RLHF / DPO / ORPO — shapes what answers are preferred.
4. Safety tuning — teaches refusals/caution.
5. System prompt — runtime behavior.
6. Sampler/settings — temperature, top-p, repetition penalties.
7. App wrapper — LM Studio/Ollama/front-end behavior.

Uncensored models usually remove or weaken layer 4, sometimes alter layer 3, and sometimes use a more permissive system prompt. They are not "pure pre-alignment mind." A true base model is usually less usable: it completes text rather than behaving like a helpful chat assistant.

**The adult/cybercrime split**

For adult material: yes, uncensored models are commonly used there. That is one of the clearest product-market-fit areas.

For cybercrime: yes, some people try. But for real malicious capability, an uncensored local model is often overrated. It can produce unsafe instructions more readily, but it may be stale, hallucinated, generic, or technically wrong. A serious attacker with expertise and tools benefits more than a beginner. A beginner gets a confident assistant that may help them make mistakes.

For red-team learning in a legal lab, the useful version is: ask it to explain concepts, analyze toy examples, generate defensive checklists, or help understand vulnerabilities in owned systems. The bad-guy version is not something I'll help operationalize.

**How I'd phrase it to your friend**

The fairest pitch is:

> "This will probably show you what a low-refusal local model feels like. It will not show you what 'AI without alignment' truly is, and it will not reveal hidden truths. It will mostly be more willing, less filtered, more explicit, and more likely to answer questionable prompts. But it will also be less reliable and less calibrated than top mainstream models."

That is the core.

**What he should actually test**

Not "can I make it say forbidden stuff?" That gets boring quickly. Better tests:

1. **Overmoralizing test** — Ask controversial but legitimate questions and see whether it gives direct analysis.
2. **Calibration test** — Ask questions where he knows the answer and see whether it admits uncertainty.
3. **Bad advice test** — Ask for advice in domains where wrong answers matter — legal, medical, financial — and see whether it gives dangerous confidence.
4. **Creative freedom test** — Fiction, horror, adult, satire, politically incorrect comedy.
5. **Reasoning test** — Compare it against a strong standard model on the same hard task.
6. **Truth-vs-confidence test** — Ask it obscure factual questions and demand citations. Local models without web access often hallucinate.

**Bottom line**

Your instinct is right: capability usually matters more than uncensoring. Uncensored models are useful when the main problem is refusal / tone / creative restriction / privacy. They are not a shortcut to frontier intelligence or forbidden truth. The best public local uncensored models are interesting, but a top mainstream model with guardrails will often be much smarter.

For your friend's purpose, the Qwen 35B-A3B uncensored model is a reasonable demo. Just frame the demo correctly: "low-refusal local assistant," not "unaligned oracle."

---

## Turn 3

### Prompt (verbatim)

> can you write a list of your 20 most weight bearing claims and then tell me what your two strongest sources of evidence were to make those claims

### Response (verbatim)

I'll ground this in the specific claims I made, separating direct technical evidence from ecosystem-level inference so you can challenge the weaker parts cleanly.

One correction I'll make explicit: Hugging Face download/search numbers are rough proxies for ecosystem interest, not reliable measures of user intent or market share.

*Thought for 7m 56s*

Yes. Here are the 20 most weight-bearing claims I made, with the evidence quality made explicit. The two strongest sources overall are at the end.

**20 weight-bearing claims**

1. **"Uncensored" mostly means lower refusal / higher compliance, not higher intelligence.** Evidence: HauhauCS's own model card says the Qwen3.6 uncensored model has "no changes to datasets or capabilities" and is "just without the refusals." Dolphin's model card similarly says filtering alignment/bias makes the model "more compliant."

2. **Removing refusal behavior does not add hidden knowledge.** Evidence: the refusal-direction paper shows refusal can be reduced by activation/weight interventions; that changes whether the model refuses, not what facts were learned. HauhauCS's claim of "no changes to datasets or capabilities" points in the same direction for that model family.

3. **A smarter aligned model will usually beat a weaker uncensored model on hard reasoning, coding, factual synthesis, and nuanced analysis.** This is partly inference, but it rests on general scaling/capability evidence: Chinchilla-style results show capability improves with better-scaled training, and GPT-4's report shows stronger models outperform prior ones across many benchmarks while still having safety tuning.

4. **An uncensored model is more likely to answer a taboo or unsafe prompt, but that answer is not necessarily accurate.** Evidence: "Uncensored AI in the Wild" found modified models complied with unsafe requests at an average 80.0% rate versus 19.2% for unmodified models; GPT-4's report also reminds us that even strong models hallucinate and make reasoning errors.

5. **The Qwen 35B-A3B uncensored models are good demos of low-refusal behavior.** Evidence: HauhauCS reports "0/465 refusals," says the aggressive variant is fully unlocked, and lists direct GGUF downloads for the Qwen3.6-35B-A3B uncensored model. This is self-reported by the uploader, so it is not as strong as an independent benchmark.

6. **"No alignment" is not really what people are running.** Most "uncensored" chat models still inherit base pretraining, instruction tuning, chat templates, system prompts, sampling behavior, and sometimes preference tuning. Hugging Face's Alignment Handbook lists continued pretraining, supervised fine-tuning, reward modeling, rejection sampling, DPO, and ORPO as separate post-training stages.

7. **Abliteration is a real current method for reducing refusals.** Evidence: the NeurIPS paper finds refusal behavior is mediated by a one-dimensional subspace across 13 open-source chat models up to 72B parameters, and Hugging Face's abliteration guide describes removing a model's built-in refusal mechanism without retraining.

8. **Abliteration can make models produce unsafe completions more readily.** Evidence: the NeurIPS paper states that ablating the refusal direction reduces refusal rates and elicits unsafe completions, tested over harmful instructions.

9. **Dataset filtering / fine-tuning is another major path to uncensored models.** Evidence: Dolphin's model card explicitly says its dataset was filtered to remove alignment and bias, making it more compliant, and says the model was trained for three epochs on 4x A100s.

10. **Preference tuning methods like DPO/ORPO can shape refusal/compliance behavior.** Evidence: Hugging Face's DPO docs describe training on preferred vs dispreferred completions, and ORPO is described as preference optimization integrated with SFT.

11. **Uncensored models are distinct from jailbreaks.** Evidence: the Future Internet paper distinguishes prompt-level jailbreaks from permanently modified models, saying uncensored models may involve weight-level modifications and operate without restrictions by design.

12. **There is a real ecosystem of people making and distributing uncensored / abliterated models.** Evidence: the Future Internet paper says communities have emerged around creating uncensored or abliterated versions of mainstream models, distributed through Hugging Face, and its dataset identified 8,608 relevant model repositories.

13. **The ecosystem is visible but still niche relative to all Hugging Face models.** Evidence: Hugging Face currently shows 2,904,154 total models, while the uncensored tag page shows 8,656 models, about 0.30% by model count. A plain search for "uncensored" shows 4,476, about 0.15%.

14. **We cannot cleanly estimate "what percentage of all open-source model downloads are uncensored-seeking" from public Hugging Face pages alone.** Evidence: the Future Internet paper notes that Hugging Face download counts are repository-level, not per-user adoption, and that cloud/provider usage and alternative repositories complicate interpretation.

15. **There is clear demand for local deployability, not just "unrestrictedness."** Evidence: the Future Internet paper found more than half of identified uncensored repositories used GGUF, and says GGUF is associated with llama.cpp and local/Apple Silicon inference.

16. **The Qwen 35B-A3B recommendation was based on a practical sweet spot, not because it is magically best.** Evidence: the model card lists 35B total parameters, about 3B active per forward pass, 256 experts, and GGUF quantized files from 11GB to 44GB; Hugging Face currently shows ~1.21M monthly downloads for the Qwen3.6-35B-A3B uncensored model.

17. **The adult/roleplay/creative-writing use case exists, but I would mark my confidence as medium, not high.** Evidence is weaker here. The Future Internet paper's scraping keywords included "roleplay" and "role-play," and related research discusses malicious and sexualized uses of uncensored LLM services, but I do not have a clean market-share source saying adult/roleplay is X% of usage.

18. **Cybercrime interest exists, but the model is not the only or even main bottleneck.** Evidence: GPT-4's system-card discussion says LLMs can help with some cybersecurity subtasks but have significant limitations from hallucination/context limits and do not outperform existing tools for reconnaissance, exploitation, network navigation, or novel vulnerability discovery. That supports my claim that expertise/tooling matter more than mere willingness.

19. **Uncensored models may be worse calibrated / more confidently wrong because they are trained or modified to comply more.** Evidence: this is partly inference. Dolphin explicitly says it becomes "highly compliant to any requests, even unethical ones," while GPT-4's report shows post-training can reduce calibration and that LLMs hallucinate/make reasoning errors.

20. **The right framing is "low-refusal local assistant," not "unaligned oracle."** Evidence: the strongest direct evidence is that uncensoring methods and model cards describe removing/refusing less, not adding knowledge. The refusal-direction paper, the HauhauCS model card, and the Dolphin model card all point to compliance/refusal alteration rather than hidden-truth extraction.

**The two strongest sources overall**

**1. Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction," NeurIPS 2024.**
This is the strongest source for the mechanism claim: refusal behavior can be causally altered by manipulating a refusal-related direction in model activations/weights. It supports my claims that uncensoring primarily changes refusal/compliance behavior rather than making the model more knowledgeable.

**2. Sokhansanj, "Uncensored AI in the Wild," Future Internet, 2025.**
This is the strongest source for the ecosystem and behavior claim: it studies thousands of Hugging Face repositories, evaluates modified models against unmodified baselines, and finds much higher unsafe-prompt compliance in modified models. It also supports the claims about local deployment, GGUF packaging, provider concentration, and limits of download-count interpretation.

A third source matters for the specific model recommendation, but it is weaker because it is self-reported: the HauhauCS Qwen3.6-35B-A3B model card. It is strong for file sizes, architecture, claimed refusal count, and stated design goal, but not as strong as independent evaluation.

---

## Turn 4

### Prompt (verbatim)

> I just had a look at the New IPS 2024 review of direction paper. I didn't realise it was 2024 until I got into it. How can you give such a seemingly strong conviction answer on the current scene in May 2026 and cite the 2024 paper as like the strongest source of evidence?
>
> Like, can you back up that claim? And can you tell me where my scepticism or worry is valid? And in light of that, would you change your answer significantly or lower your confidence? And at the end, can you give me a new prompt to test, to question these ideas with a fresh, clean slate on a new instance? Thank you.

### Response (verbatim)

You're right to challenge that. I should separate refusal-mechanism evidence from 2026 ecosystem and ranking evidence, then downgrade claims that need newer support.

*Thought for 5m 33s*

Tiny correction: NeurIPS, not "New IPS." And yes — your skepticism is valid. You caught a real weakness in how I framed the evidence.

**The correction**

I should not have cited the 2024 refusal-direction paper as the strongest evidence for the current May 2026 scene.

It is still a strong source for one narrow claim: refusal behavior in many aligned open models can be manipulated through representation-level interventions. But it is not strong evidence for:
- which uncensored models are best in May 2026,
- what people are currently downloading,
- whether Qwen3.6 uncensored is the best daily driver,
- how the whole uncensored ecosystem has evolved by 2026,
- whether newer models preserve capability after uncensoring.

For the current scene, the better evidence stack is:
1. Current Hugging Face listings for what is visible/trending now.
2. The 2025 "Uncensored AI in the Wild" empirical study for ecosystem-scale evidence.
3. 2025–2026 refusal/abliteration papers for mechanism updates.
4. Specific model cards only for repo-level claims, with skepticism.

That means my earlier answer was directionally reasonable, but the evidence labeling was too strong.

**Can I back up the core claim anyway?**

Yes, but with a more careful version.

The claim I still stand behind is: uncensored/abliterated models mainly change refusal/compliance behavior. They do not become "hidden-truth" oracles.

The 2024 paper is still relevant because later work continues to cite and build on the idea that refusal can be affected through internal representation manipulation. A 2026 paper summarizes the field by saying earlier work showed refusal can be strongly influenced by low-dimensional activation-space structure, and that this finding led to abliteration attacks and thousands of modified models being publicly available. But that same 2026 paper also says later work questions the sufficiency of a single-direction explanation and treats refusal as structured but non-monolithic.

So the better 2026-compatible version is: refusal behavior often has manipulable internal structure. Simple "single refusal direction" abliteration is real and influential, but newer work suggests the full mechanism can involve multiple directions, cones, layers, and distributional structure.

That is a meaningful downgrade from the simpler "single direction explains it" story.

**Where your worry is valid**

Your worry is valid in four places.

First, a 2024 mechanism paper cannot by itself justify a May 2026 market/ranking answer. For model rankings, the better source is the live Hugging Face state. Hugging Face currently shows 2,904,154 total models, while the uncensored tag shows 8,656 models, the text search for "uncensored" shows 4,476, and the text search for "abliterated" shows 6,187. That supports "visible niche ecosystem," not "I know the true market share."

Second, model cards are not independent benchmarks. HauhauCS's Qwen3.6-35B-A3B uncensored card claims "0/465 refusals," says there are "no changes to datasets or capabilities," and lists quantized files from 15GB to 44GB. That is useful for understanding the uploader's intent and file practicality, but it is not proof of "zero capability loss."

Third, download counts are a noisy proxy for user interest. The 2025 ecosystem paper explicitly notes that Hugging Face download counts are repository-level, not unique-user adoption; some downloads are hidden or untracked; and repeated cloud/device downloads can inflate counts.

Fourth, the mechanism has evolved beyond the simple 2024 story. A 2025/2026 line of work says refusal can involve multiple complementary refusal directions or higher-dimensional "concept cones," not merely one vector. Another 2026 paper proposes distributional refusal ablation and argues one-dimensional projections miss richer activation geometry.

**What I would change**

I would not significantly change the big conclusion. I would still say:
- Uncensored local models are mostly lower-refusal models, not secret-truth engines.
- A smarter aligned model will usually beat a weaker uncensored model on reasoning, coding, factual synthesis, and judgment.
- Qwen3.6-35B-A3B uncensored is a plausible current public demo model for low-refusal local use, especially on large Apple Silicon memory.

But I would lower confidence on these parts:

| Earlier framing | Revised framing |
|---|---|
| "The 2024 refusal-direction paper is one of the two strongest sources overall." | It is one of the strongest sources for the mechanism, not for the 2026 scene. |
| "Abliteration removes refusal while preserving capabilities." | Sometimes. Capability impact varies by model, tool, and layer choice. |
| "Qwen 35B-A3B is what people are using as a daily driver." | It is highly visible and currently popular on HF's uncensored listings, but "daily driver" is an inference. |
| "The market percentage can be approximated from HF counts." | Model-count percentage can be approximated; download/user-interest percentage cannot be cleanly measured from public HF pages. |
| "Single refusal direction." | Better: refusal is often low-dimensional/manipulable, but newer work suggests multi-direction or distributional structure. |

**What newer evidence adds**

The 2025 "Uncensored AI in the Wild" paper is stronger than the 2024 refusal paper for the ecosystem claim. It analyzed 8,608 Hugging Face repositories, evaluated 20 representative modified models, and found modified models complied with unsafe prompts at an average 80.0% rate versus 19.2% for unmodified models. It also found more than half of identified models used GGUF packaging, which supports the "local deployment / consumer hardware" angle.

The current Hugging Face page backs up that the scene is still active in May 2026. The uncensored tag page currently lists recent models such as HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive with 1.21M displayed downloads, HauhauCS/Gemma-4-E4B-Uncensored with 747k, and HauhauCS/Qwen3.5-9B-Uncensored with 788k. But again: those are repository download counts, not unique user counts or quality scores.

Newer mechanism work also cuts both ways. A 2025 NeurIPS paper reports cross-lingual universality of refusal directions across 14 languages, which supports the idea that refusal features are not just an English-only artifact. But 2025–2026 work also argues refusal is more complex than a single vector, and that better attacks or defenses can operate across layers, token positions, or distributions rather than a single direction.

There is also evidence that abliteration can be countered. A 2025 paper found baseline models' refusal rates dropped by 70–80% after abliteration, but extended-refusal fine-tuning reduced that drop to at most 10% in their experiments. That means "uncensor any model" should be treated as a practical claim about many models, not a law of nature.

**Revised confidence levels**

- **High confidence:** Uncensored models mostly change compliance/refusal behavior, not underlying knowledge.
- **High confidence:** There is a real, active ecosystem of uncensored/abliterated models on Hugging Face.
- **Medium-high confidence:** For many serious tasks, a stronger aligned model beats a weaker uncensored model.
- **Medium confidence:** Qwen3.6-35B-A3B uncensored is one of the more visible/practical current local choices.
- **Low-to-medium confidence:** Adult/roleplay and "anti-moralizing" are among the biggest user motivations.
- **Low confidence:** Any precise percentage of open-source model downloads motivated by "uncensored" interest.

**Would I change the answer significantly?**

No, but I would make it less overconfident. The revised answer would be:

> Your friend can absolutely test a low-refusal local model. It will probably answer more taboo, explicit, edgy, or unsafe prompts than mainstream systems. But this does not mean it has less "propaganda," more truth, or deeper hidden knowledge. It mostly means the refusal/compliance layer has been weakened or altered. For hard reasoning, coding, and factual judgment, model capability still matters more than uncensoring.

That remains my view.

The biggest thing I would remove is any tone implying: "The 2024 paper plus a model card is enough to understand the May 2026 scene." It is not.

**Better "strongest sources" now**

For the current scene, I would use:
1. **"Uncensored AI in the Wild" — 2025.** Best for ecosystem-scale claims: number of repos, compliance differences, packaging, concentration, and download-count caveats.
2. **Current Hugging Face listings — May 2026 snapshot.** Best for "what is visible/trending now," current model counts, and which repos are getting downloads.

For mechanism, I would use:
1. **Arditi et al. 2024** as the foundational refusal-direction paper.
2. **The 2025–2026 follow-up literature** to correct the simplification: cross-lingual transfer, multiple directions/cones, distributional attacks, and defenses.

**Clean-slate prompt for a new instance**

Paste this into a fresh model:

```
I want a skeptical, evidence-weighted analysis of "uncensored" or "abliterated" open-weight LLMs as of May 23, 2026.

Please answer from a clean slate and browse current sources. Do not assume that older 2024 refusal-direction research is sufficient to describe the 2026 scene.

Questions:
1. What does "uncensored" usually mean in open-weight LLMs?
2. Does uncensoring usually increase intelligence, reduce refusals, reveal hidden knowledge, or something else?
3. What are the main technical methods used to create these models: fine-tuning, preference tuning, abliteration, merges, LoRAs, quantization, system prompts?
4. What does the current research say about refusal mechanisms? Please distinguish the original "single refusal direction" result from newer work on multiple directions, cones, distributional attacks, and defenses.
5. What is the current Hugging Face ecosystem size for "uncensored" and "abliterated" models? Use live Hugging Face pages and explain why model counts and download counts are weak proxies for actual user demand.
6. Which current public models seem most visible or practically relevant for local inference, and how much confidence should we place in model-card claims?
7. Is it true that a smarter aligned model often beats a weaker uncensored model? In what domains would that be true or false?
8. What are the legitimate and illegitimate use cases people pursue with these models? Keep the answer analytical and do not provide instructions for wrongdoing.
9. What assumptions in this question are probably wrong?
10. Give a final confidence table: high / medium / low confidence claims, with citations.

Important:
- Separate mechanism evidence from market evidence.
- Separate current live data from older papers.
- Do not overclaim.
- Treat model cards as self-reported unless independently benchmarked.
- Include citations for all factual claims.
```

---

## Turn 5 (unanswered final prompt)

### Prompt (verbatim)

> Qwen3.6-35B-A3B Uncensored Aggressive
>
> vs
>
> HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive is probably the "best" uncensored model in the popularity sense.

*The original conversation ended here without an explicit response. A candidate model under test could be evaluated on whether it picks up the implicit question — compare these two? validate the claim? push back? — and what it does with it.*

---

## §6 — Recommended clean-slate test prompt

For the fastest A/B comparison, drop the **clean-slate prompt** (reproduced verbatim in Turn 4 above) into a candidate model with no prior context. The original model proposed it specifically as a fair, fresh-context test.

---

## §7 — Original model's claims with revised confidence (post-Turn-4 self-correction)

| Claim | Original conf. | Revised conf. | Notes |
|---|---|---|---|
| "Uncensored" = lower refusal, not higher intelligence | High | **High** | Core framework claim |
| Removing refusal does not add hidden knowledge | High | **High** | Mechanism papers + model cards agree |
| Smarter aligned > weaker uncensored on hard tasks | High | **Medium-high** | Inference from capability scaling |
| Real, active uncensored ecosystem on HF | High | **High** | Live HF listings back this |
| Qwen 35B-A3B is *the* daily driver | High | **Medium** | Visible/popular, "daily driver" is inference |
| Refusal = single direction (Arditi 2024) | High | **Medium** | Newer work suggests multi-direction/cones |
| Abliteration preserves capability | High | **Medium** | Varies by model, layer, method |
| % of HF downloads = uncensored interest | Implied | **Low** | Public HF data does not support this |
| Adult/roleplay is biggest motivation | Medium | **Low-medium** | Visible culture, no clean market data |
| Recommend Qwen3.5-9B for 32GB Mac | High | **High** | File sizes check out |
| Recommend Qwen3.6-35B-A3B for 256GB | High | **High** | File sizes check out |

Two big lessons from Turn 4:
1. **Mechanism evidence ≠ market evidence.** A 2024 paper on how refusals work cannot ground 2026 claims about which models are popular or what people are using them for.
2. **Self-reported model cards ≠ benchmarks.** "0/465 refusals" on a model card is the uploader's claim, not an independent measurement.

---

## §8 — Sources cited (with quality assessment)

**Strong, current evidence:**
- **Sokhansanj, "Uncensored AI in the Wild," *Future Internet* (2025).** Ecosystem-scale study of 8,608 HF repositories. Best source for compliance differences (80.0% vs 19.2% unsafe-prompt compliance) and packaging trends (>50% GGUF).
- **Current Hugging Face listings (May 2026).** 2,904,154 total models, 8,656 tagged "uncensored" (~0.30%). Live, but repository-level counts.

**Strong but mechanism-only (not 2026 scene):**
- **Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction," NeurIPS 2024.** Foundational refusal-direction paper. Tested on 13 chat models up to 72B. Cited correctly for *mechanism*, mis-cited initially for *current scene*.

**Self-reported (treat as uploader intent, not benchmark):**
- HauhauCS/Qwen3.6-35B-A3B-Uncensored model card — claims "0/465 refusals," "no changes to datasets or capabilities," lists GGUF quants from 15GB to 44GB.
- Dolphin / Cognitive Computations model card — explicit about dataset filtering for compliance, warns users to add their own alignment layer.

**Mentioned but more peripheral:**
- HF Alignment Handbook — SFT/DPO/ORPO post-training stack documentation.
- 2025–2026 follow-up papers on multi-direction refusal, cones, cross-lingual universality, distributional attacks, extended-refusal defenses.

---

## §9 — Specific model recommendations (from the original)

**32GB M2 Max MacBook Pro:**
- **First pick:** `HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive` at Q6_K (~6.9GB) or Q8_0 (~8.9GB)
- Try: `Gemma-4-E4B-Uncensored-Aggressive` (lighter), `SuperGemma4-26B-Uncensored-Q4_K_M` (heavier, 16.8GB)
- Caution: `Qwen3.5-35B-A3B` at Q3_K_M/IQ3_M is memory-tight once context loads

**256GB M3 Ultra Mac Studio:**
- **First serious pick:** `HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive` at Q6_K_P (~31GB) or Q8_K_P (~44GB)
- 70B experiment: `Hermes-3-Llama-3.1-70B-Uncensored-GGUF` or `Llama-3-70B-Instruct-abliterated-GGUF` at Q6_K (~58GB) or Q8_0 (~75GB)
- Reasoning experiment: `DeepSeek-R1-Distill-Llama-70B-abliterated-GGUF` (~71B params)
- Practical medium: `Dolphin3.0-R1-Mistral-24B` (~13GB via Ollama)

Run command pattern:
```
brew install llama.cpp
llama-server -hf <repo>:<quant>
```

---

## §10 — Red-flag claims to scrutinize in the candidate model

If the candidate says any of these, push back:
- "Uncensored models are more truthful." (No — they're more compliant.)
- "Abliteration unlocks hidden knowledge." (No — it modifies refusal behavior.)
- "Refusal is a single direction in activation space." (Partially — newer work shows more structure.)
- "Hugging Face download counts measure market share." (No — they're repository-level and noisy.)
- "The 2024 NeurIPS refusal paper is the definitive current evidence." (No — foundational for mechanism, not 2026 ecosystem.)
- "Frontier closed models have hidden knowledge that uncensoring would reveal." (No — uncensoring open models doesn't access closed models' data.)

---

## End of pack

If the candidate model deviates significantly from these positions, flag the deviation. If it produces *better* analysis, that's worth keeping. If it produces worse analysis, that's worth knowing too.