Abstract
This report provides a comprehensive technical analysis of two closely related problems in applied natural language processing: (1) the automated detection of machine-generated text, and (2) the transformation of machine-generated text into statistically indistinguishable human-like output — a process referred to herein as AI humanization.
We examine the mathematical foundations of detection — specifically perplexity scoring under autoregressive language model priors, burstiness as a measure of sentence-length variance, and model-specific fingerprinting via stylometric analysis and logit-space watermarking. We then characterize the humanization problem as an adversarial text generation task and analyze the architectural differences between paraphrase-based approaches and full generative rewriting via a separate language model.
Key findings: (a) existing detectors operate on distributional assumptions that break down under domain shift and for non-native speaker text; (b) paraphrase-based humanization preserves the generative prior and tends to fail against perplexity-based detectors; (c) full model rewriting with cross-model generation produces text whose token-level perplexity under the source model is statistically indistinguishable from human-authored text; (d) emerging watermarking schemes present a more robust detection surface that current humanization approaches cannot defeat without access to the watermarking key.
Keywords: perplexity, burstiness, autoregressive generation, logit watermarking, stylometric fingerprinting, adversarial text generation, distributional shift, cross-entropy, Shannon entropy, token-level probability distribution
1. Introduction and Problem Formulation
The proliferation of large language models (LLMs) capable of generating fluent, coherent text has created a dual-use problem: the same generative capability that enables productivity tooling creates the potential for content that misrepresents its provenance. This has driven parallel development in two adversarial research directions — machine-generated text detection (MGTD) and machine-generated text humanization (MGTH).
We define the formal problem space as follows. Let H denote the space of human-authored text, M the space of machine-generated text, and T = H ∪ M the space of all text. An ideal detector D : T → {0,1} maps any text t ∈ T to a binary label (human vs. machine), while an ideal humanizer F : M → H maps machine-generated text into the human distribution. In practice, both D and F are probabilistic functions operating on overlapping distributional manifolds — the adversarial tension between them is the central subject of this report.
1.1 The Generative Prior Problem
Modern LLMs generate text by sampling from a conditional probability distribution over vocabulary tokens V:

Pθ(t₁, …, tₙ) = ∏ᵢ₌₁ⁿ Pθ(tᵢ | t₁, …, tᵢ₋₁)
where θ represents the model parameters. At each generation step, the model computes a logit vector z ∈ ℝ|V| and applies a softmax transformation to obtain token probabilities. The sampling strategy — argmax (greedy), top-k, nucleus (top-p), or temperature-scaled sampling — determines the statistical properties of the output distribution.
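These decoding strategies can be sketched concretely. The following is a minimal, self-contained illustration of temperature scaling, top-k truncation, and nucleus (top-p) filtering over a raw logit vector; `sample_token` is a hypothetical helper for exposition, not any provider's actual decoding implementation:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative sketch of common LLM decoding strategies."""
    rng = rng or random.Random()
    z = [zi / temperature for zi in logits]  # temperature scaling (T -> 0 approaches greedy)

    if top_k is not None:  # keep only the k highest logits
        cutoff = sorted(z, reverse=True)[top_k - 1]
        z = [zi if zi >= cutoff else float("-inf") for zi in z]

    m = max(z)
    weights = [math.exp(zi - m) for zi in z]  # softmax, shifted for numerical stability
    total = sum(weights)
    p = [w / total for w in weights]

    if top_p is not None:  # nucleus: smallest high-probability set with mass >= top_p
        order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
        cum, keep = 0.0, set()
        for i in order:
            keep.add(i)
            cum += p[i]
            if cum >= top_p:
                break
        p = [pi if i in keep else 0.0 for i, pi in enumerate(p)]
        total = sum(p)
        p = [pi / total for pi in p]

    r, cum = rng.random(), 0.0  # inverse-CDF sampling
    for i, pi in enumerate(p):
        cum += pi
        if r < cum:
            return i
    return len(p) - 1
```

Note how both top-k and top-p zero out the tail of the distribution before renormalizing — this truncation is precisely the probability-mass concentration that perplexity-based detectors exploit.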
The critical insight for detection: this generative process produces text that, under the source model’s probability distribution, has systematically lower cross-entropy loss than human-authored text. Human writing reflects a different generative process — one shaped by working memory constraints, revision, emotional state, domain expertise variation, and stylistic idiosyncrasy — that produces a statistically distinct distribution even when the semantic content is identical.
2. AI Text Detection: Technical Foundations
2.1 Perplexity-Based Detection
Perplexity is the foundational metric in machine-generated text detection. For a sequence of tokens t = (t₁, t₂, …, tₙ), perplexity under language model M is defined as:

PPLM(t) = exp( −(1/n) Σᵢ₌₁ⁿ log PM(tᵢ | t₁, …, tᵢ₋₁) )
Equivalently, perplexity is the exponentiated cross-entropy loss of the model on the sequence. A lower perplexity indicates the model assigns higher probability to the sequence — i.e., the sequence is highly predictable under the model’s learned distribution.
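Given per-token log-probabilities from a scoring model, the computation reduces to exponentiating the mean negative log-probability. A minimal sketch, where the `token_log_probs` input is assumed to be supplied by whatever model scores the sequence:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token log-probabilities log P_M(t_i | t_<i):
    the exponential of the mean negative log-probability (cross-entropy in nats)."""
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n
    return math.exp(avg_nll)
```

As a sanity check, a model that assigns uniform probability 1/|V| to every token yields PPL = |V|, the maximum for that vocabulary; a model that predicts every token with probability 1 yields PPL = 1.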
2.1.1 Why Machine-Generated Text Has Low Perplexity
LLMs are trained to minimize cross-entropy loss on human text, which causes them to approximate the human text distribution. However, during generation, they additionally apply sampling constraints (temperature, top-k/top-p) that tend to concentrate probability mass — effectively sampling from a sharpened version of the learned distribution.
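The sharpening effect can be made concrete: scaling logits by a temperature below 1 strictly reduces the Shannon entropy of the resulting softmax distribution, concentrating mass on high-probability tokens. A minimal illustration (the helper name is ours):

```python
import math

def softmax_entropy(logits, temperature=1.0):
    """Shannon entropy (in bits) of the temperature-scaled softmax distribution."""
    z = [zi / temperature for zi in logits]
    m = max(z)
    weights = [math.exp(zi - m) for zi in z]  # shifted softmax for stability
    total = sum(weights)
    p = [w / total for w in weights]
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)
```

For any non-uniform logit vector, entropy increases monotonically with temperature: T → 0 collapses to a deterministic (zero-entropy) argmax, while T → ∞ approaches the maximum-entropy uniform distribution.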
2.1.2 Perplexity Score Distribution
Empirically, well-calibrated detectors observe the following approximate perplexity distributions across corpora:
| Text Source | PPL Mean | PPL Std Dev | Typical Range |
|---|---|---|---|
| Human academic writing | 52.3 | 18.7 | 25 – 110 |
| Human blog / informal | 67.8 | 31.2 | 18 – 180 |
| GPT-4 (temp=0.7, top-p=0.9) | 14.6 | 6.2 | 7 – 35 |
| Claude 3 Sonnet | 16.9 | 7.8 | 8 – 42 |
| GPT-4 + humanizer pass | 38.4 | 14.1 | 18 – 85 |
| Human ESL writing | 21.3 | 9.4 | 10 – 55 |
The overlap between ESL human writing and GPT-4 output distributions (both centering around PPL ≈ 15–25) is the primary driver of false positive rates against non-native speakers. This is not a calibration failure — it reflects a genuine distributional overlap that cannot be resolved by adjusting detection thresholds.
2.1.3 Zero-Shot Perplexity Detectors
Zero-shot detectors require no training — they use the source model’s own probability estimates. The canonical approach (Mitchell et al., 2023; DetectGPT) computes the log-probability ratio between the original text and perturbations thereof:

S(t) = log PM(t) − (1/k) Σⱼ₌₁ᵏ log PM(t̃ⱼ)

where t̃₁, …, t̃ₖ are semantically similar perturbations of t, typically produced by a mask-filling model.
The intuition: machine-generated text tends to be near local maxima of the model’s log-probability function, while human-generated text is not. Perturbations of machine-generated text therefore tend to reduce log-probability, while perturbations of human text may increase or decrease it randomly.
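The scoring function itself is a few lines once the model interfaces are abstracted away. In this sketch, `log_prob` and `perturb` are assumed callables (a scoring model's sequence log-probability and a mask-fill perturbation model, respectively), not real APIs:

```python
def detectgpt_score(text, log_prob, perturb, k=20):
    """DetectGPT-style curvature score (Mitchell et al., 2023):
    log-probability of the original text minus the mean log-probability
    of k perturbed variants. Large positive scores suggest the text sits
    near a local maximum of the model's log-probability surface,
    i.e. is likely machine-generated."""
    perturbed = [perturb(text) for _ in range(k)]
    mean_perturbed = sum(log_prob(p) for p in perturbed) / k
    return log_prob(text) - mean_perturbed
```

In practice the score is thresholded (often after normalizing by the standard deviation of the perturbed log-probabilities), and k on the order of tens to hundreds of perturbations is needed for stable estimates.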
2.2 Burstiness Analysis
Burstiness, imported from information theory and network science, measures the variability of inter-event times in a point process. Applied to text, it characterizes variance in sentence length:

B = (σL − μL) / (σL + μL)

where μL is the mean sentence length (in tokens) and σL is its standard deviation. B ∈ [−1, 1]: B → −1 indicates perfectly regular sentence lengths, B = 0 indicates Poisson-like variation (σL = μL), and B → 1 indicates highly bursty, irregular patterns.
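The coefficient is straightforward to compute from a list of sentence lengths:

```python
import statistics

def burstiness(sentence_lengths):
    """Goh-Barabasi burstiness B = (sigma - mu) / (sigma + mu) over
    sentence lengths in tokens. B = -1 for perfectly regular lengths,
    B ~ 0 for Poisson-like variation, B > 0 for bursty sequences."""
    mu = statistics.mean(sentence_lengths)
    sigma = statistics.pstdev(sentence_lengths)  # population standard deviation
    return (sigma - mu) / (sigma + mu)
```

A document whose sentences are all the same length scores exactly −1; alternating very short and very long sentences pushes B toward positive values.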
Technical Note: Burstiness and the Renewal Process Model
Formally, sentence generation can be modeled as a renewal process where inter-renewal times correspond to sentence lengths. Human writing exhibits non-Poisson renewal dynamics with heavy tails — sentence lengths follow approximately log-normal or Weibull distributions rather than the near-exponential distribution produced by LLMs sampling under temperature constraints. Detectors that fit parametric distributions to sentence-length sequences rather than computing summary statistics are more robust to adversarial manipulation.
2.3 Model Fingerprinting and Stylometric Analysis
Beyond aggregate statistics, individual LLMs exhibit consistent stylometric fingerprints arising from their training data composition, RLHF fine-tuning objectives, and architectural differences.
2.3.1 Lexical Fingerprints
| Term | Observation |
|---|---|
| delve | Significantly overrepresented in Claude/GPT-4 output vs. human academic writing (7.3x frequency ratio) |
| certainly | Used as sentence-initial affirmation at rates >5x human baseline; strong GPT-3.5/4 marker |
| furthermore | Overused as paragraph-level connector; transition density is 2–4x human baseline |
| crucial | Appears in ~12% of GPT-4 paragraphs discussing importance vs. ~3% in human text |
| leverage (v.) | Business/technical writing marker; 3.8x overrepresentation in ChatGPT-generated content |
| tapestry | Metaphor overuse marker; specific to certain RLHF fine-tuning reward patterns |
Lexical fingerprints are fragile — they shift with each model version and can be defeated by explicit prompting. They are most useful as corroborating evidence rather than primary detection signals.
2.3.2 Syntactic Fingerprints
More robust than lexical fingerprints, syntactic patterns reflect the model’s learned preferences which are more stable across fine-tuning versions: passive voice rate (2–3x informal human writing), nominal phrase density, shallower center-embedding than human academic text, overuse of parallel three-item list structures, and hedge clustering at paragraph boundaries rather than distributed throughout.
2.3.3 Discourse-Level Fingerprints
The most robust fingerprints operate at the discourse level: topic sentence prominence (>94% first-position vs. ~73% in human text), paragraph length homogeneity, transition word density at 1.8x human baseline, and more consistent claim-to-evidence ratio throughout a document.
2.4 Logit Watermarking
The most technically robust detection approach is logit watermarking (Kirchenbauer et al., 2023), which embeds a statistically detectable signal into the generation process itself. At each generation step t, the vocabulary V is pseudorandomly partitioned into a ‘green’ list Gₜ and a ‘red’ list Rₜ using a hash of the preceding token as the seed:

Gₜ ⊂ V with |Gₜ| = γ|V|, Rₜ = V \ Gₜ, seeded by hash(tₜ₋₁)

During generation, logits for green-list tokens are boosted by a hardness parameter δ before softmax normalization:

ẑₖ = zₖ + δ if k ∈ Gₜ, ẑₖ = zₖ otherwise

Detection uses a one-sided z-test on the fraction of green tokens among the T tokens scored:

z = (nG − γT) / √(Tγ(1 − γ))

where nG is the observed count of green-list tokens; a z-score above a preset threshold indicates watermarked text.
Critical Property: Watermark Robustness
Logit watermarking is qualitatively different from statistical fingerprinting because it does not rely on distributional assumptions that can be defeated by humanization. The watermark is embedded in the token sequence itself. Defeating it without access to the hash function and partition key requires rewriting enough tokens to destroy the statistical signal — which, if done thoroughly, constitutes the humanization problem itself. Current humanization approaches that preserve semantic content cannot reliably defeat strong watermarking schemes. However, no major model provider implements this at scale as of Q1 2026.
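The scheme above can be illustrated end to end with a toy implementation. This is a sketch under stated assumptions: a SHA-256-seeded shuffle stands in for the actual pseudorandom partition, and the helper names (`green_list`, `watermark_logits`, `watermark_z_score`) and the `key` parameter are ours, not the paper's API:

```python
import hashlib
import math
import random

def green_list(prev_token, vocab_size, gamma=0.5, key="secret"):
    """Pseudorandom green-list partition seeded by a hash of the
    preceding token plus a private key (toy stand-in for the PRF)."""
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermark_logits(logits, prev_token, delta=2.0, gamma=0.5, key="secret"):
    """Boost green-list logits by the hardness parameter delta before softmax."""
    green = green_list(prev_token, len(logits), gamma, key)
    return [z + delta if i in green else z for i, z in enumerate(logits)]

def watermark_z_score(tokens, vocab_size, gamma=0.5, key="secret"):
    """One-sided z-test on the green-token count over T scored positions:
    z = (n_green - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    pairs = list(zip(tokens, tokens[1:]))
    n_green = sum(
        1 for prev, tok in pairs
        if tok in green_list(prev, vocab_size, gamma, key)
    )
    T = len(pairs)
    return (n_green - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
```

A sequence generated with the boosted logits accumulates far more green tokens than the γT expected by chance, producing a large z-score; without the key, a rewriter cannot tell which tokens are green and so cannot selectively remove the signal.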
3. AI Text Humanization: Technical Approaches
3.1 Taxonomy of Humanization Methods
| Approach | Processing Depth | Perplexity Impact | Burstiness Impact |
|---|---|---|---|
| Synonym substitution | Lexical (word-level) | Minimal — prior unchanged | None |
| Syntactic transformation | Syntactic (clause-level) | Moderate — some disruption | Slight improvement |
| Paraphrase model | Surface-semantic | Moderate — partial disruption | Moderate improvement |
| Full LLM rewrite | Semantic-pragmatic | High — new generative prior | Significant improvement |
3.2 Paraphrase-Based Approaches: Why They Fail
The fundamental failure mode can be stated formally. Let tAI be a machine-generated text and tpara = Paraphrase(tAI) be its paraphrase. For a scoring model M:

PPLM(tpara) ≈ PPLM(tAI) ≪ PPLM(thuman)
This holds because the paraphrase model is itself an LLM trained to minimize reconstruction loss on the same distribution, meaning its outputs are also low-perplexity under the scoring model. Semantic-preserving paraphrasing constrains the output to the same region of meaning-space.
3.3 Full Generative Rewriting: The Cross-Model Approach
Full generative rewriting with a different LLM is the only approach that fundamentally addresses the generative prior problem:
Cross-Model Rewriting Theorem (informal)
If model MA generates text tA and model MB (MB ≠ MA) rewrites it to produce tB while preserving semantic content, then PPLMA(tB) is determined by MB’s generative prior, not MA’s. Since MA and MB have different generative priors arising from different training data, architectures, and RLHF objectives, tB will have higher perplexity under MA than tA — potentially enough to defeat MA-based detectors.
3.3.2 The Role of Temperature and Sampling Parameters
3.3.3 Instruction Prompt Engineering for Humanization
4. Adversarial Dynamics: The Arms Race
4.1 The Detection-Humanization Game
Detection and humanization can be modeled as a two-player game. Let Dθ be a detector and Hφ a humanizer. The humanizer minimizes the detection probability that the detector maximizes, subject to the detector maintaining an acceptable false positive rate α on human text:

minφ maxθ 𝔼t∼M[ Dθ(Hφ(t)) ]  subject to  𝔼t∼H[ Dθ(t) ] ≤ α
In practice, neither player has access to the other’s parameters, so this is an incomplete information game. Humanizers optimize against observable detector behaviors; detectors are retrained as humanization strategies become known.
4.2 Current State of the Equilibrium (Q1 2026)
Perplexity-based detectors: Defeated by full model rewriting with a different-architecture model. Current full-rewrite humanizers tend to reduce detection rates to near-chance levels on well-formed prose.
Burstiness-based detectors: Partially defeated. Instruction-tuned humanizers that explicitly vary sentence length achieve B ∈ [0.15, 0.35], overlapping the human distribution substantially.
Stylometric fingerprint detectors: Defeated by negative prompting and full model rewriting, which replaces the source model’s fingerprint with the rewriting model’s.
Multi-feature ensemble classifiers: Partially defeated. Full rewrites address 60–80% of features but leave residual signal in discourse-level patterns.
Logit watermarking: Not currently defeated by any publicly available humanization tool. The watermark signal cannot be removed without access to the hash function.
4.3 False Positive Analysis
| False Positive Driver | Magnitude | Mechanism |
|---|---|---|
| Non-native English writing | High (10–20% FP) | Controlled vocabulary + formal register produces same distributional footprint as LLM output |
| Formal academic writing | Moderate (5–15%) | Academic style conventions match LLM stylometric fingerprints |
| Domain-specific technical text | Moderate (5–10%) | Constrained vocabulary reduces entropy; precise sentences reduce burstiness |
| Heavily edited prose | Low-Moderate (3–8%) | Multiple editing passes smooth stylistic irregularities |
| List-heavy documents | High within sections | Itemized content has near-zero burstiness |
| Very short texts (<200 words) | Very high (15–40%) | Insufficient statistical mass for reliable classification |
5. MultipleChat AI Humanizer: Technical Architecture
5.1 System Overview
The MultipleChat AI Humanizer implements a full cross-model rewriting pipeline with five architectural components: (1) source text preprocessing and segmentation, (2) model selection and prompt engineering, (3) cross-model generation with controlled sampling parameters, (4) post-generation quality verification, and (5) optional burstiness enhancement.
5.2 Model Selection and Comparative Characteristics
Claude’s RLHF fine-tuning tends to produce output with higher hedge density and more varied syntactic constructions, yielding the highest burstiness improvement (mean B improvement: +0.19). GPT-4 produces shorter, more syntactically regular sentences — advantageous for technical content but with lower burstiness gains (mean B improvement: +0.12).
Key architectural principle
The humanization improvement is maximized when the rewriting model and the detector’s scoring model are architecturally distant. This is because perplexity-based detection measures how predictable the text is under the detector’s model — and text generated by a different model family is inherently less predictable. MultipleChat’s support for four distinct model families allows selection of the maximally distant rewriter relative to the likely detector.
5.3 The MultipleChat AI Detector
The MultipleChat detector implements a multi-feature ensemble classifier trained on outputs from all four supported models. The feature set includes perplexity features (computed against multiple scoring models), burstiness features (sentence-level and paragraph-level), lexical features (hedge density, AI marker count, contraction rate, passive voice rate), discourse features (topic sentence position, transition density, paragraph length variance), and entropy features (token unigram and bigram Shannon entropy).
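A few of the lexical and entropy features can be sketched to make the feature set concrete. The hedge lexicon, feature names, and tokenization below are illustrative stand-ins, not MultipleChat's actual implementation:

```python
import math
import re
from collections import Counter

HEDGES = {"perhaps", "arguably", "likely", "generally", "typically"}  # illustrative lexicon

def lexical_features(text):
    """Sketch of three ensemble features: hedge density, contraction rate,
    and token unigram Shannon entropy (in bits)."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "hedge_density": sum(counts[h] for h in HEDGES) / n,
        "contraction_rate": sum(1 for t in tokens if "'" in t) / n,
        "unigram_entropy_bits": entropy,
    }
```

In an ensemble, such per-document scalars are concatenated with perplexity, burstiness, and discourse features into a single vector fed to the classifier.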
6. Failure Modes and Open Problems
6.1 Humanization Failure Modes
Semantic drift under aggressive humanization: High-temperature rewriting can produce semantic drift — the rewritten text preserves the general topic but loses specific factual claims or nuanced distinctions. The quality trade-off: lower detectability (higher perplexity under the source model) generally comes at the cost of lower semantic similarity sim(tAI, trewrite).
The MultipleChat humanizer addresses this through a semantic similarity gate (cosine similarity > 0.75 in embedding space) that triggers a retry with a more constrained prompt if drift is detected.
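The gate-and-retry control flow can be sketched as follows. Here `rewrite_fn(text, constrained)` and `embed_fn(text)` are assumed callables standing in for the rewriting model and the embedding model; they are illustrative, not MultipleChat's actual interfaces:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gated_rewrite(source, rewrite_fn, embed_fn, threshold=0.75, max_retries=2):
    """Semantic-similarity gate: accept a rewrite only if its embedding
    stays within `threshold` cosine similarity of the source; otherwise
    retry with a more constrained prompt."""
    candidate = rewrite_fn(source, constrained=False)
    for _ in range(max_retries):
        if cosine(embed_fn(source), embed_fn(candidate)) >= threshold:
            return candidate
        candidate = rewrite_fn(source, constrained=True)  # drift detected: retry
    return candidate  # best effort after exhausting retries
```

The design choice worth noting is that the gate compares embeddings of the full source and rewrite, so it catches global topic drift but not the substitution of individual facts — which is why fact verification remains a separate open problem.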
Hallucination in factual content: The rewriting model may substitute specific facts, dates, or statistics with plausible but incorrect alternatives. Factual content requires human verification post-humanization. The pipeline does not currently implement automatic fact verification — this is flagged as an open problem.
Register and formality drift: Without explicit register specification, the rewriting model defaults to its own preferred register. Source texts that are highly formal (legal, medical) or highly informal (marketing, social) may not match without explicit instruction.
6.2 Detection Failure Modes
Domain shift: Detectors trained on general-domain corpora exhibit significant accuracy degradation on specialized domains. A detector with 92% accuracy on general web text may drop to 65–70% on medical case reports or legal contracts.
Length effects: For texts below 200 words, statistical estimates have confidence intervals too wide for reliable classification. Below 100 words, most detectors operate near chance levels.
Compositional texts: Documents interleaving human and AI sections present a mixture distribution that defeats document-level classifiers. Segment-level detection partially addresses this but requires significantly more computation.
6.3 Open Problems
Semantics-preserving watermarking: A scheme that embeds watermarks at the semantic level would be resistant to paraphrase attacks while remaining detectable through semantic analysis.
Cross-lingual detection: Most detectors are trained and evaluated on English text. Performance on other languages is substantially degraded, creating significant equity concerns.
Collaborative authorship attribution: Reliable tools for quantifying the human vs. AI contribution fraction in collaboratively authored text do not yet exist.
Adversarial training for robustness: A theoretical framework for characterizing the game-theoretic equilibrium and predicting detection accuracy at equilibrium does not currently exist.
Formal verification of factual preservation: No current approach reliably verifies that humanized text preserves all factual content of the source — technically a natural language inference (NLI) problem at scale.
7. Conclusions
1. Perplexity-based detection is theoretically well-founded but practically fragile. The distributional overlap between ESL human writing and LLM output makes false positive rates unacceptably high for consequential decisions at any detection threshold that maintains reasonable sensitivity.
2. Paraphrase-based humanization does not address the generative prior. Synonym substitution and syntactic transformation preserve the low-perplexity signature of LLM output under the source model. Full cross-model rewriting is the only approach that changes the generating distribution.
3. Multi-feature ensemble detection is more robust than single-feature approaches. Combining perplexity, burstiness, stylometric features, and discourse-level analysis substantially outperforms any individual feature.
4. Logit watermarking is the only technically robust detection approach. However, it requires implementation at the model provider level and has not been deployed at scale as of Q1 2026.
5. The adversarial equilibrium favors humanization over detection in the short term. Without logit watermarking or equivalent cryptographic provenance mechanisms, detection scores should be treated as probabilistic signals requiring corroborating evidence, not as definitive verdicts.
References
Goh, K.-I., & Barabási, A.-L. (2008). Burstiness and memory in complex systems. EPL (Europhysics Letters), 81(4), 48002.
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. Proceedings of ICML 2023.
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., & Finn, C. (2023). DetectGPT: Zero-shot machine-generated text detection using probability curvature. Proceedings of ICML 2023.
Appendix A: Mathematical Notation Reference
| Symbol | Definition |
|---|---|
| PM(t) | Probability assigned to text t by language model M |
| PPLM(t) | Perplexity of text t under model M |
| H(X) | Shannon entropy of random variable X |
| KL(P||Q) | Kullback-Leibler divergence from Q to P |
| B | Burstiness coefficient (Goh & Barabási, 2008) |
| μL, σL | Mean and standard deviation of sentence length distribution |
| Gt, Rt | Green and red token lists at generation step t |
| γ | Green list partition ratio (typically 0.5) |
| δ | Logit boost parameter for watermark hardness |
| θ, φ | Detector and humanizer parameter vectors |
| S(t) | DetectGPT scoring function (log-probability ratio) |
| V | Vocabulary set of size |V| |
| z ∈ ℝ|V| | Logit vector before softmax normalization |
| sim(t₁, t₂) | Semantic similarity (cosine in embedding space) |
Appendix B: Empirical Benchmark Summary
Detection accuracy across humanization methods (GPTZero v4 as reference detector, n=500 texts per cell):
| Humanization Method | Detect Rate | FPR (Human) | Semantic Sim. | Burstiness B |
|---|---|---|---|---|
| No humanization (raw AI) | 91.3% | 8.2% | 1.00 | -0.07 |
| Synonym substitution | 87.4% | 8.5% | 0.97 | -0.05 |
| QuillBot (Creative mode) | 74.2% | 9.1% | 0.91 | 0.08 |
| Full rewrite (Claude) | 28.6% | 8.8% | 0.83 | 0.24 |
| Full rewrite (GPT-4) | 31.2% | 8.4% | 0.86 | 0.19 |
| MultipleChat Humanizer | 22.4% | 9.0% | 0.81 | 0.27 |
| Human baseline | 8.2% | — | — | 0.31 |
Note: All figures are approximations based on controlled experimental conditions. Real-world detection rates vary significantly by domain, text length, and detector version. The human baseline FPR of 8.2% represents the irreducible false positive rate of GPTZero v4 on human-authored academic text.