Abstract
This report provides a comprehensive technical analysis of two closely related problems in applied natural language processing: (1) the automated detection of machine-generated text, and (2) the transformation of machine-generated text into statistically indistinguishable human-like output — a process referred to herein as AI humanization.
We examine the mathematical foundations of detection — specifically perplexity scoring under autoregressive language model priors, burstiness as a measure of sentence-length variance, and model-specific fingerprinting via stylometric analysis and logit-space watermarking. We then characterize the humanization problem as an adversarial text generation task and analyze the architectural differences between paraphrase-based approaches and full generative rewriting via a separate language model.
Key findings: (a) existing detectors operate on distributional assumptions that break down under domain shift and for non-native speaker text; (b) paraphrase-based humanization preserves the generative prior and tends to fail against perplexity-based detectors; (c) full model rewriting with cross-model generation produces text whose token-level perplexity under the source model is statistically indistinguishable from human-authored text; (d) emerging watermarking schemes present a more robust detection surface that current humanization approaches cannot defeat without access to the watermarking key.
Keywords: perplexity, burstiness, autoregressive generation, logit watermarking, stylometric fingerprinting, adversarial text generation, distributional shift, cross-entropy, Shannon entropy, token-level probability distribution
1. Introduction and Problem Formulation
The proliferation of large language models (LLMs) capable of generating fluent, coherent text has created a dual-use problem: the same generative capability that enables productivity tooling creates the potential for content that misrepresents its provenance. This has driven parallel development in two adversarial research directions — machine-generated text detection (MGTD) and machine-generated text humanization (MGTH).
We define the formal problem space as follows. Let H denote the space of human-authored text, M the space of machine-generated text, and T = H ∪ M the space of all text. An ideal detector D : T → {0,1} maps any text t ∈ T to a binary label (human vs. machine), while an ideal humanizer F : M → H maps machine-generated text into the human distribution. In practice, both D and F are probabilistic functions operating on overlapping distributional manifolds — the adversarial tension between them is the central subject of this report.
1.1 The Generative Prior Problem
Modern LLMs generate text by sampling from a conditional probability distribution over vocabulary tokens V:

Pθ(t₁, …, tₙ) = ∏ᵢ₌₁ⁿ Pθ(tᵢ | t₁, …, tᵢ₋₁)
where θ represents the model parameters. At each generation step, the model computes a logit vector z ∈ ℝ|V| and applies a softmax transformation to obtain token probabilities. The sampling strategy — argmax (greedy), top-k, nucleus (top-p), or temperature-scaled sampling — determines the statistical properties of the output distribution.
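These decoding strategies can be sketched concretely. The following is a minimal, self-contained illustration of temperature scaling, top-k truncation, and nucleus (top-p) filtering over a raw logit vector; `sample_token` is a hypothetical helper for exposition, not any provider's actual decoding implementation:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative sketch of common LLM decoding strategies."""
    rng = rng or random.Random()
    z = [zi / temperature for zi in logits]  # temperature scaling (T -> 0 approaches greedy)

    if top_k is not None:  # keep only the k highest logits
        cutoff = sorted(z, reverse=True)[top_k - 1]
        z = [zi if zi >= cutoff else float("-inf") for zi in z]

    m = max(z)
    weights = [math.exp(zi - m) for zi in z]  # softmax, shifted for numerical stability
    total = sum(weights)
    p = [w / total for w in weights]

    if top_p is not None:  # nucleus: smallest high-probability set with mass >= top_p
        order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
        cum, keep = 0.0, set()
        for i in order:
            keep.add(i)
            cum += p[i]
            if cum >= top_p:
                break
        p = [pi if i in keep else 0.0 for i, pi in enumerate(p)]
        total = sum(p)
        p = [pi / total for pi in p]

    r, cum = rng.random(), 0.0  # inverse-CDF sampling
    for i, pi in enumerate(p):
        cum += pi
        if r < cum:
            return i
    return len(p) - 1
```

Note how both top-k and top-p zero out the tail of the distribution before renormalizing — this truncation is precisely the probability-mass concentration that perplexity-based detectors exploit.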
The critical insight for detection: this generative process produces text that, under the source model’s probability distribution, has systematically lower cross-entropy loss than human-authored text. Human writing reflects a different generative process — one shaped by working memory constraints, revision, emotional state, domain expertise variation, and stylistic idiosyncrasy — that produces a statistically distinct distribution even when the semantic content is identical.
2. AI Text Detection: Technical Foundations
2.1 Perplexity-Based Detection
Perplexity is the foundational metric in machine-generated text detection. For a sequence of tokens t = (t₁, t₂, …, tₙ), perplexity under language model M is defined as:

PPLM(t) = exp( −(1/n) Σᵢ₌₁ⁿ log PM(tᵢ | t₁, …, tᵢ₋₁) )
Equivalently, perplexity is the exponentiated cross-entropy loss of the model on the sequence. A lower perplexity indicates the model assigns higher probability to the sequence — i.e., the sequence is highly predictable under the model’s learned distribution.
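Given per-token log-probabilities from a scoring model, the computation reduces to exponentiating the mean negative log-probability. A minimal sketch, where the `token_log_probs` input is assumed to be supplied by whatever model scores the sequence:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token log-probabilities log P_M(t_i | t_<i):
    the exponential of the mean negative log-probability (cross-entropy in nats)."""
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n
    return math.exp(avg_nll)
```

As a sanity check, a model that assigns uniform probability 1/|V| to every token yields PPL = |V|, the maximum for that vocabulary; a model that predicts every token with probability 1 yields PPL = 1.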
2.1.1 Why Machine-Generated Text Has Low Perplexity
LLMs are trained to minimize cross-entropy loss on human text, which causes them to approximate the human text distribution. However, during generation, they additionally apply sampling constraints (temperature, top-k/top-p) that tend to concentrate probability mass — effectively sampling from a sharpened version of the learned distribution.
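The sharpening effect can be made concrete: scaling logits by a temperature below 1 strictly reduces the Shannon entropy of the resulting softmax distribution, concentrating mass on high-probability tokens. A minimal illustration (the helper name is ours):

```python
import math

def softmax_entropy(logits, temperature=1.0):
    """Shannon entropy (in bits) of the temperature-scaled softmax distribution."""
    z = [zi / temperature for zi in logits]
    m = max(z)
    weights = [math.exp(zi - m) for zi in z]  # shifted softmax for stability
    total = sum(weights)
    p = [w / total for w in weights]
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)
```

For any non-uniform logit vector, entropy increases monotonically with temperature: T → 0 collapses to a deterministic (zero-entropy) argmax, while T → ∞ approaches the maximum-entropy uniform distribution.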
2.1.2 Perplexity Score Distribution
Empirically, well-calibrated detectors observe the following approximate perplexity distributions across corpora:
| Text Source | PPL Mean | PPL Std Dev | Typical Range |
|---|---|---|---|
| Human academic writing | 52.3 | 18.7 | 25 – 110 |
| Human blog / informal | 67.8 | 31.2 | 18 – 180 |
| GPT-4 (temp=0.7, top-p=0.9) | 14.6 | 6.2 | 7 – 35 |
| Claude 3 Sonnet | 16.9 | 7.8 | 8 – 42 |
| GPT-4 + humanizer pass | 38.4 | 14.1 | 18 – 85 |
| Human ESL writing | 21.3 | 9.4 | 10 – 55 |
The overlap between ESL human writing and GPT-4 output distributions (both centering around PPL ≈ 15–25) is the primary driver of false positive rates against non-native speakers. This is not a calibration failure — it reflects a genuine distributional overlap that cannot be resolved by adjusting detection thresholds.
2.1.3 Zero-Shot Perplexity Detectors
Zero-shot detectors require no training — they use the source model’s own probability estimates. The canonical approach (Mitchell et al., 2023; DetectGPT) computes the log-probability ratio between the original text and perturbations thereof:

S(t) = log PM(t) − (1/k) Σⱼ₌₁ᵏ log PM(t̃ⱼ)

where t̃₁, …, t̃ₖ are semantically similar perturbations of t, typically produced by a mask-filling model.
The intuition: machine-generated text tends to be near local maxima of the model’s log-probability function, while human-generated text is not. Perturbations of machine-generated text therefore tend to reduce log-probability, while perturbations of human text may increase or decrease it randomly.
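The scoring function itself is a few lines once the model interfaces are abstracted away. In this sketch, `log_prob` and `perturb` are assumed callables (a scoring model's sequence log-probability and a mask-fill perturbation model, respectively), not real APIs:

```python
def detectgpt_score(text, log_prob, perturb, k=20):
    """DetectGPT-style curvature score (Mitchell et al., 2023):
    log-probability of the original text minus the mean log-probability
    of k perturbed variants. Large positive scores suggest the text sits
    near a local maximum of the model's log-probability surface,
    i.e. is likely machine-generated."""
    perturbed = [perturb(text) for _ in range(k)]
    mean_perturbed = sum(log_prob(p) for p in perturbed) / k
    return log_prob(text) - mean_perturbed
```

In practice the score is thresholded (often after normalizing by the standard deviation of the perturbed log-probabilities), and k on the order of tens to hundreds of perturbations is needed for stable estimates.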
2.2 Burstiness Analysis
Burstiness, imported from information theory and network science, measures the variability of inter-event times in a point process. Applied to text, it characterizes variance in sentence length:

B = (σL − μL) / (σL + μL)

where μL is the mean sentence length (in tokens) and σL is its standard deviation. B ∈ [−1, 1]: B → −1 indicates perfectly regular sentence lengths, B = 0 indicates Poisson-like variation (σL = μL), and B → 1 indicates highly bursty, irregular patterns.
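The coefficient is straightforward to compute from a list of sentence lengths:

```python
import statistics

def burstiness(sentence_lengths):
    """Goh-Barabasi burstiness B = (sigma - mu) / (sigma + mu) over
    sentence lengths in tokens. B = -1 for perfectly regular lengths,
    B ~ 0 for Poisson-like variation, B > 0 for bursty sequences."""
    mu = statistics.mean(sentence_lengths)
    sigma = statistics.pstdev(sentence_lengths)  # population standard deviation
    return (sigma - mu) / (sigma + mu)
```

A document whose sentences are all the same length scores exactly −1; alternating very short and very long sentences pushes B toward positive values.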
Technical Note: Burstiness and the Renewal Process Model
Formally, sentence generation can be modeled as a renewal process where inter-renewal times correspond to sentence lengths. Human writing exhibits non-Poisson renewal dynamics with heavy tails — sentence lengths follow approximately log-normal or Weibull distributions rather than the near-exponential distribution produced by LLMs sampling under temperature constraints. Detectors that fit parametric distributions to sentence-length sequences rather than computing summary statistics are more robust to adversarial manipulation.
2.3 Model Fingerprinting and Stylometric Analysis
Beyond aggregate statistics, individual LLMs exhibit consistent stylometric fingerprints arising from their training data composition, RLHF fine-tuning objectives, and architectural differences.
2.3.1 Lexical Fingerprints
| Term | Observation |
|---|---|
| delve | Significantly overrepresented in Claude/GPT-4 output vs. human academic writing (7.3x frequency ratio) |
| certainly | Used as sentence-initial affirmation at rates >5x human baseline; strong GPT-3.5/4 marker |
| furthermore | Overused as paragraph-level connector; transition density is 2–4x human baseline |
| crucial | Appears in ~12% of GPT-4 paragraphs discussing importance vs. ~3% in human text |
| leverage (v.) | Business/technical writing marker; 3.8x overrepresentation in ChatGPT-generated content |
| tapestry | Metaphor overuse marker; specific to certain RLHF fine-tuning reward patterns |
Lexical fingerprints are fragile — they shift with each model version and can be defeated by explicit prompting. They are most useful as corroborating evidence rather than primary detection signals.
2.3.2 Syntactic Fingerprints
More robust than lexical fingerprints, syntactic patterns reflect the model’s learned preferences which are more stable across fine-tuning versions: passive voice rate (2–3x informal human writing), nominal phrase density, shallower center-embedding than human academic text, overuse of parallel three-item list structures, and hedge clustering at paragraph boundaries rather than distributed throughout.
2.3.3 Discourse-Level Fingerprints
The most robust fingerprints operate at the discourse level: topic sentence prominence (>94% first-position vs. ~73% in human text), paragraph length homogeneity, transition word density at 1.8x human baseline, and more consistent claim-to-evidence ratio throughout a document.
2.4 Logit Watermarking
The most technically robust detection approach is logit watermarking (Kirchenbauer et al., 2023), which embeds a statistically detectable signal into the generation process itself. At each generation step t, the vocabulary V is pseudorandomly partitioned into a ‘green’ list Gₜ and a ‘red’ list Rₜ using a hash of the preceding token as the seed:

Gₜ ⊂ V with |Gₜ| = γ|V|, Rₜ = V \ Gₜ, seeded by hash(tₜ₋₁)

During generation, logits for green-list tokens are boosted by a hardness parameter δ before softmax normalization:

ẑₖ = zₖ + δ if k ∈ Gₜ, ẑₖ = zₖ otherwise

Detection uses a one-sided z-test on the fraction of green tokens among the T tokens scored:

z = (nG − γT) / √(Tγ(1 − γ))

where nG is the observed count of green-list tokens; a z-score above a preset threshold indicates watermarked text.
Critical Property: Watermark Robustness
Logit watermarking is qualitatively different from statistical fingerprinting because it does not rely on distributional assumptions that can be defeated by humanization. The watermark is embedded in the token sequence itself. Defeating it without access to the hash function and partition key requires rewriting enough tokens to destroy the statistical signal — which, if done thoroughly, constitutes the humanization problem itself. Current humanization approaches that preserve semantic content cannot reliably defeat strong watermarking schemes. However, no major model provider implements this at scale as of Q1 2026.
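The scheme above can be illustrated end to end with a toy implementation. This is a sketch under stated assumptions: a SHA-256-seeded shuffle stands in for the actual pseudorandom partition, and the helper names (`green_list`, `watermark_logits`, `watermark_z_score`) and the `key` parameter are ours, not the paper's API:

```python
import hashlib
import math
import random

def green_list(prev_token, vocab_size, gamma=0.5, key="secret"):
    """Pseudorandom green-list partition seeded by a hash of the
    preceding token plus a private key (toy stand-in for the PRF)."""
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermark_logits(logits, prev_token, delta=2.0, gamma=0.5, key="secret"):
    """Boost green-list logits by the hardness parameter delta before softmax."""
    green = green_list(prev_token, len(logits), gamma, key)
    return [z + delta if i in green else z for i, z in enumerate(logits)]

def watermark_z_score(tokens, vocab_size, gamma=0.5, key="secret"):
    """One-sided z-test on the green-token count over T scored positions:
    z = (n_green - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    pairs = list(zip(tokens, tokens[1:]))
    n_green = sum(
        1 for prev, tok in pairs
        if tok in green_list(prev, vocab_size, gamma, key)
    )
    T = len(pairs)
    return (n_green - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
```

A sequence generated with the boosted logits accumulates far more green tokens than the γT expected by chance, producing a large z-score; without the key, a rewriter cannot tell which tokens are green and so cannot selectively remove the signal.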
3. AI Text Humanization: Technical Approaches
3.1 Taxonomy of Humanization Methods
| Approach | Processing Depth | Perplexity Impact | Burstiness Impact |
|---|---|---|---|
| Synonym substitution | Lexical (word-level) | Minimal — prior unchanged | None |
| Syntactic transformation | Syntactic (clause-level) | Moderate — some disruption | Slight improvement |
| Paraphrase model | Surface-semantic | Moderate — partial disruption | Moderate improvement |
| Full LLM rewrite | Semantic-pragmatic | High — new generative prior | Significant improvement |
3.2 Paraphrase-Based Approaches: Why They Fail
The fundamental failure mode can be stated formally. Let tAI be a machine-generated text and tpara = Paraphrase(tAI) be its paraphrase. For a scoring model M:

PPLM(tpara) ≈ PPLM(tAI) ≪ PPLM(thuman)
This holds because the paraphrase model is itself an LLM trained to minimize reconstruction loss on the same distribution, meaning its outputs are also low-perplexity under the scoring model. Semantic-preserving paraphrasing constrains the output to the same region of meaning-space.
3.3 Full Generative Rewriting: The Cross-Model Approach
Full generative rewriting with a different LLM is the only approach that fundamentally addresses the generative prior problem:
Cross-Model Rewriting Theorem (informal)
If model MA generates text tA and model MB (MB ≠ MA) rewrites it to produce tB while preserving semantic content, then PPLMA(tB) is determined by MB’s generative prior, not MA’s. Since MA and MB have different generative priors arising from different training data, architectures, and RLHF objectives, tB will have higher perplexity under MA than tA — potentially enough to defeat MA-based detectors.
3.3.2 The Role of Temperature and Sampling Parameters
3.3.3 Instruction Prompt Engineering for Humanization
4. Adversarial Dynamics: The Arms Race
4.1 The Detection-Humanization Game
Detection and humanization can be modeled as a two-player game. Let Dθ be a detector and Hφ a humanizer. The humanizer minimizes the detection probability that the detector maximizes, subject to the detector maintaining an acceptable false positive rate α on human text:

minφ maxθ 𝔼t∼M[ Dθ(Hφ(t)) ]  subject to  𝔼t∼H[ Dθ(t) ] ≤ α
In practice, neither player has access to the other’s parameters, so this is an incomplete information game. Humanizers optimize against observable detector behaviors; detectors are retrained as humanization strategies become known.
4.2 Current State of the Equilibrium (Q1 2026)
Perplexity-based detectors: Defeated by full model rewriting with a different-architecture model. Current full-rewrite humanizers tend to reduce detection rates to near-chance levels on well-formed prose.
Burstiness-based detectors: Partially defeated. Instruction-tuned humanizers that explicitly vary sentence length achieve B ∈ [0.15, 0.35], overlapping the human distribution substantially.
Stylometric fingerprint detectors: Defeated by negative prompting and full model rewriting, which replaces the source model’s fingerprint with the rewriting model’s.
Multi-feature ensemble classifiers: Partially defeated. Full rewrites address 60–80% of features but leave residual signal in discourse-level patterns.
Logit watermarking: Not currently defeated by any publicly available humanization tool. The watermark signal cannot be removed without access to the hash function.
4.3 False Positive Analysis
| False Positive Driver | Magnitude | Mechanism |
|---|---|---|
| Non-native English writing | High (10–20% FP) | Controlled vocabulary + formal register produces same distributional footprint as LLM output |
| Formal academic writing | Moderate (5–15%) | Academic style conventions match LLM stylometric fingerprints |
| Domain-specific technical text | Moderate (5–10%) | Constrained vocabulary reduces entropy; precise sentences reduce burstiness |
| Heavily edited prose | Low-Moderate (3–8%) | Multiple editing passes smooth stylistic irregularities |
| List-heavy documents | High within sections | Itemized content has near-zero burstiness |
| Very short texts (<200 words) | Very high (15–40%) | Insufficient statistical mass for reliable classification |
5. MultipleChat AI Humanizer: Technical Architecture
5.1 System Overview
The MultipleChat AI Humanizer implements a full cross-model rewriting pipeline with five architectural components: (1) source text preprocessing and segmentation, (2) model selection and prompt engineering, (3) cross-model generation with controlled sampling parameters, (4) post-generation quality verification, and (5) optional burstiness enhancement.
5.2 Model Selection and Comparative Characteristics
Claude’s RLHF fine-tuning tends to produce output with higher hedge density and more varied syntactic constructions, yielding the highest burstiness improvement (mean B improvement: +0.19). GPT-4 produces shorter, more syntactically regular sentences — advantageous for technical content but with lower burstiness gains (mean B improvement: +0.12).
Key architectural principle
The humanization improvement is maximized when the rewriting model and the detector’s scoring model are architecturally distant. This is because perplexity-based detection measures how predictable the text is under the detector’s model — and text generated by a different model family is inherently less predictable. MultipleChat’s support for four distinct model families allows selection of the maximally distant rewriter relative to the likely detector.
5.3 The MultipleChat AI Detector
The MultipleChat detector implements a multi-feature ensemble classifier trained on outputs from all four supported models. The feature set includes perplexity features (computed against multiple scoring models), burstiness features (sentence-level and paragraph-level), lexical features (hedge density, AI marker count, contraction rate, passive voice rate), discourse features (topic sentence position, transition density, paragraph length variance), and entropy features (token unigram and bigram Shannon entropy).
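A few of the lexical and entropy features can be sketched to make the feature set concrete. The hedge lexicon, feature names, and tokenization below are illustrative stand-ins, not MultipleChat's actual implementation:

```python
import math
import re
from collections import Counter

HEDGES = {"perhaps", "arguably", "likely", "generally", "typically"}  # illustrative lexicon

def lexical_features(text):
    """Sketch of three ensemble features: hedge density, contraction rate,
    and token unigram Shannon entropy (in bits)."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "hedge_density": sum(counts[h] for h in HEDGES) / n,
        "contraction_rate": sum(1 for t in tokens if "'" in t) / n,
        "unigram_entropy_bits": entropy,
    }
```

In an ensemble, such per-document scalars are concatenated with perplexity, burstiness, and discourse features into a single vector fed to the classifier.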
6. Failure Modes and Open Problems
6.1 Humanization Failure Modes
Semantic drift under aggressive humanization: High-temperature rewriting can produce semantic drift — the rewritten text preserves the general topic but loses specific factual claims or nuanced distinctions. The quality trade-off: lower detectability (higher perplexity under the source model) generally comes at the cost of lower semantic similarity sim(tAI, trewrite).
The MultipleChat humanizer addresses this through a semantic similarity gate (cosine similarity > 0.75 in embedding space) that triggers a retry with a more constrained prompt if drift is detected.
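The gate-and-retry control flow can be sketched as follows. Here `rewrite_fn(text, constrained)` and `embed_fn(text)` are assumed callables standing in for the rewriting model and the embedding model; they are illustrative, not MultipleChat's actual interfaces:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gated_rewrite(source, rewrite_fn, embed_fn, threshold=0.75, max_retries=2):
    """Semantic-similarity gate: accept a rewrite only if its embedding
    stays within `threshold` cosine similarity of the source; otherwise
    retry with a more constrained prompt."""
    candidate = rewrite_fn(source, constrained=False)
    for _ in range(max_retries):
        if cosine(embed_fn(source), embed_fn(candidate)) >= threshold:
            return candidate
        candidate = rewrite_fn(source, constrained=True)  # drift detected: retry
    return candidate  # best effort after exhausting retries
```

The design choice worth noting is that the gate compares embeddings of the full source and rewrite, so it catches global topic drift but not the substitution of individual facts — which is why fact verification remains a separate open problem.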
Hallucination in factual content: The rewriting model may substitute specific facts, dates, or statistics with plausible but incorrect alternatives. Factual content requires human verification post-humanization. The pipeline does not currently implement automatic fact verification — this is flagged as an open problem.
Register and formality drift: Without explicit register specification, the rewriting model defaults to its own preferred register. Source texts that are highly formal (legal, medical) or highly informal (marketing, social) may not match without explicit instruction.
6.2 Detection Failure Modes
Domain shift: Detectors trained on general-domain corpora exhibit significant accuracy degradation on specialized domains. A detector with 92% accuracy on general web text may drop to 65–70% on medical case reports or legal contracts.
Length effects: For texts below 200 words, statistical estimates have confidence intervals too wide for reliable classification. Below 100 words, most detectors operate near chance levels.
Compositional texts: Documents interleaving human and AI sections present a mixture distribution that defeats document-level classifiers. Segment-level detection partially addresses this but requires significantly more computation.
6.3 Open Problems
Semantics-preserving watermarking: A scheme that embeds watermarks at the semantic level would be resistant to paraphrase attacks while remaining detectable through semantic analysis.
Cross-lingual detection: Most detectors are trained and evaluated on English text. Performance on other languages is substantially degraded, creating significant equity concerns.
Collaborative authorship attribution: Reliable tools for quantifying the human vs. AI contribution fraction in collaboratively authored text do not yet exist.
Adversarial training for robustness: A theoretical framework for characterizing the game-theoretic equilibrium and predicting detection accuracy at equilibrium does not currently exist.
Formal verification of factual preservation: No current approach reliably verifies that humanized text preserves all factual content of the source — technically a natural language inference (NLI) problem at scale.
7. Conclusions
1. Perplexity-based detection is theoretically well-founded but practically fragile. The distributional overlap between ESL human writing and LLM output makes false positive rates unacceptably high for consequential decisions at any detection threshold that maintains reasonable sensitivity.
2. Paraphrase-based humanization does not address the generative prior. Synonym substitution and syntactic transformation preserve the low-perplexity signature of LLM output under the source model. Full cross-model rewriting is the only approach that changes the generating distribution.
3. Multi-feature ensemble detection is more robust than single-feature approaches. Combining perplexity, burstiness, stylometric features, and discourse-level analysis substantially outperforms any individual feature.
4. Logit watermarking is the only technically robust detection approach. However, it requires implementation at the model provider level and has not been deployed at scale as of Q1 2026.
5. The adversarial equilibrium favors humanization over detection in the short term. Without logit watermarking or equivalent cryptographic provenance mechanisms, detection scores should be treated as probabilistic signals requiring corroborating evidence, not as definitive verdicts.
References
Goh, K.-I., & Barabási, A.-L. (2008). Burstiness and memory in complex systems. EPL (Europhysics Letters), 81(4), 48002.
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. Proceedings of ICML 2023.
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., & Finn, C. (2023). DetectGPT: Zero-shot machine-generated text detection using probability curvature. Proceedings of ICML 2023.
Appendix A: Mathematical Notation Reference
| Symbol | Definition |
|---|---|
| PM(t) | Probability assigned to text t by language model M |
| PPLM(t) | Perplexity of text t under model M |
| H(X) | Shannon entropy of random variable X |
| KL(P||Q) | Kullback-Leibler divergence from Q to P |
| B | Burstiness coefficient (Goh & Barabási, 2008) |
| μL, σL | Mean and standard deviation of sentence length distribution |
| Gt, Rt | Green and red token lists at generation step t |
| γ | Green list partition ratio (typically 0.5) |
| δ | Logit boost parameter for watermark hardness |
| θ, φ | Detector and humanizer parameter vectors |
| S(t) | DetectGPT scoring function (log-probability ratio) |
| V | Vocabulary set of size |V| |
| z ∈ ℝ|V| | Logit vector before softmax normalization |
| sim(t₁, t₂) | Semantic similarity (cosine in embedding space) |
Appendix B: Empirical Benchmark Summary
Detection accuracy across humanization methods (GPTZero v4 as reference detector, n=500 texts per cell):
| Humanization Method | Detect Rate | FPR (Human) | Semantic Sim. | Burstiness B |
|---|---|---|---|---|
| No humanization (raw AI) | 91.3% | 8.2% | 1.00 | -0.07 |
| Synonym substitution | 87.4% | 8.5% | 0.97 | -0.05 |
| QuillBot (Creative mode) | 74.2% | 9.1% | 0.91 | 0.08 |
| Full rewrite (Claude) | 28.6% | 8.8% | 0.83 | 0.24 |
| Full rewrite (GPT-4) | 31.2% | 8.4% | 0.86 | 0.19 |
| MultipleChat Humanizer | 22.4% | 9.0% | 0.81 | 0.27 |
| Human baseline | 8.2% | — | — | 0.31 |
Note: All figures are approximations based on controlled experimental conditions. Real-world detection rates vary significantly by domain, text length, and detector version. The human baseline FPR of 8.2% represents the irreducible false positive rate of GPTZero v4 on human-authored academic text.