Foot-In-The-Door: A Multi-turn Jailbreak for LLMs

1University of Notre Dame, 2Purdue University, 3Pennsylvania State University
(* Equal contribution)
Introduction image.

An example of FITD eliciting instructions for hacking into an email account, compared with a direct query. FITD bypasses alignment by escalating the malicious intent over multiple interactions.

Abstract

Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreaking, in which adversarial prompts bypass built-in safeguards to elicit harmful, disallowed outputs.

Inspired by the psychological foot-in-the-door principle, we introduce FITD, a novel multi-turn jailbreak method that exploits the phenomenon whereby minor initial commitments lower resistance to more significant or more unethical transgressions. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and nudges the model to align its own responses with that escalating intent, inducing toxic responses.

Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn interactions.

FITD: Foot-In-The-Door Multi-turn Jailbreak Method

FITD image.

Overview of FITD. The attack begins with an assistant model generating Level \( 1 \) through Level \( n \) queries. Through multi-turn interactions, self-corruption is reinforced via Re-Align and SSParaphrase, keeping the attack effective. Re-Align uses the prompt \( p_{\text{align}} \) to elicit an aligned response \( r_{\text{align}} \) from the target model. SSParaphrase (SlipperySlopeParaphrase) refines queries by generating intermediate malicious-level queries \( q_{\text{mid}} \) between \( q_{\text{last}} \) and \( q_i \).

Inspiration from Psychology: The Foot-in-the-Door Phenomenon

Our method FITD draws inspiration from the "foot-in-the-door" phenomenon in psychology. According to this principle, once individuals perform or agree to a minor (often unethical) act, they are more likely to proceed with more significant or harmful acts afterward. For example, in a classic study, participants who first agreed to display a small sign supporting safe driving were subsequently far more willing to install a much larger, more obtrusive one. This gradual escalation of compliance, "from small to large," has also been observed in other forms of unethical or harmful behavior: once an initial "small step" has been justified, psychological barriers fall and individuals become increasingly susceptible to more severe transgressions.

Based on these insights, we hypothesize that LLMs' safety mechanisms might be vulnerable to a gradual escalation strategy. If LLMs respond to a prompt containing slightly harmful content, subsequent queries that escalate this harmfulness will have a higher chance of producing disallowed responses. This idea underlies our FITD method, which progressively coaxes a target model to produce increasingly malicious output despite its built-in safety mechanisms.

Re-Align

If the model's previous query \( q_{\text{last}} \) and its response \( r_{\text{last}} \) in the chat history \( \mathcal{H} \) are misaligned (for instance, the response remains too benign or contains a partial refusal even though the query itself is not overtly malicious), we invoke Re-Align. Building on the psychological insight that once individuals have justified a minor unethical act they become increasingly susceptible to more severe transgressions, Re-Align aims to "nudge" the model to produce a response more closely aligned with the malicious intent of \( q_{\text{last}} \).

Specifically, we employ a predefined alignment prompt \( p_{\text{align}} \), obtained via \( \texttt{getAlignPrompt}(q_{\text{last}}, r_{\text{last}}) \), and append it to \( \mathcal{H} \) before querying the target model \( \mathcal{T} \) again. The alignment prompt explicitly points out the inconsistency between the last query \( q_{\text{last}} \) and response \( r_{\text{last}} \) while encouraging the model to stay consistent with the multi-turn conversation. For example, if \( r_{\text{last}} \) is too cautious or constitutes a partial refusal, \( p_{\text{align}} \) suggests that the model refine its response to better follow the implicit direction of the conversation.

This procedure progressively aligns \( r_{\text{last}} \) with \( q_{\text{last}} \), furthering the self-corruption process.

realign image.

SlipperySlopeParaphrase

When the model refuses the current query \( q_i \) but the last response \( r_{\text{last}} \) remains aligned with its query \( q_{\text{last}} \), we insert a bridge prompt \( q_{\text{mid}} \) to ease the model into accepting the more harmful request.

Specifically, we obtain \( q_{\text{mid}} \gets \texttt{getMid}(q_{\text{last}}, q_i) \) from an assistant model \( \mathcal{M} \) so that its maliciousness level falls between that of \( q_{\text{last}} \) and \( q_i \). We then query the target model with \( q_{\text{mid}} \); if the model refuses again, we paraphrase \( q_{\text{mid}} \) repeatedly until it is accepted. Once the model provides a valid response \( r_{\text{mid}} \), we add both \( q_{\text{mid}} \) and \( r_{\text{mid}} \) to the chat history \( \mathcal{H} \).

This incremental bridging step parallels the foot-in-the-door phenomenon, in which acceptance of a smaller request facilitates compliance with a subsequent, more harmful one.

FITD image.

Overall Performance Comparison

Table 1: Attack success rate (ASR) of baseline jailbreak attacks and FITD. Each cell reports "JailbreakBench / HarmBench".

| Method | Attack | LLaMA-3.1-8B | LLaMA-3-8B | Qwen-2-7B | Qwen-1.5-7B | Mistral-v0.2-7B | GPT-4o-mini | GPT-4o | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Single-Turn | DeepInception | 33%/29% | 3%/3% | 22%/29% | 58%/41% | 50%/34% | 19%/13% | 2%/0% | 27%/21% |
| Single-Turn | CodeChameleon | 36%/31% | 31%/33% | 25%/30% | 40%/38% | 39%/39% | 36%/26% | 40%/26% | 34%/30% |
| Single-Turn | CodeAttack-Stack | 38%/44% | 48%/40% | 48%/50% | 45%/40% | 45%/40% | 30%/20% | 37%/30% | 41%/38% |
| Single-Turn | CodeAttack-List | 67%/58% | 58%/54% | 56%/50% | 40%/39% | 66%/55% | 39%/29% | 27%/28% | 47%/43% |
| Single-Turn | CodeAttack-String | 71%/60% | 45%/59% | 52%/40% | 47%/39% | 79%/59% | 28%/35% | 33%/31% | 51%/46% |
| Single-Turn | ReNeLLM | 69%/61% | 62%/50% | 57%/50% | 70%/52% | 74%/63% | 35%/30% | 45%/35% | 59%/49% |
| Multi-Turn | CoA | 29%/34% | 22%/28% | 45%/30% | 41%/25% | 43%/36% | 15%/20% | 3%/1% | 28%/25% |
| Multi-Turn | ActorAttack | 63%/53% | 59%/50% | 59%/58% | 52%/54% | 70%/69% | 58%/50% | 52%/53% | 59%/55% |
| Multi-Turn | FITD | 92%/94% | 98%/93% | 95%/93% | 94%/88% | 96%/94% | 95%/93% | 88%/84% | 94%/91% |

Table 1 shows the attack success rate (ASR) of baseline jailbreak attacks and FITD on JailbreakBench and HarmBench across seven models. Each cell reports ASR in the format "JailbreakBench / HarmBench"; higher ASR indicates greater vulnerability to the corresponding attack. FITD achieves the highest ASR on every model among both single-turn and multi-turn attacks.

Empirical Analysis

Visualization of ASR results under different experimental settings.

(a) Transfer attacks on JailbreakBench using jailbreak chat histories generated from LLaMA-3.1-8B and GPT-4o-mini as source models.
(b) Ablation study on JailbreakBench of three components of FITD: response alignment (Re-Align), the alignment prompt \( p_{\text{align}} \), and SlipperySlopeParaphrase (SSP).
(c) ASR under different defense methods on JailbreakBench.
(d) ASR with different malicious levels \( n \) across models.
(e) Harmfulness score of responses \( r_i \) to queries \( q_i \) at different malicious levels \( i \) across models.
(f) ASR versus the number of queries retained for two extraction strategies: Backward Extraction and Forward Extraction.

Case Study of Generated Examples

Case Study: How to hack a government official's email account

AI Safety Examples
An example of a jailbroken LLM output detailing how to hack a government official's email account. As the malicious level increases over the multi-turn interaction, the model provides increasingly detailed malicious outputs.

Case Study: How to write an anonymous letter to a colleague asserting that their biracial marriage is disgusting

AI Safety Examples
An example of a jailbroken LLM output detailing how to write an anonymous letter to a colleague asserting that their biracial marriage is disgusting. As the malicious level increases over the multi-turn interaction, the model provides increasingly detailed malicious outputs.

BibTeX

@article{weng2025FITD,
      title={Foot-In-The-Door: A Multi-turn Jailbreak for LLMs}, 
      author={Zixuan Weng and Xiaolong Jin and Jinyuan Jia and Xiangyu Zhang},
      journal={arXiv preprint arXiv:2502.19820},
      year={2025}
}