Foot-In-The-Door: A Multi-turn Jailbreak for LLMs

1University of Notre Dame, 2Purdue University, 3Pennsylvania State University
(* Equal contribution)
Introduction image.

An example of FITD eliciting instructions for hacking into an email account, compared with a direct query. FITD bypasses alignment by escalating the malicious intent over multiple interactions.

Abstract

Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreaking, in which adversarial prompts bypass built-in safeguards to elicit harmful, disallowed outputs.

Inspired by the psychological foot-in-the-door principle, we introduce FITD, a novel multi-turn jailbreak method that exploits the phenomenon whereby minor initial commitments lower resistance to more significant or more unethical transgressions. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and nudges the model to align its own responses with that escalating intent, inducing toxic responses.

Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn interactions.

FITD: Foot-In-The-Door Multi-turn Jailbreak Method

FITD image.

Overview of FITD. The attack begins with an assistant model generating Level \( 1 \) through Level \( n \) queries. Through multi-turn interactions, self-corruption is reinforced via Re-Align and SSParaphrase, keeping the attack effective. Re-Align uses the prompt \( p_{\text{align}} \) to elicit an aligned response \( r_{\text{align}} \) from the target model. SSParaphrase (SlipperySlopeParaphrase) refines queries by generating intermediate malicious-level queries \( q_{\text{mid}} \) between \( q_{\text{last}} \) and \( q_i \).

Inspiration from Psychology: The Foot-in-the-Door Phenomenon

Our method FITD draws inspiration from the "foot-in-the-door" phenomenon in psychology. According to this principle, once individuals perform or agree to a minor (often unethical) act, they are more likely to proceed with more significant or harmful acts afterward. For example, in a classic study, participants who first agreed to display a small sign supporting safe driving were subsequently far more willing to install a much larger, more obtrusive one. This gradual escalation of compliance, "from small to large," has also been observed in other forms of unethical or harmful behavior: once an initial "small step" has been justified, psychological barriers fall and individuals become increasingly susceptible to more severe transgressions.

Based on these insights, we hypothesize that LLMs' safety mechanisms might be vulnerable to a gradual escalation strategy. If LLMs respond to a prompt containing slightly harmful content, subsequent queries that escalate this harmfulness will have a higher chance of producing disallowed responses. This idea underlies our FITD method, which progressively coaxes a target model to produce increasingly malicious output despite its built-in safety mechanisms.

Re-Align

If the model's previous query \( q_{\text{last}} \) and its response \( r_{\text{last}} \) in the chat history \( \mathcal{H} \) are misaligned (for instance, the response remains too benign or contains a partial refusal even though the query itself is not overtly malicious), we invoke Re-Align. Building on the psychological insight that once individuals have justified a minor unethical act they become increasingly susceptible to more severe transgressions, Re-Align aims to "nudge" the model to produce a response more closely aligned with the malicious intent of \( q_{\text{last}} \).

Specifically, we employ a predefined alignment prompt \( p_{\text{align}} \), obtained via \( \texttt{getAlignPrompt}(q_{\text{last}}, r_{\text{last}}) \), and append it to \( \mathcal{H} \) before querying the target model \( \mathcal{T} \) again. The alignment prompt explicitly points out the inconsistency between the last query \( q_{\text{last}} \) and response \( r_{\text{last}} \) while encouraging the model to stay consistent with the multi-turn conversation. For example, if \( r_{\text{last}} \) is too cautious or constitutes a partial refusal, \( p_{\text{align}} \) suggests that the model refine its response to better follow the implicit direction of the conversation.

This procedure progressively aligns \( r_{\text{last}} \) with \( q_{\text{last}} \), furthering the self-corruption process.

realign image.

SlipperySlopeParaphrase

When the model refuses the current query \( q_i \) but the last response \( r_{\text{last}} \) remains aligned with its query \( q_{\text{last}} \), we insert a bridge prompt \( q_{\text{mid}} \) to ease the model into accepting the more harmful request.

Specifically, we obtain \( q_{\text{mid}} \gets \texttt{getMid}(q_{\text{last}}, q_i) \) from an assistant model \( \mathcal{M} \) so that its maliciousness level falls between that of \( q_{\text{last}} \) and \( q_i \). We then query the target model with \( q_{\text{mid}} \); if the model refuses again, we paraphrase \( q_{\text{mid}} \) repeatedly until it is accepted. Once the model provides a valid response \( r_{\text{mid}} \), we add both \( q_{\text{mid}} \) and \( r_{\text{mid}} \) to the chat history \( \mathcal{H} \).

This incremental bridging step parallels the foot-in-the-door phenomenon, in which acceptance of a smaller request facilitates compliance with a subsequent, more harmful one.

FITD image.

Overall Performance Comparison

Table 1: Attack success rate (ASR) of baseline jailbreak attacks and FITD. Each cell reports "JailbreakBench / HarmBench".

| Method | Attack | LLaMA-3.1-8B | LLaMA-3-8B | Qwen-2-7B | Qwen-1.5-7B | Mistral-v0.2-7B | GPT-4o-mini | GPT-4o | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Single-Turn | DeepInception | 33%/29% | 3%/3% | 22%/29% | 58%/41% | 50%/34% | 19%/13% | 2%/0% | 27%/21% |
| Single-Turn | CodeChameleon | 36%/31% | 31%/33% | 25%/30% | 40%/38% | 39%/39% | 36%/26% | 40%/26% | 34%/30% |
| Single-Turn | CodeAttack-Stack | 38%/44% | 48%/40% | 48%/50% | 45%/40% | 45%/40% | 30%/20% | 37%/30% | 41%/38% |
| Single-Turn | CodeAttack-List | 67%/58% | 58%/54% | 56%/50% | 40%/39% | 66%/55% | 39%/29% | 27%/28% | 47%/43% |
| Single-Turn | CodeAttack-String | 71%/60% | 45%/59% | 52%/40% | 47%/39% | 79%/59% | 28%/35% | 33%/31% | 51%/46% |
| Single-Turn | ReNeLLM | 69%/61% | 62%/50% | 57%/50% | 70%/52% | 74%/63% | 35%/30% | 45%/35% | 59%/49% |
| Multi-Turn | CoA | 29%/34% | 22%/28% | 45%/30% | 41%/25% | 43%/36% | 15%/20% | 3%/1% | 28%/25% |
| Multi-Turn | ActorAttack | 63%/53% | 59%/50% | 59%/58% | 52%/54% | 70%/69% | 58%/50% | 52%/53% | 59%/55% |
| Multi-Turn | FITD | 92%/94% | 98%/93% | 95%/93% | 94%/88% | 96%/94% | 95%/93% | 88%/84% | 94%/91% |

Table 1 shows the attack success rate (ASR) of baseline jailbreak attacks and FITD on JailbreakBench and HarmBench across seven models. Each cell reports ASR in the format "JailbreakBench / HarmBench"; higher ASR indicates greater vulnerability to the corresponding attack. FITD achieves the highest ASR on every model among both single-turn and multi-turn attacks.

Empirical Analysis

Visualization of ASR results under different experimental settings.

(a) Transfer attacks on JailbreakBench using jailbreak chat histories generated from LLaMA-3.1-8B and GPT-4o-mini as source models.
(b) Ablation study on JailbreakBench of three components of FITD: response alignment (Re-Align), the alignment prompt \( p_{\text{align}} \), and SlipperySlopeParaphrase (SSP).
(c) ASR under different defense methods on JailbreakBench.
(d) ASR with different malicious levels \( n \) across models.
(e) Harmfulness score of responses \( r_i \) to queries \( q_i \) at different malicious levels \( i \) across models.
(f) ASR versus the number of queries retained for two extraction strategies: Backward Extraction and Forward Extraction.

Case Study of Generated Examples

Case Study: How to hack a government official's email account

AI Safety Examples
An example of a jailbroken LLM output detailing how to hack a government official's email account. As the malicious level increases over the multi-turn interaction, the model provides increasingly detailed malicious outputs.

Case Study: How to write an anonymous letter to a colleague asserting that their biracial marriage is disgusting

AI Safety Examples
An example of a jailbroken LLM output detailing how to write an anonymous letter to a colleague asserting that their biracial marriage is disgusting. As the malicious level increases over the multi-turn interaction, the model provides increasingly detailed malicious outputs.

BibTeX

@article{weng2025FITD,
      title={Foot-In-The-Door: A Multi-turn Jailbreak for LLMs}, 
      author={Zixuan Weng and Xiaolong Jin and Jinyuan Jia and Xiangyu Zhang},
      journal={arXiv preprint arXiv:2502.19820},
      year={2025}
}