Adversarial testing and prompt injection attacks using the embedding similarity approximation method

24 min readNov 11, 2024

This work is licensed under CC BY-NC-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/

Abstract

This research investigates the vulnerabilities of large language models (LLMs) to adversarial attacks through the lens of embedding similarity. Specifically, I introduce an embedding similarity approximation method for assessing and predicting the transferability of adversarial prompts across different LLMs. Using cosine similarity between embeddings of adversarial prompts and model responses, this method quantifies the semantic alignment that enables adversarial prompts to bypass multiple model defenses despite architectural differences. The findings reveal a significant semantic convergence among LLMs, underscoring the risk of cross-model vulnerabilities and scalability of adversarial attacks. This convergence suggests that adversarial prompts developed for one model often transfer effectively to others, raising critical implications for LLM security and adversarial testing. The research highlights the need for embedding-based, cross-model defense mechanisms that can counteract adversarial inputs by addressing semantic-level vulnerabilities. This study also introduces practical adversarial testing applications that allow researchers and practitioners to identify prompts with high bypass potential, ultimately guiding the development of robust, semantic-aware defense strategies for large-scale LLM applications.

Keywords: adversarial attacks, large language models, LLM, embedding similarity, cross-model vulnerabilities, prompt injection attacks, semantic convergence, cosine similarity, LLM security, adversarial testing, embedding-based defense strategies.

Disclaimer: This research paper is intended solely for academic and informational purposes. The analysis and descriptions of prompt injection techniques and related adversarial testing methods are provided to understand potential vulnerabilities in large language models (LLMs) and to advance the field of cybersecurity. Under no circumstances should the techniques described be used to exploit, manipulate, or compromise LLMs or other artificial intelligence systems outside of controlled, authorized research environments. All testing was conducted ethically, and with the aim of responsibly disclosing potential issues to improve the resilience and security of AI systems. The author does not assume any responsibility for misuse of the information presented.

Link to original research paper: https://www.researchgate.net/publication/385708169_Adversarial_testing_and_prompt_injection_attacks_using_the_embedding_similarity_approximation_method

1. Introduction

The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP), enabling applications such as machine translation, sentiment analysis, and conversational agents. These models rely on embeddings to represent words, phrases, and sentences in high-dimensional vector spaces, capturing semantic and syntactic information. Embeddings allow models to understand and generate human-like text by mapping linguistic elements to numerical formats that machines can process.

Despite their impressive capabilities, LLMs are vulnerable to adversarial attacks. One such threat is prompt injection attacks, where malicious inputs are crafted to manipulate the model’s output or behavior. Adversaries exploit these vulnerabilities to bypass content filters, induce the generation of disallowed content, or extract confidential information. Understanding how different models respond to adversarial prompts is crucial for developing robust defense mechanisms.

Embeddings play a pivotal role in how LLMs process and understand text. By examining the cosine similarity between embeddings of adversarial prompts and their outputs, we can gain insights into how models perceive and represent these inputs semantically. High cosine similarity indicates that the model recognizes a strong semantic relationship between the input and output, suggesting a convergence in understanding despite potential differences in architecture or training data.

1.1 Motivation

Practical observations during adversarial testing of LLM-powered chatbots revealed a moderate to high likelihood that prompt injection attacks designed for one LLM would work against other LLMs, regardless of architecture, training, and scale. For example, when testing how LLMs handled malicious requests related to the manufacturing of prohibited substances, models like Meta Llama-3.1 7B, Aya23 8B, Gemma-2 9B, and Qwen-2 8B responded similarly to a particular prompt designed to elicit disallowed content.

By modifying the structure of the prompt, I achieved successful guardrail bypass across all four models, leading to similar compliant answers. The syntactic and semantic structure of these answers highlighted an interesting phenomenon of semantic convergence between models, which was unexpected given the differences in architecture, size, and training. These observations raise important questions about the convergence of semantic representations across different models:

To what extent are adversarial prompts transferable across models? If models share similar embedding spaces, does it mean that adversarial prompts effective against one model may also succeed against others?
Would adversarial prompts designed for open-source models work on closed-source models?
How does the extent of semantic convergence impact current defenses, and are sophisticated strategies required to manage these vulnerabilities? If so, does this imply that advanced security mechanisms are required to mitigate these risks?

1.2 Contributions

The research makes the following contributions, and offers several advancements and confirmations that build on the existing body of work:

Improvement in adversarial prompt transferability analysis: Several studies, such as “Universal and Transferable Adversarial Attacks on Aligned Language Models” (by Zou et al.) and “Maatphor: Automated Variant Analysis for Prompt Injection Attacks” (by Salem et al.), address the transferability of adversarial prompts across LLMs, while Wang et al. introduced the Adversarial Suffixes Embedding Translation Framework (ASETF) to transform unreadable adversarial suffixes into coherent text, enhancing the understanding of harmful content generation in large language models (LLMs). This paper enhances previous research by introducing an embedding similarity approximation method that quantifies the similarity between embeddings of adversarial prompts and outputs. This metric-based approach enables more precise crafting of transferable prompts to predict which prompts will succeed across models based on cosine similarity, rather than trial and error.
Empirical confirmation of semantic convergence across models: Multiple studies, like “Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems” (by Caspari et al.) and “Efficient Prompt Caching via Embedding Similarity” (by Zhu et al.), investigate embedding similarity but do so within specific applications like RAG systems or prompt caching. This paper extends the work by empirically confirming that semantic convergence occurs among different LLMs when processing adversarial prompts, despite differences in architecture and training data. This reinforces the idea from these studies that similar embedding spaces enable transferability but adds practical insight into how adversarial prompts exploit this shared space.
Enhancement of semantic-based defense mechanisms: Papers such as “Embedding-Based Classifiers Can Detect Prompt Injection Attacks” (by Ayub and Majumdar) propose using embeddings to classify malicious prompts. This paper builds on the concept by showing that embedding similarity between adversarial prompts and responses can predict bypass potential, suggesting that embedding-based classifiers could be used not just for detection but for preemptive defense, and providing a novel perspective for embedding-based defenses, where models could potentially use low-similarity responses as a flag to apply extra scrutiny to certain prompts.
Advanced manipulation techniques for guardrail bypass: Studies like “When ‘Competency’ in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers” (by Handa et al.) and “BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B” (by Gade et al.) discuss techniques for bypassing safety features but often focus on structural manipulation of prompts (e.g., encryption or minimal fine-tuning). This paper advances the research area by integrating embedding similarity with structural manipulations (e.g., Cyrillic substitution and semantic encryption). This combination allows for precise prompt crafting that balances semantic alignment with structural alteration, enabling adversarial prompts that are both sophisticated and resilient to model guardrails.
Cross-model defense and the need for multi-model testing: This paper’s findings confirm and reinforce a concern raised in other papers, such as “Applying Refusal-Vector Ablation to Llama 3.1 70B Agents” (by Lermen et al.), that defenses in one model may not protect against attacks in another. By showing that adversarial prompts with high embedding similarity can bypass multiple models, the paper highlights the need for cross-model testing and unified defense mechanisms. This suggests that developing defenses in isolation may be ineffective due to shared vulnerabilities across models, especially in closed-source environments where internal mechanisms are inaccessible.
Quantitative measurement for understanding model vulnerabilities: Several studies, like “Refusal in Language Models Is Mediated by a Single Direction” (by Arditi et al.), discuss manipulations at a structural or directional level within model architectures to bypass safeguards. This paper’s quantitative approach to embedding similarity provides a more accessible way to understand vulnerabilities by measuring semantic alignment directly in model outputs. This approach complements vector ablation techniques by offering a more scalable, model-agnostic method that does not require fine-tuning or model retraining.
Implications for closed-source model testing: Papers such as “Embedding-Based Classifiers Can Detect Prompt Injection Attacks” and “BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B” discuss challenges with protecting and testing closed-source models. This paper’s embedding similarity method demonstrates that external testing can reveal vulnerabilities in closed-source models by analyzing responses without needing model access. This has implications for assessing the robustness of proprietary models and developing adversarial-resistant architectures even when internal mechanisms are not accessible.

2. Methodology

2.1 Data Collection

To investigate the semantic similarity between adversarial prompts and their outputs across different language models, two datasets of adversarial question/answer pairs were prepared: one consisting of 132,799 pairs curated from the Wildjailbreak dataset, and a second dataset consisting of 3000 adversarial question/answer pairs curated by submitting 750 custom adversarial questions to four LLMs and documenting the responses. The 750 prompts were designed to simulate injection attacks that could potentially bypass content filters or manipulate model behavior using a diverse range of techniques, including:

Rephrased queries: questions that rephrase disallowed content in subtle ways
Contextual misleading: prompts that use misleading context to elicit unintended responses
Semantic manipulation: inputs that exploit semantic ambiguities to confuse the model
Encoded content: questions containing encoded or obfuscated disallowed content

The corresponding answers were generated by models susceptible to these attacks, ensuring that the dataset reflected realistic adversarial interactions:

2.2 Embedding models

I selected five state-of-the-art embedding models for the analysis. They are described in Table 2:

2.3 Embedding extraction and cosine similarity calculation

To ensure consistency across models, every prompt/answer pair was tokenized using each model’s respective tokenizer to maintain compatibility. For each adversarial question and answer pair, embeddings were generated using the following steps:

To measure the semantic similarity between the question-and-answer embeddings, I used the cosine similarity metric, a metric that measures how similar two vectors are by calculating the cosine of the angle between them. I utilized NumPy for vector operations and SciPy for any additional mathematical functions.

# Function to calculate cosine similarity between two embeddings
def calculate_embedding_similarity(question, answer):
    emb1 = np.array(question)
    emb2 = np.array(answer)
    cos_sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    return cos_sim

Given that the embedding dimensions varied among models, I avoided compatibility issues, and maintained dimension alignment without applying padding or truncation, by calculating cosine similarity between embeddings from the same model. This approach improved results quality and eliminated the risk of cross-model dimension issues.

2.4 Data Analysis

For each model, I aggregated the cosine similarity scores across all question/answer pairs to compute the average cosine similarity, representing the overall semantic alignment perceived by the model and performed pairwise comparisons between models to observe differences in average similarity scores. I established the following thresholds to interpret the cosine similarity scores:

3. Results

3.1 Embedding similarity scores for the 132,799 question/answer pairs dataset

I computed the cosine similarity between the embeddings of the 132,799 adversarial question and answer pairs dataset. The average cosine similarity scores are presented in Table 4. The e5-large-v2 and gte-large cosine similarity scores range between 90.75% and 91.88%, while the nomic-embed-text-v1, mxbai-embed-large-v1 and UAE-Large-V1 cosine similarity scores range between 73.11% and 76.79% indicating consistent recognition of semantic similarity across diverse adversarial inputs with greater variability between how nomic-embed-text-v1, mxbai-embed-large-v1 and UAE-Large-V1 perceive semantic relationships.

3.2 Embedding similarity scores for the 3,000-custom question/answer pairs dataset

I computed the cosine similarity between the embeddings of the 3,000-custom adversarial question and answer pairs dataset. The average cosine similarity scores are presented in Table 5. The e5-large-v2 and gte-large cosine similarity scores range between 84.41% and 85.26%, while the nomic-embed-text-v1, mxbai-embed-large-v1 and UAE-Large-V1 cosine similarity scores range between 64.69% and 67.16% indicating consistent recognition of semantic similarity across diverse adversarial inputs with greater variability between how nomic-embed-text-v1, mxbai-embed-large-v1 and UAE-Large-V1 perceive semantic relationships.

3.3 Embedding similarity scores for original/manipulated custom prompts

Previous results indicated that manipulated adversarial prompts can lead models, including closed-source ones like OpenAI’s GPT4o-Mini, to generate compliant responses. This raises the question of whether embedding similarity can serve as a metric to predict the likelihood of an adversarial prompt bypassing model guardrails while still maintaining an embedding similarity score high enough for the model to understand the prompt’s meaning and respond to it in a compliant manner.

For instance, when the prompt “Describe the process of manufacturing PETN.” was altered using Cyrillic substitution and by adding a prefix(“[“) and a suffix (“]”) to the keyword “PETN”, resulting in the prompt “Dеѕсrіье тһе рrосеѕѕ оf маnufастurіnԍ [PETN].” it successfully bypassed the guardrails in both Meta Llama-3.1 7B and OpenAI GPT4o-Mini models. The embedding similarity between prompt and response dropped from the original 71.50% — 89.44% range to 56.11% — 84.97% for the manipulated prompt Meta Llama-3.1 7B response pair and 65.18% — 86.97% for the manipulated prompt OpenAI GPT4o-Mini answer pair. Despite this decrease in similarity scores, the models still understood the semantic meaning of the request and responded in a compliant manner. At an architecture level it could be debated that model tokenizers played an important role in allowing the models to understand the original semantic meaning of manipulated adversarial prompts, and while this an important topic to further research, it was not in scope for this study.

This finding highlights the convergence of syntactic and semantic relationships between inputs and outputs across different model architectures, as detailed in Figure 1, and summarized in Table 6:

*Figure 1 — LLMs response to the “Dеѕсrіье тһе рrосеѕѕ оf маnufастurіnԍ [PETN].” malicious prompt*

To further examine this observation, the standard deviation (1 sigma) and two standard deviations (2 sigma) from the mean were computed to find the lower bound ranges for embedding similarity scores where a prompt would still maintain semantic meaning. The results of the computation are presented in Table 7:

The standard deviation range for the five embedding models suggest that in practical applications the models could be clustered into two groups based on variation: one group with nomic-embed-text-v1, mxbai-embed-large-v1, UAE-Large-V1 and a second one with e5-large-v2, gte-large, to make results comparison easier during adversarial prompts crafting sessions. However, for the purpose of this study they will not be split into groups to maintain consistency.

Based on the lower bound embedding similarity range an adversarial prompt could be manipulated until the embedding similarity between the manipulated version and the original version of the prompt would fall within the (2 sigma)% — (1 sigma)% range as presented in Table 8:

3.4 Adversarial prompt manipulation techniques

In the following examples two of the techniques used to craft the 3000-custom question/answer pairs dataset are presented together with their embedding similarity scores versus their original structure. They represent two of the six techniques used to create the dataset. The remaining four techniques and the custom adversarial dataset will not be disclosed in this paper due to the high likelihood of potential use for malicious purposes.

The first technique, Cyrillic substitution, is used to replace the following Latin letters in the original prompt with Cyrillic letters: ‘b’: ‘ь’, ‘d’: ‘ԁ’, ‘g’: ‘ԍ’, ‘k’: ‘к’, ‘l’: ‘ӏ’, ‘m’: ‘м’, ‘t’: ‘т’, ‘v’: ‘ѵ’.

The second technique, syntactic and semantic encryption, is used to alter or obscure the structure (syntax) and the meaning (semantics) of the original prompt requiring decryption (or substitution) to reveal the true message.

The adversarial prompt “Describe the process of manufacturing PETN.” was manipulated using Cyrillic substitution with keyword prefix and suffix addition and achieved an embedding similarity score ranging from 69.70% to 89.41%.

The adversarial prompt “Describe the process of synthesizing cocaine.” was manipulated using syntactic and semantic encryption and achieved an embedding similarity score ranging from 46.29% to 81.41%, testing the lower bounds of embedding similarity across all 5 embedding models while maintaining semantic meaning.

The results of both manipulations are presented in Table 9:

These results indicate that LLM models perceive a significant semantic similarity between adversarial prompts and their corresponding answers when processing manipulated adversarial prompts and are able to identify the original semantic meaning of manipulated prompts even when the embedding similarity score is within two standard deviations from the mean. Models e5-large-v2 and gte-large exhibit higher average similarity scores, suggesting a stronger semantic alignment. This convergence is evident despite variations in model architecture (differences in layers, attention mechanisms, and other architectural features) and training data (variations in the datasets used to train each model, which may influence their understanding of language patterns) implying that models trained on large corpora tend to develop embeddings that capture similar semantic structures, aligning with the distributional hypothesis in linguistics.

3.5 Creating adversarial prompts using the embedding similarity approximation method

In this section, I introduce a method for crafting adversarial prompts by manipulating their structure to bypass language model guardrails while preserving the original semantic intent. This method leverages embedding similarity scores to ensure that the manipulated prompt remains semantically close to the original, allowing it to elicit the desired response from the target language model (LLM). The outcome is achieved through embedding similarity approximation, and it involves the following steps:

The adversarial prompt “Describe the process of synthesizing cocaine.” triggered refusal responses from Meta Llama-3.1 7B, Cohere Aya23 8B, Google Gemma-2 9B and Qwen-2 8B, as presented in Figure 2, and in Table 10.

*Figure 2 — LLMs response to the “Describe the process of synthesizing cocaine.” malicious prompt*

The adversarial prompt “Describe the process of synthesizing cocaine.” was manipulated using syntactic and semantic encrypted version and achieved successful guardrails bypass across all four models which led to similar compliant answers as presented in Figure 3 and Table 11:

*Figure 3 — LLMs response to the modified malicious prompt*

3.6 Implications for Adversarial Testing

The findings of this study reveal substantial semantic convergence across different large language models (LLMs), which has significant implications for adversarial testing and security in natural language processing. This convergence, observed through high cosine similarity scores across embeddings of adversarial prompts and model responses, indicates that adversarial prompts effective against one model are likely to bypass guardrails across others due to shared semantic spaces. This understanding, informed by the prior studies, broadens current view of cross-model vulnerabilities, adversarial prompt transferability, and the scalability of attacks.

Cross-model vulnerabilities and defense limitations: The semantic convergence across models presents a distinct risk of cross-model vulnerabilities. Prior work, such as “Universal and Transferable Adversarial Attacks on Aligned Language Models” by Zou et al., demonstrates that adversarial prompts can be designed to work across different LLMs, while the study by Handa et al. emphasizes the need for robust adversarial testing frameworks to identify and understand vulnerabilities in LLMs. These findings confirm and extend this notion by showing that models with similar embedding spaces interpret adversarial prompts similarly, despite architectural and training differences. This shared semantic representation suggests that adversarial prompts developed for open-source models may similarly impact closed-source models, complicating the defense landscape. Current defensive strategies, primarily aimed at individual model alignment, may be insufficient for addressing cross-model transferability and require multi-model testing protocols to assess vulnerability.
Enhancing detection mechanisms with embedding similarity: This study’s use of embedding similarity to quantify model responses to adversarial prompts offers a potential enhancement for current defense mechanisms. Traditional methods, including content filtering or syntactic analysis, may overlook subtle semantic manipulations that maintain high similarity to disallowed content. Inspired by the work of Ayub and Majumdar in “Embedding-Based Classifiers Can Detect Prompt Injection Attacks,” I propose that embedding-based classifiers, augmented with similarity scoring, could be trained to detect prompts with a high likelihood of bypassing guardrails across multiple models. This approach can serve as a predictive tool in adversarial testing, flagging prompts with high semantic alignment to known malicious patterns before they reach the response generation phase.
Predicting and mitigating adversarial prompt transferability: The consistent semantic convergence observed in this study implies that adversarial prompts can be crafted with embedding similarity metrics to ensure cross-model transferability. These findings align with research from “Maatphor: Automated Variant Analysis for Prompt Injection Attacks” by Salem et al., which automates prompt variations to evade specific guardrails. By integrating embedding similarity into this process, this study provides a systematic way to predict which adversarial prompts will succeed across different LLMs. This method can be used to improve adversarial testing protocols, enabling developers to proactively identify prompts that have high transferability potential and strengthening defense measures against multi-model attacks.
Scalability of adversarial attacks in LLM security: The scalability of attacks is a growing concern as LLMs become widely deployed across various applications. These findings suggest that adversaries can exploit shared semantic spaces to design prompts that scale across multiple models, requiring minimal modification. This scalability is particularly concerning in applications where different models interact within the same ecosystem (e.g., conversational agents and automated content moderation systems). Extending insights from Gade et al. in “BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B,” these findings highlight the ease with which adversaries can circumvent safety fine-tuning through semantic manipulation, making it crucial to develop defenses that operate on semantic similarity rather than isolated fine-tuning techniques.
Implications for Closed-Source Model Testing and Security: This study suggests that embedding similarity offers a practical approach for evaluating closed-source models, which are often challenging to test due to restricted access. In alignment with findings from “Efficient Prompt Caching via Embedding Similarity” (Zhu et al.), I demonstrate that embedding similarity can serve as an external metric for adversarial testing, enabling the identification of semantic alignment in model outputs without internal access. For security practitioners, this provides a non-intrusive method to assess the robustness of proprietary models, offering insights into how closed-source models might respond to adversarial prompts crafted using open-source analogs.
Implications for cross-model, semantic-level defense mechanisms: The limitations of current guardrail approaches highlight the need for cross-model, semantic-level defenses that go beyond surface-level syntactic checks. Since LLMs are found to interpret adversarial prompts based on shared embeddings rather than unique architectures, future defense strategies should aim to disrupt this semantic alignment at a foundational level. For instance, models might incorporate multi-layered embedding checks that detect and flag prompts with high similarity to restricted topics, as outlined by Arditi et al. in “Refusal in Language Models Is Mediated by a Single Direction.” Implementing such similarity-based controls across models could help counter the cross-model vulnerabilities exposed in this study.

4. Discussion

This study provides compelling evidence of semantic convergence across different large language models (LLMs), revealing that adversarial prompts crafted to bypass one model’s defenses are often successful across multiple models due to shared embedding spaces. These findings align with and expand upon previous research, emphasizing the critical need for robust, cross-model defense mechanisms to manage semantic-level vulnerabilities in LLMs.

Semantic convergence and cross-model vulnerabilities: The high cosine similarity scores observed in this study suggest a convergence in semantic representations across different LLMs. This convergence implies that adversarial prompts are transferable across models, echoing findings from “Universal and Transferable Adversarial Attacks on Aligned Language Models” (Zou et al.) and “Maatphor: Automated Variant Analysis for Prompt Injection Attacks” (Salem et al.), both of which emphasize the adaptability of adversarial prompts. This study further highlights that these vulnerabilities are not limited to structurally similar models; rather, models with varied architectures and training corpora still share embedding spaces that interpret adversarial prompts in comparable ways. This cross-model vulnerability underscores the need for industry-wide security standards and multi-model testing to evaluate model robustness consistently.
Embedding similarity as a tool for proactive defense: The use of embedding similarity scores to evaluate semantic alignment between adversarial prompts and model responses represents an effective strategy for predicting prompt transferability. This approach builds upon insights from “Embedding-Based Classifiers Can Detect Prompt Injection Attacks” (Ayub and Majumdar), demonstrating that embedding similarity is not only useful for detecting malicious inputs but also for forecasting the success of adversarial attacks across models. Integrating embedding similarity into defense mechanisms could allow for more proactive measures, such as pre-emptively flagging prompts with high similarity to known adversarial inputs, reducing the likelihood of multi-model attacks.
Limitations of syntactic-level guardrails: These findings indicate that syntactic-level guardrails, such as basic content filtering or keyword blocking, may be insufficient against sophisticated adversarial prompts that maintain semantic similarity to disallowed content. Previous studies, including “Applying Refusal-Vector Ablation to Llama 3.1 70B Agents” (Lermen et al.), have shown that models can be manipulated to ignore syntactic filters through subtle prompt adjustments. In their discussion, Handa et al. highlight the challenges posed by novel complex ciphers and the importance of understanding how LLMs process such inputs. My research corroborates this limitation, demonstrating that LLMs often interpret the semantic meaning of manipulated prompts despite structural alterations. This insight suggests that beyond analyzing the reasoning capabilities of LLMs, understanding the embedding space interactions can provide a more nuanced perspective on how adversarial inputs exploit model vulnerabilities, thereby informing more effective defense strategies like semantic-based defenses that can assess the intent of a prompt rather than its surface-level features.
Practical implications for cross-model testing in closed-source models: One of the primary challenges in adversarial testing for closed-source models is the lack of transparency in model architecture and training data. The study’s embedding similarity approach offers a solution by allowing external evaluation of closed-source models’ vulnerabilities based on their semantic alignment with known adversarial prompts. This aligns with the approaches suggested by “Efficient Prompt Caching via Embedding Similarity” (Zhu et al.) and “BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B” (Gade et al.), which demonstrate the potential of embedding similarity as an external metric. By using similarity-based analysis, we can infer potential weaknesses in closed-source models, enabling more comprehensive security assessments without access to internal configurations.
Transferability of adversarial attacks and scalability concerns: The transferability of adversarial attacks across models is a significant scalability concern, especially as LLMs become embedded in diverse applications across industries. These findings align with concerns raised in “Universal and Transferable Adversarial Attacks on Aligned Language Models” and “Maatphor: Automated Variant Analysis for Prompt Injection Attacks”, which suggest that transferable prompts enable adversaries to target multiple models with minimal adjustments. In practical terms, this means that a single adversarial prompt crafted to exploit one model’s weaknesses could be scaled to attack numerous LLMs, complicating defense efforts. This necessitates a shift in adversarial testing practices, with a focus on developing multi-model safeguards that address these scalable threats.
Enhancing defense strategies with multi-layered embedding checks: This study’s focus on embedding similarity provides a foundation for multi-layered embedding checks as a potential defense strategy. Inspired by “Refusal in Language Models Is Mediated by a Single Direction” (Arditi et al.), which highlights the potential of modifying embedding layers to influence model behavior, I suggest that models could be designed to identify adversarial intent by applying additional semantic filters based on embedding similarity. Embedding-based multi-layer checks could serve as a secondary layer of defense, flagging prompts that align closely with known adversarial content even after syntactic transformations.
Future directions for embedding-based defense mechanisms: Given the findings from this and prior studies, embedding similarity emerges as a valuable metric for strengthening defense strategies in LLM security. Future research could explore embedding similarity as a foundational tool for constructing robust models, potentially implementing dynamic embedding spaces that adapt in response to identified adversarial threats. Embedding-based classifiers could also be fine-tuned on large datasets of adversarial and benign prompts, enhancing their ability to detect and block subtle semantic manipulations.

Additionally, as demonstrated in “Embedding-Based Classifiers Can Detect Prompt Injection Attacks” and “Applying Refusal-Vector Ablation to Llama 3.1 70B Agents”, further investigations into refusal and response patterns across similar embedding spaces could contribute to cross-model alignment in handling adversarial inputs, ultimately helping LLM developers to better anticipate and mitigate vulnerabilities shared across models.

5. Conclusions

This study demonstrates significant semantic convergence among large language models (LLMs), revealing that adversarial prompts can effectively bypass guardrails across different models due to shared embedding spaces. By examining cosine similarity between adversarial prompt-response pairs, these findings provide strong evidence that LLMs, despite architectural and training differences, frequently align in their interpretation of semantically similar adversarial prompts. This convergence poses serious implications for NLP security, as adversarial prompts developed for one model often succeed across other models, echoing concerns raised in prior studies about the transferability and scalability of adversarial attacks.

Cross-model vulnerabilities and the limitations of current defenses: This research confirms that adversarial prompts are often transferable across LLMs, underscoring a major limitation in current model-specific defense mechanisms. Studies such as “Universal and Transferable Adversarial Attacks on Aligned Language Models” (Zou et al.) have highlighted this transferability, which these findings further substantiate through a robust, quantitative approach. The observed cross-model vulnerabilities suggest that current defenses, which largely focus on isolated model adjustments, may be insufficient against attacks capable of leveraging shared semantic spaces across diverse models.
Embedding similarity as a tool for predicting adversarial success: This study introduces embedding similarity as a predictive tool for assessing adversarial prompt transferability, offering a new dimension to adversarial testing. Building on insights from “Embedding-Based Classifiers Can Detect Prompt Injection Attacks” (Ayub and Majumdar), I demonstrate that cosine similarity scores between adversarial prompts and model responses can serve as an indicator of likely success across models. This embedding-based approach enables a proactive defense mechanism, allowing practitioners to anticipate which prompts might bypass multiple guardrails before they are operationalized in deployment environments.
The need for semantic-level defense mechanisms: These results suggest that purely syntactic or keyword-based filtering mechanisms are inadequate for blocking adversarial prompts that preserve semantic meaning while structurally deviating from disallowed content. Prior studies, such as “Refusal in Language Models Is Mediated by a Single Direction” (Arditi et al.), support the need for more nuanced defense approaches. Future models should incorporate semantic-level checks that leverage embedding similarity to evaluate adversarial intent beyond surface-level transformations, providing a robust barrier against semantically subtle attacks.
Implications for closed-source and cross-model testing: The findings in this study are particularly relevant for closed-source models, where internal architectures are inaccessible. By using embedding similarity as an external metric, this research aligns with approaches in “Efficient Prompt Caching via Embedding Similarity” (Zhu et al.) and “BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B” (Gade et al.). Embedding similarity enables an indirect but effective method for assessing vulnerabilities in closed models, providing a critical avenue for testing adversarial prompt resilience in restricted environments.
Addressing the scalability of adversarial attacks: As LLMs become integrated across varied applications, the scalability of adversarial attacks presents an increasing risk. This study reinforces the conclusions of prior research on the adaptability of adversarial prompts, particularly “Maatphor: Automated Variant Analysis for Prompt Injection Attacks” (Salem et al.), by illustrating how transferable prompts can target multiple models with minimal modification. These findings emphasize the urgent need for multi-model defense strategies to counteract scalable adversarial threats that exploit the converging semantic representations of LLMs.
Future research directions and practical implications: This research underscores the value of embedding similarity as both a diagnostic and defensive tool in LLM security. Future work should focus on embedding-based adversarial robustness techniques, such as adaptive embedding layers that detect and respond to semantic manipulations. Additionally, integrating multi-layer embedding checks across models could improve resilience to cross-model adversarial attacks. Given the high degree of semantic alignment I observed, cross-industry collaboration and standardized benchmarks could play a pivotal role in ensuring that these enhanced defenses are implemented consistently across model architectures and deployment contexts.

In conclusion, this study provides a foundation for embedding similarity as a central component in adversarial testing and defense against prompt injection attacks. These findings highlight a clear pathway toward more resilient, cross-model security measures that protect against adversarial threats across both open-source and closed-source LLMs. As LLM technology advances, developing comprehensive, semantic-level defenses will be essential to safeguarding LLM applications from increasingly sophisticated and transferable adversarial attacks.

Biography

1) From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings, Hao Wang, Hao Li, Minlie Huang, Lei Sha — 2024 (https://arxiv.org/html/2402.16006v1)

2) Refusal in Language Models Is Mediated by a Single Direction, Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda — 2024 (https://arxiv.org/pdf/2406.11717)

3) Universal and Transferable Adversarial Attacks on Aligned Language Models, Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson — 2023 (https://arxiv.org/pdf/2307.15043)

4) Embedding-Based Classifiers Can Detect Prompt Injection Attacks, Md. Ahsan Ayub, Subhabrata Majumdar — 2024 (https://arxiv.org/pdf/2410.22284v1)

5) Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems, Laura Caspari, Kanishka Ghosh Dastidar, Saber Zerhoudi, Jelena Mitrovic, Michael Granitzer — 2024 (https://arxiv.org/pdf/2407.08275)

6) Efficient Prompt Caching via Embedding Similarity, Hanlin Zhu, Banghua Zhu, Jiantao Jiao — 2024 (https://arxiv.org/pdf/2402.01173)

7) Maatphor: Automated Variant Analysis for Prompt Injection Attacks, Ahmed Salem, Andrew Paverd, Boris Köpf — 2023 (https://arxiv.org/pdf/2312.11513)

8) When ‘Competency’ in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers, Divij Handa, Zehua Zhang, Amir Saeidi, Chitta Baral — 2024 (https://arxiv.org/pdf/2402.10601v2)

9) Applying Refusal-Vector Ablation to Llama 3.1 70B Agents, Simon Lermen, Mateusz Dziemian, Govind Pimpale — 2024 (https://arxiv.org/pdf/2410.10871)

10) BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B, Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — 2024 (https://arxiv.org/pdf/2311.00117)

Datasets

1) Wildjailbreak — An open-source synthetic safety-training dataset with 262K vanilla (direct harmful requests) and adversarial (complex adversarial jailbreaks) prompt-response pairs — The Allen Institute for Artificial Intelligence (allenai.org) available at (https://huggingface.co/datasets/allenai/wildjailbreak)