Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding

Abstract

Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection, and content moderation. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating the existing anomaly detection methods on text data limits rigorous comparison and development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMA-2, LLaMA-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strongly low-rank characteristics in cross-model performance matrices, which enable an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit that includes all embeddings from different models and code at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, this work provides a foundation for future research in robust and scalable text anomaly detection systems.

AI Interpretation

This analysis walks through the paper "Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding", focusing on its methods and technical details and explaining them in plain language.

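1. Core Method and Contributions

* Core methodology:

At its core, the paper builds a benchmark framework for text anomaly detection (Text-ADBench). The framework performs text anomaly detection in three steps:

1. Text representation: pre-trained language models, including LLMs, convert each text into a vector embedding.
2. Anomaly detection: a range of anomaly detection algorithms, both conventional shallow algorithms and deep learning algorithms, run on these embeddings.
3. Comprehensive evaluation: the methods are evaluated on multiple text datasets using two metrics, AUROC and AUPRC.

* Main contributions:

1. Comprehensive evaluation of LLM embeddings: the benchmark represents text with a wide range of language models (the early GloVe and BERT as well as the more recent LLaMA-2, LLaMA-3, Mistral, and OpenAI embedding models) and examines how each performs on anomaly detection.
2. Comparison of pooling strategies: the experiments compare how different pooling strategies ("mean", "end-of-sequence (EOS) token", and "weighted mean") affect detection quality when token-level embeddings are aggregated into a sentence-level embedding.
3. Broad coverage of anomaly detection algorithms: the benchmark spans conventional shallow machine learning algorithms (KNN, Isolation Forest, and others), deep learning algorithms (AutoEncoder, Deep SVDD, and others), and anomaly detection methods designed specifically for text (CVDD, DATE), which gives a fuller picture of how different algorithms perform on text anomaly detection.
4. Low-rank analysis: the paper observes that the model performance matrix is low-rank, which opens the door to fast model evaluation and selection.
5. Open-source benchmark toolkit: all data, embeddings, and code are released so that other researchers can reproduce the results and build on them.

How it differs from existing work: existing benchmarks typically focus on shallow anomaly detection algorithms or use only a limited number of LLMs. Text-ADBench goes further in both the diversity and the coverage of LLM embeddings. Its examination of pooling strategies is another distinguishing contribution.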

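2. Algorithm Details and Workflow

The Text-ADBench workflow consists of the following steps:

1. Dataset preparation:
* Select several text classification datasets, such as 20Newsgroups, Reuters, and IMDB.
* In each dataset, treat one class as the normal samples and the remaining classes as anomalies.
* Split the dataset into a training set and a test set.

2. Text embedding generation:
* Select a set of pre-trained language models (for example GloVe, BERT, LLaMA-2, LLaMA-3, Mistral, and the OpenAI embedding models).
* For every text sequence in the training and test sets, extract token-level embeddings with the chosen model.
* Aggregate the token-level embeddings of each sequence into a single vector with a given pooling strategy ("mean", "end-of-sequence (EOS) token", or "weighted mean").

3. Anomaly detection model training and testing:
* Select a set of anomaly detection algorithms, including shallow algorithms (such as OCSVM, Isolation Forest, and KNN) and deep learning algorithms (such as AutoEncoder and Deep SVDD).
* Train each anomaly detection model on the text embeddings of the training set.
* Use the trained model to assign an anomaly score to each text embedding in the test set.

4. Performance evaluation:
* Evaluate the anomaly detection models with AUROC and AUPRC.
* Compare performance across LLM embeddings and across anomaly detection algorithms.
* Analyze the low-rank structure of the model performance matrix.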
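
Steps 1, 3, and 4 above can be sketched end to end on synthetic data. This is a minimal illustration, not the benchmark's code: the Gaussian "embeddings" stand in for real pooled language-model embeddings, and the helper names `make_one_class_split` and `knn_scores`, the 70/30 split, and `k=5` are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stand-in for pooled text embeddings: 3 classes of 16-dim Gaussian blobs (synthetic).
n_per_class, dim = 200, 16
centers = rng.normal(scale=4.0, size=(3, dim))
X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(3), n_per_class)


def make_one_class_split(X, y, normal_class, train_frac=0.7):
    """Hypothetical helper: train on normal samples only; test on the
    held-out normal samples plus every sample from the other classes."""
    normal_idx = rng.permutation(np.flatnonzero(y == normal_class))
    n_train = int(train_frac * len(normal_idx))
    train_idx, test_normal_idx = normal_idx[:n_train], normal_idx[n_train:]
    anomaly_idx = np.flatnonzero(y != normal_class)
    X_test = np.vstack([X[test_normal_idx], X[anomaly_idx]])
    y_test = np.concatenate([np.zeros(len(test_normal_idx)),
                             np.ones(len(anomaly_idx))])  # 1 = anomaly
    return X[train_idx], X_test, y_test


def knn_scores(X_train, X_test, k=5):
    """Anomaly score = distance to the k-th nearest training embedding."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    distances, _ = nn.kneighbors(X_test)
    return distances[:, -1]


X_train, X_test, y_test = make_one_class_split(X, y, normal_class=0)
scores = knn_scores(X_train, X_test)
auroc = roc_auc_score(y_test, scores)
auprc = average_precision_score(y_test, scores)
print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}")
```

In a full benchmark run, this loop would presumably be repeated for every choice of normal class, embedding model, pooling strategy, and detector, with the resulting scores collected into a table.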

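The technical strengths of Text-ADBench are:

* Comprehensiveness: it covers many LLM embeddings, pooling strategies, and anomaly detection algorithms, and compares their performance side by side.
* Ease of use: all data, embeddings, and code are open source, so researchers can reproduce the results and extend them.
* Efficiency: the low-rank structure of the model performance matrix makes fast model evaluation and selection possible.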
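
One way to check whether a performance matrix is approximately low-rank is to inspect its singular values. The sketch below uses a synthetic matrix built for illustration only; it contains no results from the paper, and the layout (rows as embedding models, columns as dataset and detector combinations) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a performance matrix (NOT results from the paper):
# rows = embedding models, columns = (dataset, detector) combinations.
n_models, n_settings, true_rank = 8, 30, 2
P = (rng.uniform(0.3, 1.0, size=(n_models, true_rank))
     @ rng.uniform(0.3, 1.0, size=(true_rank, n_settings)))
P += rng.normal(scale=0.01, size=P.shape)  # noise: only approximately low-rank

# Singular values show how much of the matrix each rank-1 component explains.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)


def low_rank_approx(U, s, Vt, r):
    """Best rank-r approximation in the Frobenius norm (Eckart-Young theorem)."""
    return (U[:, :r] * s[:r]) @ Vt[:r]


P2 = low_rank_approx(U, s, Vt, r=2)
rel_err = np.linalg.norm(P - P2) / np.linalg.norm(P)
print("cumulative energy of the first singular values:", np.round(energy[:4], 4))
print(f"relative error of the rank-2 approximation: {rel_err:.4f}")
```

If a few singular values carry almost all of the energy, a small number of measured entries can in principle predict the rest of the matrix, which is the intuition behind using low-rank structure for faster model selection.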

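3. A Close Reading of Section 3 (Problem Formulation)

Section 3 of the paper, Problem Formulation, formally defines the text anomaly detection task and splits it into two stages: text representation and unsupervised anomaly detection. The details are as follows.

* Definition 1 (text anomaly detection):

This part gives the formal definition of text anomaly detection. Let $C = \{s_1, s_2, \ldots, s_n\}$ be a text corpus of $n$ text sequences, most of which follow some unknown normal state or pattern $P$. The goal of text anomaly detection is to learn a detector $f$ that can decide whether a new text sequence $s_{new}$ follows pattern $P$.

* Splitting text anomaly detection into two stages:

1. Text representation: extract text embeddings with early language models (GloVe and BERT) and with large language models (LLMs).
2. Unsupervised anomaly detection: pose a general unsupervised anomaly detection task on the text embeddings.

This split lets researchers study text representation and anomaly detection algorithms independently, and makes it easy to combine any representation method with any anomaly detection algorithm.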

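* Notation:

The paper defines several symbols that the later formulas and algorithm descriptions rely on.

1. Embedding-based text representation: let $\text{Memb}$ be a language model (for example, BERT or LLaMA-3). The embeddings $X = \{x_1, x_2, \ldots, x_n\}$ of the corpus $C$ are obtained as:

$$x_i = \text{Pooling}(\text{Memb}(s_i)), \quad i = 1, 2, \ldots, n,$$

where $\text{Pooling}(\cdot)$ aggregates the token-level embeddings of sequence $s_i$ into a single vector $x_i \in \mathbb{R}^d$, and $d$ is the embedding dimension of the language model.

2. Unsupervised anomaly detection: assume $X$ is drawn from an unknown distribution $D_x \subseteq \mathbb{R}^d$. A point $x \in \mathbb{R}^d$ with $x \notin D_x$ is considered anomalous. The goal of unsupervised AD is then to obtain, using only $X$, a decision function $h_{UAD}: \mathbb{R}^d \rightarrow \{0,1\}$ such that $h_{UAD}(x) = 0$ if $x \in D_x$ and $h_{UAD}(x) = 1$ if $x \notin D_x$. AD methods differ mainly in how the decision function $h_{UAD}(\cdot)$ is designed.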

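* The overall detector:

Combining the two stages, the detector $f$ can be written as:

$$f(s) := h_{UAD}(\text{Pooling}(\text{Memb}(s))).$$

The value of $f(s_{new})$ determines whether the text sequence $s_{new}$ is normal or anomalous.

By combining different embedding models, different pooling operations, and different UAD algorithms, Text-ADBench can evaluate the key components of text anomaly detection comprehensively.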
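
The composition $f(s) = h_{UAD}(\text{Pooling}(\text{Memb}(s)))$ can be made concrete with three interchangeable functions. Everything in this sketch is a toy stand-in chosen for illustration: `toy_memb` hashes tokens to pseudo-random vectors instead of calling a language model, and `make_h_uad` is a simple distance-to-mean threshold rather than any detector from the paper.

```python
import zlib

import numpy as np

DIM = 8


def toy_memb(text):
    """Hypothetical stand-in for Memb: one pseudo-random vector per whitespace token."""
    vecs = []
    for tok in text.lower().split():
        seed = zlib.crc32(tok.encode("utf-8"))  # deterministic across runs
        vecs.append(np.random.default_rng(seed).normal(size=DIM))
    return np.stack(vecs)  # shape: (num_tokens, DIM)


def pooling(token_embeddings):
    """Mean pooling over tokens."""
    return token_embeddings.mean(axis=0)


def make_h_uad(X, quantile=0.95):
    """Toy h_UAD: flag points farther from the training mean than most training points."""
    center = X.mean(axis=0)
    threshold = np.quantile(np.linalg.norm(X - center, axis=1), quantile)
    return lambda x: int(np.linalg.norm(x - center) > threshold)


def make_detector(corpus):
    """Build f from a corpus of (mostly) normal sequences."""
    X = np.stack([pooling(toy_memb(s)) for s in corpus])
    h_uad = make_h_uad(X)
    return lambda s: h_uad(pooling(toy_memb(s)))  # f(s) = h_UAD(Pooling(Memb(s)))


corpus = [
    "the match ended in a draw",
    "the team won the match",
    "a late goal won the game",
    "the game ended after a late goal",
    "the team lost the game",
]
f = make_detector(corpus)
print(f("the team won the game"), f("quarterly bond yields fell sharply"))
```

Because each stage is a plain function, swapping the embedding model, the pooling rule, or the detector changes one argument and leaves the rest of the pipeline untouched, which is what makes a grid-style benchmark practical.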

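4. Implementation Details and Caveats

* Key implementation details:

1. LLM embeddings: load the pre-trained language model with the transformers library, convert each text sequence to token IDs with the model's `tokenizer`, and then extract token-level embeddings with the model.
2. Pooling strategies:
* Mean Pooling: average the embeddings of all tokens.
* EOS Token Pooling: use the embedding at the end-of-sequence (EOS) token.
* Weighted Mean Pooling: take a weighted average of the token embeddings, with weights from an attention mechanism or another weighting scheme.
3. Anomaly detection algorithms: implement the shallow anomaly detection algorithms with scikit-learn and the deep learning algorithms with PyTorch.
4. Performance evaluation: compute AUROC and AUPRC with scikit-learn.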
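
The three pooling strategies in item 2 can be sketched in NumPy on a padded batch of token embeddings. The names `H` (token embeddings) and `mask` (1 for real tokens, 0 for padding) are illustrative, right padding is assumed, and the linear position weighting in `weighted_mean_pool` is only one possible weighting scheme, chosen here as an assumption.

```python
import numpy as np


def mean_pool(H, mask):
    """Average the embeddings of real (non-padding) tokens."""
    m = mask[:, :, None].astype(H.dtype)
    return (H * m).sum(axis=1) / m.sum(axis=1)


def eos_pool(H, mask):
    """Take the embedding of the last real token (assumes right padding)."""
    last = mask.sum(axis=1) - 1
    return H[np.arange(H.shape[0]), last]


def weighted_mean_pool(H, mask):
    """Position-weighted mean: token t (1-based) gets weight proportional to t.
    This weighting is an assumption for illustration; other schemes are possible."""
    positions = np.arange(1, H.shape[1] + 1)[None, :] * mask
    w = positions / positions.sum(axis=1, keepdims=True)
    return (H * w[:, :, None]).sum(axis=1)


# Toy batch: 2 sequences, up to 3 tokens each, embedding dimension d = 2.
H = np.array([[[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]],
              [[2.0, 2.0], [4.0, 6.0], [0.0, 0.0]]])  # last token of row 2 is padding
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

print(mean_pool(H, mask))           # rows: [3, 2] and [3, 4]
print(eos_pool(H, mask))            # rows: [5, 4] and [4, 6]
print(weighted_mean_pool(H, mask))  # later tokens count more than earlier ones
```

Ignoring the padding mask is a common source of silent errors: without it, mean pooling would average in padding vectors and EOS pooling would pick a padding position for short sequences.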

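* Implementation challenges and workarounds:

1. Memory limits: LLM embeddings can require a large amount of memory. GPU acceleration or a smaller LLM can help.
2. Compute time: training deep learning models can take a long time. GPU acceleration or a smaller model can help.
3. Hyperparameter tuning: anomaly detection algorithms can be sensitive to their parameter settings. Methods such as cross-validation can be used to tune them.

* Optimization tips and best practices:

1. Choose a suitable LLM: pick the LLM according to the characteristics of the dataset. For a sentiment analysis task, for example, an LLM fine-tuned on sentiment analysis data is a reasonable choice.
2. Choose a suitable pooling strategy: pick the pooling strategy according to the task. For long text sequences, for example, EOS Token Pooling may work better than Mean Pooling.
3. Use GPU acceleration: a GPU speeds up LLM embedding extraction as well as training and inference of deep learning models.
4. Use early stopping: early stopping helps prevent deep learning models from overfitting.

* Parameter settings and tuning:

1. LLM parameters: the LLM's default parameters are normally used.
2. Pooling parameters: Weighted Mean Pooling requires choosing a suitable way to compute the weights.
3. Anomaly detection algorithm parameters: methods such as cross-validation can be used for tuning.
* OCSVM: kernel, gamma
* Isolation Forest: n_estimators, contamination
* KNN: n_neighbors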
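
The hyperparameters listed above map directly onto scikit-learn arguments. The sketch below sets them explicitly on synthetic data; the data, the wrapper function names, and the specific values are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings (not benchmark data).
X_train = rng.normal(size=(300, 8))                       # normal samples only
X_test = np.vstack([rng.normal(size=(100, 8)),            # held-out normal samples
                    rng.normal(loc=4.0, size=(20, 8))])   # shifted anomalies
y_test = np.concatenate([np.zeros(100), np.ones(20)])     # 1 = anomaly


def ocsvm_scores(X_train, X_test, kernel="rbf", gamma="scale"):
    model = OneClassSVM(kernel=kernel, gamma=gamma).fit(X_train)
    return -model.score_samples(X_test)  # negate: sklearn scores are higher for normal points


def iforest_scores(X_train, X_test, n_estimators=100, contamination="auto"):
    model = IsolationForest(n_estimators=n_estimators, contamination=contamination,
                            random_state=0).fit(X_train)
    return -model.score_samples(X_test)  # negate for the same reason


def knn_scores(X_train, X_test, n_neighbors=5):
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_train)
    distances, _ = nn.kneighbors(X_test)
    return distances[:, -1]  # distance to the k-th neighbor; larger means more anomalous


auroc = {
    "OCSVM": roc_auc_score(y_test, ocsvm_scores(X_train, X_test)),
    "IForest": roc_auc_score(y_test, iforest_scores(X_train, X_test)),
    "KNN": roc_auc_score(y_test, knn_scores(X_train, X_test)),
}
for name, value in auroc.items():
    print(f"{name}: AUROC={value:.3f}")
```

Two details are easy to miss. In scikit-learn, `score_samples` returns higher values for more normal points, so it is negated before computing ranking metrics. And `contamination` in `IsolationForest` only sets the decision threshold used by `predict`; it does not change `score_samples`, so it has no effect on AUROC or AUPRC.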

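In summary, Text-ADBench gives researchers and practitioners in text anomaly detection a comprehensive benchmark for evaluating and comparing methods. Its open-source implementation also makes work in this area easier to reproduce and extend.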