Model Evaluation Metrics: The Statistics Behind the ROC Curve

November 8, 2016 · 18 min read

郭流芳

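Senior Algorithm Engineer

"Without measurement there is no improvement. In machine learning, how you evaluate a model matters more than the algorithm itself." (A lesson learned the hard way while doing model evaluation at 老虎致远 in 2015)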

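Opening: A Medical Diagnosis Story

Imagine you are a doctor with two cancer screening devices in front of you:

Device A: says "cancer" very readily; accuracy 70%
Device B: rarely says "cancer", but is usually right when it does; accuracy 95%

Which is better? Judged by accuracy alone, Device B looks superior. But suppose cancer patients make up only 1% of the population. Device A raises many false alarms, yet it misses very few real patients. Device B owes most of its 95% accuracy to the healthy people it correctly clears, and it may still miss many of the patients who actually need treatment. In fact, a device that never says "cancer" would reach 99% accuracy here while catching no one.

The story shows that on imbalanced data a single metric, accuracy in particular, can easily mislead us.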
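The accuracy trap can be checked with a few lines of arithmetic. The sketch below uses a hypothetical population of 1,000 people with 1% prevalence and two deliberately extreme devices; the helper name `accuracy_and_recall` is made up for this illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, recall

# Hypothetical population: 1,000 people, 1% of whom have cancer.
y_true = [1] * 10 + [0] * 990

never_positive = [0] * 1000   # a device that never says "cancer"
always_positive = [1] * 1000  # a device that always says "cancer"

acc_never, rec_never = accuracy_and_recall(y_true, never_positive)
acc_always, rec_always = accuracy_and_recall(y_true, always_positive)

print(f"never positive : accuracy={acc_never:.2f}, recall={rec_never:.2f}")
print(f"always positive: accuracy={acc_always:.2f}, recall={rec_always:.2f}")
```

The device that never raises an alarm scores 99% accuracy with zero recall, while the one that always raises an alarm has perfect recall and 1% accuracy. Neither number alone describes a useful test.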

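The Birth of the ROC Curve: From Radar to Machine Learning

Historical Background: WWII Radar Operators

The ROC (Receiver Operating Characteristic) curve dates back to World War II in the 1940s. British radar operators had to tell enemy aircraft apart from everything else that showed up on the screen. The difficulty:

Threshold too low: flocks of birds get mistaken for enemy aircraft (false positives)
Threshold too high: real enemy aircraft get missed (false negatives)

Engineers needed a way to assess how the radar system performed across different thresholds, and the ROC curve was the result.

From Signal Detection to Medical Diagnosis

From the 1960s, psychologists and medical researchers began using ROC analysis to evaluate diagnostic tests. From the late 1980s onward, as machine learning took off, the ROC curve became a standard tool for evaluating classification models.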

The Confusion Matrix: The Foundation of Everything

Before digging into ROC curves, we need to understand the confusion matrix:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

class ModelEvaluator:
    def __init__(self):
        self.history = []
    
    def plot_confusion_matrix(self, y_true, y_pred, title="Confusion Matrix"):
        """Plot the confusion matrix alongside the metrics derived from it."""
        # labels=[0, 1] guarantees a 2x2 matrix even if a class is missing
        cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

        # Unpack the four cells (sklearn order for binary labels)
        tn, fp, fn, tp = cm.ravel()

        # Derived metrics, guarding against division by zero
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
        f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # Visualization
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

        # Confusion matrix heatmap
        im = ax1.imshow(cm, interpolation='nearest', cmap='Blues')
        ax1.figure.colorbar(im, ax=ax1)

        # Annotate each cell with its count
        thresh = cm.max() / 2.
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                ax1.text(j, i, format(cm[i, j], 'd'),
                        ha="center", va="center",
                        color="white" if cm[i, j] > thresh else "black")

        ax1.set_ylabel('Actual label')
        ax1.set_xlabel('Predicted label')
        ax1.set_title(title)
        ax1.set_xticks([0, 1])
        ax1.set_yticks([0, 1])
        ax1.set_xticklabels(['Negative', 'Positive'])
        ax1.set_yticklabels(['Negative', 'Positive'])

        # Bar chart of the metrics
        metrics = ['Accuracy', 'Precision', 'Recall', 'Specificity', 'F1 score']
        values = [accuracy, precision, recall, specificity, f1_score]

        colors = ['skyblue', 'lightcoral', 'lightgreen', 'gold', 'plum']
        bars = ax2.bar(metrics, values, color=colors, alpha=0.7)
        ax2.set_ylim(0, 1)
        ax2.set_ylabel('Value')
        ax2.set_title('Model performance metrics')

        # Annotate each bar with its value
        for bar, value in zip(bars, values):
            height = bar.get_height()
            ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                    f'{value:.3f}', ha='center', va='bottom')

        ax2.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        # Print the detailed metrics
        print("=== Model evaluation report ===")
        print(f"True positives (TP): {tp}")
        print(f"False positives (FP): {fp}")
        print(f"True negatives (TN): {tn}")
        print(f"False negatives (FN): {fn}")
        print(f"Accuracy: {accuracy:.3f}")
        print(f"Precision: {precision:.3f}")
        print(f"Recall (sensitivity): {recall:.3f}")
        print(f"Specificity: {specificity:.3f}")
        print(f"F1 score: {f1_score:.3f}")
        
        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'specificity': specificity,
            'f1_score': f1_score,
            'confusion_matrix': cm
        }

# Build example data
def create_imbalanced_dataset():
    """Create an imbalanced dataset to show why the choice of metric matters."""
    # Imbalanced dataset (positives make up only about 10%)
    X, y = make_classification(n_samples=1000, n_features=20, n_redundant=0,
                             n_informative=10, n_classes=2, weights=[0.9, 0.1],
                             random_state=42)

    print(f"Dataset shape: {X.shape}")
    print(f"Positive ratio: {np.mean(y):.3f}")
    print(f"Negative count: {np.sum(y == 0)}")
    print(f"Positive count: {np.sum(y == 1)}")
    
    return X, y

# Confusion matrix demo
evaluator = ModelEvaluator()
X, y = create_imbalanced_dataset()

# Build two simple classifiers to compare
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Classifier 1: logistic regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Classifier 2: decision tree
dt = DecisionTreeClassifier(random_state=42, max_depth=5)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Evaluate both models
print("=== Logistic regression ===")
lr_metrics = evaluator.plot_confusion_matrix(y_test, y_pred_lr, "Logistic Regression - Confusion Matrix")

print("\n=== Decision tree ===")
dt_metrics = evaluator.plot_confusion_matrix(y_test, y_pred_dt, "Decision Tree - Confusion Matrix")

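The ROC Curve: Visualizing Model Performance

Core Concepts of the ROC Curve

The ROC curve shows how the true positive rate (TPR) trades off against the false positive rate (FPR) as the decision threshold varies:

TPR (true positive rate) = TP / (TP + FN): also called recall or sensitivity
FPR (false positive rate) = FP / (FP + TN): equal to 1 - specificity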
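To make the two definitions concrete, here is a minimal sketch that traces ROC points by hand, one threshold at a time. The labels, the scores, and the helper name `tpr_fpr_at` are invented for illustration; in practice a library routine such as `roc_curve` does the same sweep more efficiently.

```python
def tpr_fpr_at(y_true, y_scores, threshold):
    """TPR and FPR when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, y_scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_scores) if t == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg

# Hypothetical labels and scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1]
y_scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# Sweep thresholds from strict to lenient; each one yields one ROC point.
roc_points = []
for threshold in sorted(set(y_scores), reverse=True):
    tpr, fpr = tpr_fpr_at(y_true, y_scores, threshold)
    roc_points.append((fpr, tpr))
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both TPR and FPR move monotonically toward 1. That is why the curve always runs from the lower left to the upper right.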

class ROCAnalyzer:
    def __init__(self):
        self.colors = ['blue', 'red', 'green', 'orange', 'purple']
    
    def plot_roc_curve(self, models_data, title="ROC Curve Comparison"):
        """
        Plot ROC curves for several models.
        models_data: [(model_name, y_true, y_scores), ...]
        """
        plt.figure(figsize=(12, 8))

        # Diagonal: the performance of a random classifier
        plt.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random classifier (AUC = 0.5)')

        auc_scores = []

        for i, (name, y_true, y_scores) in enumerate(models_data):
            # Compute the ROC curve
            fpr, tpr, thresholds = roc_curve(y_true, y_scores)
            roc_auc = auc(fpr, tpr)
            auc_scores.append((name, roc_auc))

            # Plot the ROC curve
            color = self.colors[i % len(self.colors)]
            plt.plot(fpr, tpr, color=color, linewidth=2,
                    label=f'{name} (AUC = {roc_auc:.3f})')

            # Mark the point that maximizes Youden's J statistic (TPR - FPR)
            optimal_idx = np.argmax(tpr - fpr)
            optimal_threshold = thresholds[optimal_idx]
            plt.plot(fpr[optimal_idx], tpr[optimal_idx], color=color, 
                    marker='o', markersize=8, markerfacecolor='white', 
                    markeredgewidth=2)

            print(f"{name} - best threshold (Youden's J): {optimal_threshold:.3f}, "
                  f"TPR: {tpr[optimal_idx]:.3f}, FPR: {fpr[optimal_idx]:.3f}")

        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False positive rate (FPR)', fontsize=12)
        plt.ylabel('True positive rate (TPR)', fontsize=12)
        plt.title(title, fontsize=14)
        plt.legend(loc="lower right", fontsize=10)
        plt.grid(True, alpha=0.3)

        # Short reminder of how to read AUC
        plt.text(0.6, 0.2, 'The closer AUC is to 1.0,\nthe better the model ranks', 
                bbox=dict(boxstyle="round,pad=0.3", facecolor="lightblue", alpha=0.7),
                fontsize=10)

        plt.tight_layout()
        plt.show()

        # Return the AUC scores sorted from best to worst
        auc_scores.sort(key=lambda x: x[1], reverse=True)
        print("\n=== AUC ranking ===")
        for i, (name, score) in enumerate(auc_scores):
            print(f"{i+1}. {name}: {score:.3f}")

        return auc_scores
    
    def plot_threshold_analysis(self, y_true, y_scores, model_name="Model"):
        """Analyze how the decision threshold affects model performance."""
        fpr, tpr, thresholds = roc_curve(y_true, y_scores)
        # Drop the first threshold: sklearn adds it as a sentinel above every
        # score (it may be infinite), so nothing is predicted positive there
        thresholds = thresholds[1:]

        # Metrics at each remaining threshold
        precisions = []
        recalls = []
        f1_scores = []

        for threshold in thresholds:
            y_pred = (y_scores >= threshold).astype(int)
            # labels=[0, 1] keeps the matrix 2x2 even when y_pred has one class,
            # so "everything predicted positive" still gets recall = 1
            tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

            precisions.append(precision)
            recalls.append(recall)
            f1_scores.append(f1)

        # Visualization
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        # ROC curve
        axes[0, 0].plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC curve (AUC = {auc(fpr, tpr):.3f})')
        axes[0, 0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
        axes[0, 0].set_xlabel('False positive rate (FPR)')
        axes[0, 0].set_ylabel('True positive rate (TPR)')
        axes[0, 0].set_title(f'{model_name} - ROC curve')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)

        # Threshold vs precision and recall
        axes[0, 1].plot(thresholds, precisions, 'r-', label='Precision', linewidth=2)
        axes[0, 1].plot(thresholds, recalls, 'g-', label='Recall', linewidth=2)
        axes[0, 1].set_xlabel('Decision threshold')
        axes[0, 1].set_ylabel('Value')
        axes[0, 1].set_title('Threshold vs precision/recall')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)

        # Threshold vs F1 score
        axes[1, 0].plot(thresholds, f1_scores, 'purple', linewidth=2)
        axes[1, 0].set_xlabel('Decision threshold')
        axes[1, 0].set_ylabel('F1 score')
        axes[1, 0].set_title('Threshold vs F1 score')
        axes[1, 0].grid(True, alpha=0.3)

        # Find the threshold with the highest F1 score
        best_f1_idx = np.argmax(f1_scores)
        best_threshold = thresholds[best_f1_idx]
        best_f1 = f1_scores[best_f1_idx]

        axes[1, 0].axvline(x=best_threshold, color='red', linestyle='--', 
                          label=f'Best threshold = {best_threshold:.3f}')
        axes[1, 0].legend()

        # Precision-recall curve
        axes[1, 1].plot(recalls, precisions, 'orange', linewidth=2)
        axes[1, 1].set_xlabel('Recall')
        axes[1, 1].set_ylabel('Precision')
        axes[1, 1].set_title('Precision-recall curve')
        axes[1, 1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        print(f"=== {model_name}: best-threshold analysis ===")
        print(f"Best threshold: {best_threshold:.3f}")
        print(f"Best F1 score: {best_f1:.3f}")
        print(f"Precision at that threshold: {precisions[best_f1_idx]:.3f}")
        print(f"Recall at that threshold: {recalls[best_f1_idx]:.3f}")
        
        return best_threshold, best_f1

# Use the ROC analyzer
roc_analyzer = ROCAnalyzer()

# Get predicted probabilities
y_prob_lr = lr.predict_proba(X_test)[:, 1]  # probability of the positive class
y_prob_dt = dt.predict_proba(X_test)[:, 1]

# Compare the ROC curves
models_data = [
    ("Logistic Regression", y_test, y_prob_lr),
    ("Decision Tree", y_test, y_prob_dt)
]

auc_scores = roc_analyzer.plot_roc_curve(models_data, "Model Comparison - ROC Curves")

# Analyze thresholds for the model with the highest AUC
best_model_name, best_auc = auc_scores[0]
if best_model_name == "Logistic Regression":
    best_threshold, best_f1 = roc_analyzer.plot_threshold_analysis(y_test, y_prob_lr, "Logistic Regression")
else:
    best_threshold, best_f1 = roc_analyzer.plot_threshold_analysis(y_test, y_prob_dt, "Decision Tree")
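One detail worth spelling out: `np.argmax(tpr - fpr)` in `plot_roc_curve` selects the point with the largest Youden's J statistic, which is not always the point geometrically closest to the top-left corner. The sketch below compares the two criteria on a handful of hypothetical ROC points chosen so that they disagree; the point values and helper names are assumptions made for this example.

```python
import math

# Hypothetical ROC points (fpr, tpr), made up so the two criteria differ.
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.1, 0.72), (0.3, 0.95), (0.6, 0.97), (1.0, 1.0)]

def youden_index(point):
    """Youden's J statistic: TPR - FPR, the height above the diagonal."""
    fpr, tpr = point
    return tpr - fpr

def distance_to_corner(point):
    """Euclidean distance from the ROC point to the perfect corner (0, 1)."""
    fpr, tpr = point
    return math.hypot(fpr, 1.0 - tpr)

best_by_youden = max(roc_points, key=youden_index)
best_by_distance = min(roc_points, key=distance_to_corner)

print(f"max Youden's J     : {best_by_youden}")
print(f"closest to (0, 1)  : {best_by_distance}")
```

Both rules weigh false positives and false negatives equally, so neither replaces a threshold chosen from actual business costs.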

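The Geometric Meaning and Statistical Interpretation of AUC

AUC = the probability that a randomly chosen positive is ranked above a randomly chosen negative

Geometrically, AUC is simply the area under the ROC curve. Its most important statistical interpretation is the ranking probability above: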

def auc_intuition_demo():
    """Demonstrate the intuitive meaning of AUC."""
    np.random.seed(42)

    # Simple synthetic data
    n_pos, n_neg = 100, 900

    # Simulate two models of different quality
    # Good model: positives generally score higher than negatives
    pos_scores_good = np.random.normal(0.7, 0.2, n_pos)
    neg_scores_good = np.random.normal(0.3, 0.2, n_neg)

    # Bad model: positives and negatives share one score distribution,
    # so it is no better than random guessing
    pos_scores_bad = np.random.normal(0.5, 0.3, n_pos)
    neg_scores_bad = np.random.normal(0.5, 0.3, n_neg)

    # Combine the data (positives first, then negatives)
    y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    scores_good = np.concatenate([pos_scores_good, neg_scores_good])
    scores_bad = np.concatenate([pos_scores_bad, neg_scores_bad])

    # Compute AUC with sklearn
    auc_good = auc(*roc_curve(y_true, scores_good)[:2])
    auc_bad = auc(*roc_curve(y_true, scores_bad)[:2])

    # Compute AUC by hand, by counting correctly ordered pairs
    def manual_auc(y_true, y_scores):
        """Manual AUC: probability that a positive outranks a negative."""
        pos_scores = y_scores[y_true == 1]
        neg_scores = y_scores[y_true == 0]

        # Count the (positive, negative) pairs where the positive scores higher
        count = 0
        total = len(pos_scores) * len(neg_scores)

        for pos_score in pos_scores:
            for neg_score in neg_scores:
                if pos_score > neg_score:
                    count += 1
                elif pos_score == neg_score:
                    count += 0.5  # a tie counts as half

        return count / total

    manual_auc_good = manual_auc(y_true, scores_good)
    manual_auc_bad = manual_auc(y_true, scores_bad)

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Score distribution of the good model
    axes[0, 0].hist(pos_scores_good, alpha=0.7, label='Positive', bins=30, color='red')
    axes[0, 0].hist(neg_scores_good, alpha=0.7, label='Negative', bins=30, color='blue')
    axes[0, 0].set_title(f'Good model score distribution (AUC = {auc_good:.3f})')
    axes[0, 0].set_xlabel('Predicted score')
    axes[0, 0].set_ylabel('Count')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Score distribution of the bad model
    axes[0, 1].hist(pos_scores_bad, alpha=0.7, label='Positive', bins=30, color='red')
    axes[0, 1].hist(neg_scores_bad, alpha=0.7, label='Negative', bins=30, color='blue')
    axes[0, 1].set_title(f'Bad model score distribution (AUC = {auc_bad:.3f})')
    axes[0, 1].set_xlabel('Predicted score')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # ROC curve comparison
    fpr_good, tpr_good, _ = roc_curve(y_true, scores_good)
    fpr_bad, tpr_bad, _ = roc_curve(y_true, scores_bad)

    axes[1, 0].plot(fpr_good, tpr_good, 'g-', linewidth=3, label=f'Good model (AUC = {auc_good:.3f})')
    axes[1, 0].plot(fpr_bad, tpr_bad, 'r-', linewidth=3, label=f'Bad model (AUC = {auc_bad:.3f})')
    axes[1, 0].plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random classifier')
    axes[1, 0].set_xlabel('False positive rate (FPR)')
    axes[1, 0].set_ylabel('True positive rate (TPR)')
    axes[1, 0].set_title('ROC curve comparison')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Summary table (ranks are 0-based, with 0 the lowest score)
    auc_comparison = [
        ['Metric', 'Good model', 'Bad model'],
        ['sklearn AUC', f'{auc_good:.3f}', f'{auc_bad:.3f}'],
        ['Manual AUC', f'{manual_auc_good:.3f}', f'{manual_auc_bad:.3f}'],
        ['Mean rank of positives', f'{np.mean(np.argsort(np.argsort(scores_good))[:n_pos]):.1f}', 
         f'{np.mean(np.argsort(np.argsort(scores_bad))[:n_pos]):.1f}']
    ]

    axes[1, 1].axis('tight')
    axes[1, 1].axis('off')
    table = axes[1, 1].table(cellText=auc_comparison[1:], colLabels=auc_comparison[0],
                           cellLoc='center', loc='center')
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 2)
    axes[1, 1].set_title('AUC comparison')

    plt.tight_layout()
    plt.show()

    print("=== AUC intuition ===")
    print(f"Good model AUC: {auc_good:.3f}")
    print(f"  -> pick one positive and one negative at random: the positive scores higher {auc_good*100:.1f}% of the time")
    print(f"Bad model AUC: {auc_bad:.3f}")
    print(f"  -> pick one positive and one negative at random: the positive scores higher {auc_bad*100:.1f}% of the time")
    print("Random classifier AUC: 0.5")
    print("  -> pick one positive and one negative at random: the positive scores higher 50% of the time")

# Run the AUC intuition demo
auc_intuition_demo()
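The nested loop in `manual_auc` compares every positive with every negative, which is quadratic in the sample size. The same quantity can be obtained from ranks alone through the Mann-Whitney U statistic. The following is a minimal standard-library sketch of that identity; the function names `rank_auc` and `pairwise_auc` and the toy data are assumptions made for this illustration, and ties are handled with average ranks.

```python
def rank_auc(y_true, y_scores):
    """AUC from the rank-sum (Mann-Whitney U) formula, using average ranks for ties."""
    order = sorted(range(len(y_scores)), key=lambda i: y_scores[i])
    ranks = [0.0] * len(y_scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores that starts at sorted position i
        j = i
        while j + 1 < len(order) and y_scores[order[j + 1]] == y_scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    # U statistic of the positives, divided by the number of (pos, neg) pairs
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pairwise_auc(y_true, y_scores):
    """Reference value: fraction of (positive, negative) pairs ordered correctly."""
    pos = [s for s, t in zip(y_scores, y_true) if t == 1]
    neg = [s for s, t in zip(y_scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data with two tied pairs across classes.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.4, 0.5, 0.6, 0.8, 0.8, 0.9]

auc_from_ranks = rank_auc(y_true, y_scores)
auc_from_pairs = pairwise_auc(y_true, y_scores)
print(f"rank-based AUC: {auc_from_ranks:.4f}")
print(f"pairwise AUC  : {auc_from_pairs:.4f}")
```

Because only the sort is more than linear, the rank version runs in O(n log n), which matters once the evaluation set grows beyond a few thousand rows.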

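Lessons from Practice at 老虎致远

During my three years at 老虎致远, I applied ROC analysis in a number of different settings:

Scenario 1: Fraud Detection System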

def fraud_detection_case_study():
    """Case study: evaluating a fraud detection system."""
    print("=== Fraud detection evaluation case study ===")
    print("Background: fraud is only 1% of the simulated transactions, but each fraud is very costly")

    # Simulate heavily imbalanced fraud data (100 frauds in 10,000 transactions)
    np.random.seed(42)
    n_normal = 9900
    n_fraud = 100

    # Normal transactions: low risk scores
    normal_scores = np.random.beta(2, 8, n_normal)  # skewed toward low scores
    # Fraudulent transactions: high risk scores
    fraud_scores = np.random.beta(6, 2, n_fraud)   # skewed toward high scores

    y_true = np.concatenate([np.zeros(n_normal), np.ones(n_fraud)])
    y_scores = np.concatenate([normal_scores, fraud_scores])

    # Compute the ROC curve
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)

    # Business metrics at a given threshold
    def calculate_business_metrics(threshold, cost_per_fraud=10000, cost_per_false_alarm=10):
        y_pred = (y_scores >= threshold).astype(int)
        cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
        tn, fp, fn, tp = cm.ravel()

        # Business cost
        fraud_loss = fn * cost_per_fraud      # loss from missed frauds
        false_alarm_cost = fp * cost_per_false_alarm  # cost of handling false alarms
        total_cost = fraud_loss + false_alarm_cost

        # Detection rate and false alarm rate
        detection_rate = tp / (tp + fn) if (tp + fn) > 0 else 0
        false_alarm_rate = fp / (fp + tn) if (fp + tn) > 0 else 0

        return {
            'threshold': threshold,
            'detection_rate': detection_rate,
            'false_alarm_rate': false_alarm_rate,
            'total_cost': total_cost,
            'fraud_loss': fraud_loss,
            'false_alarm_cost': false_alarm_cost,
            'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn
        }

    # Try several thresholds and compare the business outcome
    test_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
    results = []

    print("\nRisk threshold analysis:")
    print("-" * 80)
    print(f"{'Thresh':<8} {'Detect':<8} {'FalseAl':<8} {'Missed loss':<12} {'Alarm cost':<12} {'Total cost':<12}")
    print("-" * 80)

    for threshold in test_thresholds:
        result = calculate_business_metrics(threshold)
        results.append(result)
        print(f"{result['threshold']:<8.1f} "
              f"{result['detection_rate']:<8.3f} "
              f"{result['false_alarm_rate']:<8.3f} "
              f"${result['fraud_loss']:<11,.0f} "
              f"${result['false_alarm_cost']:<11,.0f} "
              f"${result['total_cost']:<11,.0f}")

    # Pick the threshold with the lowest total cost
    best_result = min(results, key=lambda x: x['total_cost'])
    print(f"\nBest business threshold: {best_result['threshold']:.1f}")
    print(f"Minimum total cost: ${best_result['total_cost']:,.0f}")

    # Visualization
    plt.figure(figsize=(15, 5))

    # ROC curve
    plt.subplot(1, 3, 1)
    plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
    plt.plot([0, 1], [0, 1], 'k--', alpha=0.5)
    plt.xlabel('False positive rate (false alarm rate)')
    plt.ylabel('True positive rate (detection rate)')
    plt.title('Fraud detection ROC curve')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Cost analysis
    plt.subplot(1, 3, 2)
    thresholds_plot = [r['threshold'] for r in results]
    total_costs = [r['total_cost'] for r in results]
    fraud_losses = [r['fraud_loss'] for r in results]
    false_alarm_costs = [r['false_alarm_cost'] for r in results]

    plt.plot(thresholds_plot, total_costs, 'ro-', label='Total cost', linewidth=2)
    plt.plot(thresholds_plot, fraud_losses, 'b--', label='Fraud loss', alpha=0.7)
    plt.plot(thresholds_plot, false_alarm_costs, 'g--', label='False alarm cost', alpha=0.7)
    plt.xlabel('Risk threshold')
    plt.ylabel('Cost ($)')
    plt.title('Business cost analysis')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Detection rate vs false alarm rate
    plt.subplot(1, 3, 3)
    detection_rates = [r['detection_rate'] for r in results]
    false_alarm_rates = [r['false_alarm_rate'] for r in results]

    plt.scatter(false_alarm_rates, detection_rates, c=total_costs, 
               cmap='Reds', s=100, alpha=0.7)
    plt.colorbar(label='Total cost ($)')
    plt.xlabel('False alarm rate')
    plt.ylabel('Detection rate')
    plt.title('Detection rate vs false alarm rate\n(color shows total cost)')
    plt.grid(True, alpha=0.3)

    # Label each point with its threshold
    for i, result in enumerate(results):
        plt.annotate(f"{result['threshold']:.1f}", 
                    (result['false_alarm_rate'], result['detection_rate']),
                    xytext=(5, 5), textcoords='offset points', fontsize=9)

    plt.tight_layout()
    plt.show()

# Run the fraud detection case study
fraud_detection_case_study()
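The case study searches a small grid of thresholds. When a model outputs well-calibrated fraud probabilities, the cost-minimizing threshold also has a closed form: flag a transaction when the expected loss of letting it pass, p × cost_per_fraud, exceeds the expected cost of a false alarm, (1 - p) × cost_per_false_alarm. The sketch below works this out with the same two cost figures as above. It assumes calibrated probabilities, which the simulated beta-distributed scores are not, so treat it as a reference point rather than a replacement for the empirical search. The function names are made up for this example.

```python
def expected_costs(p_fraud, cost_per_fraud, cost_per_false_alarm):
    """Expected cost of each action for one transaction with fraud probability p_fraud."""
    cost_if_passed = p_fraud * cost_per_fraud               # we may miss a fraud
    cost_if_flagged = (1 - p_fraud) * cost_per_false_alarm  # we may raise a false alarm
    return cost_if_passed, cost_if_flagged

def break_even_threshold(cost_per_fraud, cost_per_false_alarm):
    """Probability above which flagging is cheaper in expectation."""
    return cost_per_false_alarm / (cost_per_false_alarm + cost_per_fraud)

threshold = break_even_threshold(cost_per_fraud=10000, cost_per_false_alarm=10)
print(f"break-even probability: {threshold:.6f}")

# Check both sides of the break-even point
for p in (0.0005, 0.002):
    passed, flagged = expected_costs(p, 10000, 10)
    action = "flag" if flagged < passed else "pass"
    print(f"p={p}: pass costs {passed:.2f}, flag costs {flagged:.2f} -> {action}")
```

With a 1000:1 cost ratio the break-even probability is about 0.1%, which is why cost-driven thresholds in fraud settings tend to sit far below the default of 0.5.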

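Principles for Choosing Evaluation Metrics

Based on my hands-on experience at 老虎致远, I settled on the following principles:

Business scenario	Metric to focus on	Reason
Medical diagnosis	Recall > precision	A missed diagnosis costs far more than a false alarm
Spam filtering	Precision > recall	Deleting a legitimate email is costly
Recommender systems	AUC	Ranking quality matters more than hard classification
Fraud detection	Minimum business cost	Detection has to be balanced against false-alarm cost
Credit scoring	AUC + calibration	The predicted probabilities themselves must be accurate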
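The last row of the table pairs AUC with calibration because AUC depends only on the ordering of the scores, not on their values. The sketch below illustrates this with the Brier score (mean squared error between predicted probability and outcome), which is one common, though not the only, probability-accuracy check. The toy labels, the scores, and the helper names are assumptions made for this example.

```python
def pairwise_auc(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ordered correctly."""
    pos = [s for s, t in zip(y_scores, y_true) if t == 1]
    neg = [s for s, t in zip(y_scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - t) ** 2 for p, t in zip(y_prob, y_true)) / len(y_true)

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
original = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# Squaring is monotonic on [0, 1]: the ranking is unchanged, the values are not.
squashed = [p ** 2 for p in original]

auc_original = pairwise_auc(y_true, original)
auc_squashed = pairwise_auc(y_true, squashed)
brier_original = brier_score(y_true, original)
brier_squashed = brier_score(y_true, squashed)

print(f"AUC  : original={auc_original:.4f}  squashed={auc_squashed:.4f}")
print(f"Brier: original={brier_original:.4f}  squashed={brier_squashed:.4f}")
```

The two score sets are indistinguishable by AUC, yet their Brier scores differ. In credit scoring, where the probability feeds directly into pricing or limits, that difference is the one that matters.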

Beyond ROC: Other Important Evaluation Metrics

def comprehensive_evaluation_demo():
    """Side-by-side evaluation on datasets with different class balance."""
    # Build datasets with different degrees of imbalance
    datasets = {
        "Balanced": make_classification(n_samples=1000, weights=[0.5, 0.5], random_state=42),
        "Mildly imbalanced": make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42),
        "Severely imbalanced": make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
    }

    from sklearn.metrics import precision_recall_curve, average_precision_score

    fig, axes = plt.subplots(3, 3, figsize=(18, 12))

    for i, (dataset_name, (X, y)) in enumerate(datasets.items()):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

        # Train the model
        model = LogisticRegression(random_state=42)
        model.fit(X_train, y_train)
        y_scores = model.predict_proba(X_test)[:, 1]

        # ROC curve
        fpr, tpr, _ = roc_curve(y_test, y_scores)
        roc_auc = auc(fpr, tpr)

        axes[i, 0].plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
        axes[i, 0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
        axes[i, 0].set_xlabel('False positive rate')
        axes[i, 0].set_ylabel('True positive rate')
        axes[i, 0].set_title(f'{dataset_name} - ROC curve')
        axes[i, 0].legend()
        axes[i, 0].grid(True, alpha=0.3)

        # PR curve, summarized by average precision (AP)
        precision, recall, _ = precision_recall_curve(y_test, y_scores)
        pr_auc = average_precision_score(y_test, y_scores)

        axes[i, 1].plot(recall, precision, 'r-', linewidth=2, label=f'PR (AP = {pr_auc:.3f})')
        # Baseline: precision of a random classifier equals the positive ratio
        baseline = np.mean(y_test)
        axes[i, 1].axhline(y=baseline, color='k', linestyle='--', alpha=0.5, 
                          label=f'Baseline = {baseline:.3f}')
        axes[i, 1].set_xlabel('Recall')
        axes[i, 1].set_ylabel('Precision')
        axes[i, 1].set_title(f'{dataset_name} - PR curve')
        axes[i, 1].legend()
        axes[i, 1].grid(True, alpha=0.3)

        # Class distribution
        pos_ratio = np.mean(y_test)
        neg_ratio = 1 - pos_ratio

        axes[i, 2].bar(['Negative', 'Positive'], [neg_ratio, pos_ratio], 
                      color=['lightblue', 'lightcoral'], alpha=0.7)
        axes[i, 2].set_ylabel('Ratio')
        axes[i, 2].set_title(f'{dataset_name} - class distribution')
        axes[i, 2].set_ylim(0, 1)

        # Annotate the bars with the ratios
        axes[i, 2].text(0, neg_ratio + 0.05, f'{neg_ratio:.3f}', 
                       ha='center', va='bottom', fontweight='bold')
        axes[i, 2].text(1, pos_ratio + 0.05, f'{pos_ratio:.3f}', 
                       ha='center', va='bottom', fontweight='bold')
        axes[i, 2].grid(True, alpha=0.3)

        print(f"\n=== {dataset_name} ===")
        print(f"Positive ratio: {pos_ratio:.3f}")
        print(f"ROC AUC: {roc_auc:.3f}")
        print(f"PR AUC (average precision): {pr_auc:.3f}")
        print(f"Baseline precision: {baseline:.3f}")

    plt.tight_layout()
    plt.show()

    print("\n=== Suggestions for choosing a metric ===")
    print("1. Balanced data: ROC AUC and PR AUC are both informative")
    print("2. Mild imbalance: give more weight to PR AUC")
    print("3. Severe imbalance: rely mainly on PR AUC; ROC AUC alone can look overly optimistic")

# Run the comprehensive evaluation demo
comprehensive_evaluation_demo()
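Why can a ROC curve look healthy while the PR curve collapses? TPR and FPR are each computed within a single class, so a ROC operating point does not change with class balance, whereas precision mixes both classes. The sketch below fixes a hypothetical operating point (TPR = 0.90, FPR = 0.05) and recomputes the implied precision at several prevalence levels; all numbers are chosen for illustration.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a fixed (TPR, FPR) operating point at a given prevalence."""
    true_pos = tpr * prevalence          # share of the population that is TP
    false_pos = fpr * (1 - prevalence)   # share of the population that is FP
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point: the same spot on the ROC curve every time.
tpr, fpr = 0.90, 0.05

precisions = {}
for prevalence in (0.5, 0.2, 0.05, 0.001):
    precisions[prevalence] = precision_from_rates(tpr, fpr, prevalence)
    print(f"prevalence={prevalence:<6} precision={precisions[prevalence]:.3f}")
```

At 50% prevalence almost every alarm is a true positive; at 0.1% prevalence fewer than 2 in 100 are, even though the ROC point has not moved.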

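Summary: The Art of Measurement

My three years at 老虎致远 and ROC analysis taught me one important lesson: there is no perfect evaluation metric, only the metric that best fits the business scenario.

Key takeaways:

Know your data distribution: balanced vs. imbalanced
Understand the business costs: what a false positive costs vs. a false negative
Choose suitable metrics: ROC AUC vs. PR AUC vs. custom business metrics
Optimize the threshold: against the business objective, not a single metric

This systematic way of thinking about evaluation helped me make better technical decisions later in my career, and it showed me how much "measurement" matters in data science.

As a saying often attributed to the management thinker Peter Drucker goes: "If you can't measure it, you can't manage it." The same holds in machine learning: if you cannot evaluate correctly, you cannot build a model of real value.

I hope this article helps you build a complete mental framework for model evaluation. In the next article I will cover the principles and practice of support vector machines. Stay tuned!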
