Feature Engineering Methodology



2025-09-05

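Feature engineering is one of the most important stages in machine learning. As the saying goes, "data and features determine the upper bound of a model, and algorithms merely approximate that bound." This article covers numeric feature processing, categorical encoding, feature crossing, feature selection, and feature storage.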

Feature Engineering Workflow

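Numeric Feature Processing

Standardization

Rescales a feature to zero mean and unit standard deviation. It suits models that are sensitive to feature scale, such as regularized linear models, SVMs, and neural networks trained with gradient descent. It does not require the feature to be normally distributed, and it does not make a skewed feature normal.

Min-Max Normalization

Rescales a feature to the [0, 1] interval. It suits cases that need a bounded range and make no distributional assumption. Because it depends on the observed minimum and maximum, it is sensitive to outliers.

Log Transform

Taking the logarithm of a long-tailed feature compresses the tail and brings the distribution closer to normal. It is commonly used for right-skewed data such as income, price, and visit counts.

Binning

Discretizes a continuous feature into intervals. This makes the model more robust to outliers and lets it capture non-linear relationships.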

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, KBinsDiscretizer

df = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 38, 29, 55],
    'income': [30000, 55000, 80000, 120000, 200000, 65000, 40000, 150000],
    'visits': [1, 5, 12, 3, 45, 8, 2, 100],
})

# Standardization (fit_transform returns an (n, 1) array, so flatten it)
scaler = StandardScaler()
df['age_scaled'] = scaler.fit_transform(df[['age']]).ravel()

# Min-max normalization
minmax = MinMaxScaler()
df['income_normalized'] = minmax.fit_transform(df[['income']]).ravel()

# Log transform for long-tailed features (log1p = log(1 + x), safe at x = 0)
df['income_log'] = np.log1p(df['income'])
df['visits_log'] = np.log1p(df['visits'])

# Quantile binning into 4 bins, encoded as ordinal integers 0..3
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df['age_binned'] = binner.fit_transform(df[['age']]).ravel()

# Custom bins (intervals are right-closed: (0, 30], (30, 45], ...)
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100],
                         labels=['young', 'middle-aged', 'late middle-aged', 'senior'])
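The snippets above fit each scaler on the whole table for brevity. In a modeling workflow the scaling statistics are normally learned from the training split only and then reused on held-out data, so that no information from the test set reaches the model. The following is a minimal sketch of that pattern; the toy `ages` array and the split parameters are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature, 8 samples.
ages = np.array([[25], [32], [47], [51], [62], [38], [29], [55]], dtype=float)
train, test = train_test_split(ages, test_size=0.25, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std learned from train only
test_scaled = scaler.transform(test)        # the same mean and std are reused

print("train mean/std used for scaling:", scaler.mean_, scaler.scale_)
```

The same fit-then-transform split applies to `MinMaxScaler`, `KBinsDiscretizer`, and the encoders discussed later.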

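RobustScaler

A scaling method that is robust to outliers. It centers on the median and scales by the interquartile range instead of the mean and standard deviation:

from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
df['income_robust'] = robust_scaler.fit_transform(df[['income']]).ravel()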
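To illustrate why the median and interquartile range help, the sketch below scales a small income column containing one extreme value with both `RobustScaler` and `StandardScaler`, and checks the robust result against the formula (x - median) / (Q3 - Q1). The income values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical incomes; the last value is an extreme outlier.
income = np.array([[30000], [40000], [55000], [65000], [80000], [5000000]],
                  dtype=float)

robust = RobustScaler().fit_transform(income)
standard = StandardScaler().fit_transform(income)

# Manual equivalent of RobustScaler with its default 25th-75th percentile range.
median = np.median(income)
q1, q3 = np.percentile(income, [25, 75])
manual = (income - median) / (q3 - q1)

# Spread of the five ordinary values under each scaler.
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()
print(f"robust spread: {robust_spread:.3f}, standard spread: {standard_spread:.3f}")
```

With `StandardScaler` the outlier inflates the standard deviation, so the ordinary values are squeezed into a narrow band; with `RobustScaler` they keep a usable spread.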

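Categorical Feature Encoding

One-Hot Encoding

Each category value becomes its own binary feature. Suitable for unordered features with a small number of categories (fewer than about 20).

Label Encoding

Maps categories to integers. Suitable for ordinal categories or tree-based models. For ordinal categories the integers must follow the real category order, so that order should be given explicitly instead of being left to alphabetical sorting.

Target Encoding

Replaces each category with the mean of the target variable for that category. This can leak the target into the features, so the encoding should be computed out-of-fold with cross-validation.

Embedding Encoding

Uses a neural network to learn a low-dimensional dense representation of each category. Suitable for high-cardinality categorical features.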
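To make the embedding idea concrete, here is a minimal NumPy sketch of the lookup step only: each category is mapped to an integer index, and that index selects a row of a dense table. The table here is randomly initialised as a stand-in; in a real model its rows are trainable weights learned together with the rest of the network. The city list, `embedding_dim`, and all variable names are illustrative assumptions.

```python
import numpy as np

cities = ["Beijing", "Shanghai", "Guangzhou", "Beijing", "Shenzhen"]

# Map each category to an integer index (sorted for reproducibility).
vocab = {name: idx for idx, name in enumerate(sorted(set(cities)))}
indices = np.array([vocab[c] for c in cities])

# Embedding table: one row per category. Random values stand in for
# weights that would normally be learned by backpropagation.
embedding_dim = 3
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

# The lookup turns each category into a dense vector.
dense = embedding_table[indices]
print(dense.shape)  # (5, 3)
```

Identical categories always receive the same vector, and the width of the representation is `embedding_dim` regardless of how many categories exist, which is why the approach scales to high-cardinality features.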

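from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import category_encoders as ce

# Add categorical columns and a target to the existing DataFrame
# (the numeric columns are reused in the feature crossing examples)
df = df.assign(
    city=['Beijing', 'Shanghai', 'Guangzhou', 'Beijing',
          'Shenzhen', 'Shanghai', 'Guangzhou', 'Beijing'],
    education=['bachelor', 'master', 'phd', 'bachelor',
               'master', 'phd', 'bachelor', 'master'],
    target=[0, 1, 1, 0, 1, 1, 0, 1],
)

# One-hot encoding
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
city_encoded = ohe.fit_transform(df[['city']])

# Label (ordinal) encoding with an explicit category order.
# LabelEncoder sorts labels alphabetically and is intended for targets.
oe = OrdinalEncoder(categories=[['bachelor', 'master', 'phd']])
df['education_encoded'] = oe.fit_transform(df[['education']]).ravel()

# Target encoding (smoothing shrinks rare categories toward the global mean).
# Fitting on the full data, as here, leaks the target; fit on training folds in practice.
te = ce.TargetEncoder(cols=['city'], smoothing=10)
df['city_target'] = te.fit_transform(df[['city']], df['target'])['city']

# Frequency encoding
freq_map = df['city'].value_counts(normalize=True).to_dict()
df['city_freq'] = df['city'].map(freq_map)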
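The target encoding paragraph above recommends computing the encoding with cross-validation. A minimal out-of-fold sketch using only pandas and `KFold` is shown below: each row is encoded with category means computed from the other folds, so its own label never contributes to its feature value. The frame `df_te`, the fold count, and the fallback to the fold's overall mean for unseen categories are illustrative choices, and smoothing is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df_te = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Guangzhou", "Beijing",
             "Shenzhen", "Shanghai", "Guangzhou", "Beijing"],
    "target": [0, 1, 1, 0, 1, 1, 0, 1],
})

oof = pd.Series(np.nan, index=df_te.index, dtype=float)
kf = KFold(n_splits=4, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(df_te):
    fit_part = df_te.iloc[fit_idx]
    # Category means computed without the rows that are being encoded.
    means = fit_part.groupby("city")["target"].mean()
    encoded = df_te["city"].iloc[enc_idx].map(means)
    # Categories unseen in the fitting folds fall back to the fitting-fold mean.
    fallback = fit_part["target"].mean()
    oof.iloc[enc_idx] = encoded.fillna(fallback).to_numpy()

df_te["city_target_oof"] = oof
print(df_te)
```

At inference time the encoder would be refit on the full training set and applied to new data, since new rows have no labels to leak.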

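Feature Crossing

Feature crossing combines multiple features to capture interaction effects, which helps linear models handle non-linear relationships.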

# Polynomial feature crossing (pairwise products only, no squared terms)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features = df[['age', 'income', 'visits']]
crossed = poly.fit_transform(features)
print(poly.get_feature_names_out())
# ['age', 'income', 'visits', 'age income', 'age visits', 'income visits']

# Hand-crafted feature crosses
df['income_per_age'] = df['income'] / df['age']
df['visit_income_ratio'] = df['visits'] / (df['income'] / 10000)

# Categorical feature cross
df['city_education'] = df['city'] + '_' + df['education']

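Feature Selection

Correlation-Based Filtering

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation coefficients between numeric columns
corr_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

# Drop one feature from each highly correlated pair (|r| > 0.95)
def remove_highly_correlated(df, threshold=0.95):
    # numeric_only=True skips string and categorical columns, which corr() cannot handle
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so that each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return df.drop(columns=to_drop)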

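Mutual-Information-Based Selection

from sklearn.feature_selection import mutual_info_classif, SelectKBest

# X (a feature DataFrame) and y (class labels) are assumed to be defined already.
# Mutual information captures arbitrary dependencies, not only linear ones.
# k must not exceed the number of columns in X.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Inspect the mutual information score of each feature
mi_scores = mutual_info_classif(X, y)
feature_importance = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)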
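The snippet above assumes `X` and `y` already exist. For a version that runs end to end, the sketch below builds a synthetic classification dataset with `make_classification` and selects the top three features by mutual information. The dataset shape, the column names `f0`..`f7`, and the fixed `random_state` are illustrative assumptions.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: with shuffle=False the 3 informative features come first,
# followed by 5 pure-noise features.
X_arr, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# mutual_info_classif uses a randomized estimator, so fix its seed.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=3)
X_selected = selector.fit_transform(X, y)

selected = X.columns[selector.get_support()]
mi_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(mi_scores)
print("selected:", list(selected))
```

Reading the scores from `selector.scores_` avoids computing the mutual information a second time.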

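Model-Based Feature Importance

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# X_train, y_train, X_test, y_test are assumed to be defined already.
# Built-in (impurity-based) feature importance of tree models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importance = pd.Series(rf.feature_importances_, index=X_train.columns)

# Permutation importance (more reliable: measured on held-out data,
# and not biased toward high-cardinality features)
perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10)
perm_imp = pd.Series(perm_importance.importances_mean, index=X_test.columns)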
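As a runnable counterpart to the snippet above, the following sketch trains a small random forest on synthetic data and places the impurity-based and permutation importances side by side. The dataset, the split ratio, and the reduced `n_estimators` and `n_repeats` values are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features (f0..f2) followed by 3 noise features.
X_arr, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance comes from the training data and sums to 1.
impurity_imp = pd.Series(rf.feature_importances_, index=X.columns)

# Permutation importance: drop in test score when one column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)
perm_imp = pd.Series(perm.importances_mean, index=X.columns)

comparison = pd.DataFrame({"impurity": impurity_imp, "permutation": perm_imp})
print(comparison.sort_values("permutation", ascending=False))
```

Noise features can still receive non-zero impurity importance, while their permutation importance tends to sit near zero, which is one reason the latter is often preferred for selection decisions.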

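Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFECV

# Reuses rf, X_train, and y_train from the previous example.
# Removes one feature per step and picks the feature count with the best CV score.
rfecv = RFECV(estimator=rf, step=1, cv=5, scoring='accuracy', min_features_to_select=5)
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
selected_features = X_train.columns[rfecv.support_]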
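For a self-contained run of the same idea, the sketch below applies `RFECV` to synthetic data. It swaps the random forest for a `LogisticRegression` estimator purely to keep the example fast (RFECV only needs an estimator that exposes `coef_` or `feature_importances_`); the dataset, the fold count, and `min_features_to_select=2` are likewise illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 3 informative features followed by 5 noise features.
X_arr, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                      # drop one feature per elimination round
    cv=StratifiedKFold(3),
    scoring="accuracy",
    min_features_to_select=2,
)
rfecv.fit(X, y)

selected = X.columns[rfecv.support_]
print(f"Optimal number of features: {rfecv.n_features_}")
print("Selected:", list(selected))
print("Ranking (1 = kept):", dict(zip(X.columns, rfecv.ranking_)))
```

Because every elimination round refits the estimator in every fold, the cost grows with both the feature count and the estimator's training time, so a cheap estimator or a larger `step` helps on wide datasets.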

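Feature Store

A feature store is infrastructure for managing the full lifecycle of features. It addresses feature reuse, consistency, and real-time serving.

Popular options include Feast, Tecton, and Hopsworks. Core capabilities include feature registration, versioning, offline/online consistency, and feature monitoring.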

# Feast feature store example
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")

# Fetch training data (entity_df is assumed to hold entity keys and event timestamps)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:age",
        "user_features:total_purchases",
        "item_features:category",
        "item_features:price",
    ],
).to_df()

# Online serving
online_features = store.get_online_features(
    features=["user_features:age", "user_features:total_purchases"],
    entity_rows=[{"user_id": 12345}],
).to_dict()
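Historical retrieval in a feature store relies on a point-in-time join: for each training row, only feature values known at or before that row's timestamp are attached, which keeps offline training data consistent with what online serving would have seen. The pandas sketch below illustrates the concept with `merge_asof`; it is not Feast's implementation, and the tables, column names, and dates are invented for illustration.

```python
import pandas as pd

# Hypothetical feature log: a feature value and the time it became known.
feature_log = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2025-01-01", "2025-01-10", "2025-01-05"]),
    "total_purchases": [3, 7, 1],
})

# Hypothetical training entities: the time at which each label was observed.
entity_df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2025-01-05", "2025-01-15", "2025-01-03"]),
})

# For every entity row, take the latest feature row at or before its timestamp.
training_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_log.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
    direction="backward",
)
print(training_df)
```

User 2's only feature value is recorded after the label time, so that row gets a missing value instead of a value from the future; a plain join on `user_id` would have leaked it.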

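Summary

Feature engineering is often among the highest-return parts of a machine learning project. The core principles are: let business understanding guide feature construction, choose encodings that match the model type, use feature selection to limit the curse of dimensionality, and use a feature store for feature reuse and consistency.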

Contributors

withesse

Changelog

2026/3/14 13:09


9f6c2-feat: organize wiki content and refresh site setup on 2026/3/14