DIN

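Scenario: CTR prediction in the fine-ranking stage
Dataset: Amazon-Electronics
DIN (Deep Interest Network) is a classic recommendation model proposed by Alibaba. Its core contribution is fixing a weakness of conventional CTR models (e.g., DeepFM): they cannot capture user interests that change with the candidate item.
- DeepFM: treats all of a user's past behaviors (e.g., clicked items) as "static features" pooled into one fixed-length vector, so it cannot tell which past behaviors are relevant to the item currently being recommended.
- DIN: targets the diversity of user interests by adding an attention mechanism that aggregates past behaviors with weights. Behaviors related to the candidate item get high weights and unrelated ones get low weights, so the model captures the interest that matters for this particular candidate.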

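User behavior history: clicked "basketball shoes", "dress", "basketball", "lipstick";
Candidate item "basketball" → DIN gives "basketball shoes" and "basketball" high weights, and "dress" and "lipstick" low weights;
Candidate item "lipstick" → DIN gives "dress" and "lipstick" high weights, and "basketball shoes" and "basketball" low weights.

This is the core of DIN: attention-based interest activation.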
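
The example above can be sketched as a toy script in plain Python. The 2-d embeddings are made up by hand (first coordinate roughly "sports", second roughly "fashion/beauty"), and a dot product stands in for DIN's learned activation unit. Both are assumptions for illustration, not what the model actually learns.

```python
# Hand-made 2-d embeddings: axis 0 ~ "sports", axis 1 ~ "fashion/beauty" (illustrative only).
EMB = {
    "basketball shoes": [0.9, 0.1],
    "dress": [0.1, 0.9],
    "basketball": [1.0, 0.0],
    "lipstick": [0.0, 1.0],
}
HISTORY = ["basketball shoes", "dress", "basketball", "lipstick"]


def relevance(history_item, candidate):
    """Dot product as a stand-in for DIN's learned activation unit."""
    return sum(a * b for a, b in zip(EMB[history_item], EMB[candidate]))


def activation_weights(candidate):
    """One weight per history item, recomputed for every candidate."""
    return {item: relevance(item, candidate) for item in HISTORY}


for candidate in ("basketball", "lipstick"):
    print(candidate, activation_weights(candidate))
```

The same four history items receive different weights depending on the candidate, which is exactly what a fixed pooled history vector cannot express.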

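Input features → embedding layer → per-module processing:
          ├─ User features (gender, age, ...) → embedding vectors
          ├─ Item features (current candidate item) → embedding vector (denoted v_target)
          ├─ User behavior sequence (list of clicked items) → embedding sequence (denoted {v_1, v_2, ..., v_n})
              → attention layer (the core of DIN): compute a relevance weight between each v_i and v_target → weighted sum gives the user interest vector
          └─ Context features (time, scene, ...) → embedding vectors
→ concatenate all feature vectors → deep DNN → Sigmoid → predicted CTR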

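How the algorithm works

1. Embedding layer

Discrete features are mapped to low-dimensional dense vectors by the embedding layer.
Notation:
- $v_{target}$ is the embedding vector of the candidate item (dimension d)
- $v_i$ is the embedding vector of the item in the user's i-th past behavior (dimension d)
- $V_{hist} = [v_1,v_2,...,v_n]$ is the behavior-history sequence, with shape [n, d], where n is the number of past behaviors.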
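
A minimal sketch of the lookup, in plain Python so the shapes are easy to inspect. The vocabulary size, dimension, and ids are hypothetical, and random numbers stand in for learned parameters.

```python
import random

random.seed(0)
VOCAB_SIZE, D = 10, 4  # hypothetical vocabulary size and embedding dimension

# One trainable row per item id; random values stand in for learned parameters.
embedding_table = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(VOCAB_SIZE)]


def embed(item_id):
    """Look up the d-dimensional row for a single item id."""
    return embedding_table[item_id]


history_ids = [3, 7, 2]  # n = 3 past clicks
target_id = 7            # candidate item

V_hist = [embed(i) for i in history_ids]  # shape [n, d]
v_target = embed(target_id)               # shape [d]

print(len(V_hist), len(V_hist[0]), len(v_target))  # 3 4 4
```

Because history items and the candidate share one table here, the second history row is the very same vector as `v_target`. This mirrors the `shared_with` argument used in the library code later in this document.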

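2. Interest activation layer

The key idea is to compute the weight of each past behavior dynamically, instead of using plain average or sum pooling.

How the attention weights are computed
The DIN attention weight $w_i$ is not a plain inner product of two vectors. It is computed by a small neural network, which can capture more complex relevance patterns:

$w_i = f(v_i, v_{target}, v_i \odot v_{target})$

Here $\odot$ is the Hadamard (element-wise) product and $f(\cdot)$ is a small DNN, typically two fully connected layers. Its input is the concatenation of the three terms above (dimension 3d) and its output is a single scalar, the weight $w_i$. The implementation later in this document also concatenates the difference $v_{target} - v_i$, so its input dimension is 4d.
The weights are not normalized. The DIN paper deliberately drops the softmax over the activation-unit outputs, so the weights need not sum to 1 and their total magnitude reflects how strongly the history is activated by this candidate. A sigmoid would not make the weights sum to 1 either; only a softmax does, and the implementation below keeps it optional through `use_softmax` (off by default).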
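
The formula for $w_i$ can be sketched in plain Python as follows. The layer sizes are hypothetical, random numbers stand in for trained weights, and ReLU is used only to keep the sketch short (the implementation later in this document uses Dice). The output is left unnormalized, matching the default `use_softmax=False` in that implementation.

```python
import random

random.seed(0)
D, HIDDEN = 4, 8  # hypothetical embedding and hidden sizes


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def rand_matrix(rows, cols):
    """Random matrix standing in for trained parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Two fully connected layers: 3d -> HIDDEN -> 1.
W1, b1 = rand_matrix(HIDDEN, 3 * D), [0.0] * HIDDEN
W2, b2 = rand_matrix(1, HIDDEN), [0.0]


def activation_input(v_i, v_target):
    """Concatenate [v_i, v_target, v_i * v_target] into a 3d-dimensional vector."""
    hadamard = [a * b for a, b in zip(v_i, v_target)]
    return v_i + v_target + hadamard


def activation_weight(v_i, v_target):
    """Scalar weight f(v_i, v_target, v_i * v_target), left unnormalized."""
    hidden = [max(0.0, h) for h in linear(activation_input(v_i, v_target), W1, b1)]  # ReLU
    return linear(hidden, W2, b2)[0]


v_target = [0.2, -0.1, 0.4, 0.0]
history = [[0.1, 0.0, 0.3, 0.2], [-0.3, 0.5, 0.0, 0.1]]
weights = [activation_weight(v_i, v_target) for v_i in history]
print(weights)
```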

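Aggregating the user interest vector
Once each past behavior has a weight, the behavior embeddings are combined by a weighted sum to obtain the user's activated interest vector:

$V_{interest} = \sum_{i=1}^n w_i \cdot v_i$

Conventional models effectively use average pooling, i.e., $w_i = \frac{1}{n}$, so every past behavior gets the same weight regardless of the candidate item.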
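
A short sketch of the pooling step, showing that average pooling is just the special case of uniform weights. The embeddings and the activation weights are made-up numbers.

```python
def weighted_sum_pooling(history, weights):
    """V_interest = sum_i w_i * v_i over the n history embeddings."""
    dim = len(history[0])
    return [sum(w * v[k] for w, v in zip(weights, history)) for k in range(dim)]


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 made-up embeddings, d = 2

# Average pooling is the special case w_i = 1/n.
uniform = [1.0 / len(history)] * len(history)
avg = weighted_sum_pooling(history, uniform)

# Made-up activation weights that favour the first and third behaviours.
din = weighted_sum_pooling(history, [0.9, 0.1, 0.8])

print(avg)
print(din)
```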

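3. Deep DNN and output layer

Feature concatenation: concatenate $V_{interest}$ (the activated user interest), $v_{target}$ (the candidate item), and the embeddings of the static user features and context features to form the final input vector;
Deep DNN: a multi-layer fully connected network learns high-order feature interactions;
Output layer: a Sigmoid maps the score to the range 0 to 1, giving the predicted CTR.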
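
The three steps above can be sketched as follows. To stay short, a single linear layer stands in for the multi-layer DNN, and all embeddings and weights are made-up numbers; treat it as an illustration of the data flow only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_ctr(v_interest, v_target, v_user, v_context, weights, bias):
    """Concatenate all embeddings, apply one linear layer, squash to (0, 1)."""
    x = v_interest + v_target + v_user + v_context  # list concatenation
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(logit)


# Made-up 2-d embeddings and weights, for illustration only.
ctr = predict_ctr([1.7, 0.9], [1.0, 0.0], [0.3, 0.3], [0.0, 1.0],
                  weights=[0.5, -0.2, 0.4, 0.1, 0.0, 0.3, -0.1, 0.2], bias=-1.0)
print(round(ctr, 4))
```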

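4. Key optimizations in DIN

The DIN paper also proposes two training techniques that further improve results:
- Mini-batch Aware Regularization: applies L2 regularization only to the embedding parameters of features that actually appear in the current mini-batch, which keeps regularization tractable for very large sparse embedding tables;
- Dice activation function: a data-adaptive generalization of PReLU whose rectification point shifts with the mean and variance of the inputs in each mini-batch.
  Both are training details. The core idea remains the attention mechanism, so newcomers can focus on the attention layer first.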
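
For readers who want to look past the attention layer, here is a minimal plain-Python sketch of the two training techniques from the DIN paper: the Dice activation (the one selected by `activation="dice"` in the implementation later in this document) and a mini-batch aware L2 penalty. The function names, the fixed `alpha`, and the toy numbers are assumptions for illustration; in practice `alpha` is learned and Dice uses running statistics at inference time.

```python
import math


def dice(batch, alpha=0.25, eps=1e-8):
    """Dice over a mini-batch of scalar pre-activations.

    p(s) = sigmoid((s - mean) / sqrt(var + eps)); output = p*s + (1 - p)*alpha*s.
    alpha is learned in practice; 0.25 here is an arbitrary stand-in.
    """
    mean = sum(batch) / len(batch)
    var = sum((s - mean) ** 2 for s in batch) / len(batch)
    out = []
    for s in batch:
        p = 1.0 / (1.0 + math.exp(-(s - mean) / math.sqrt(var + eps)))
        out.append(p * s + (1.0 - p) * alpha * s)
    return out


def mini_batch_l2(embedding_table, batch_ids, global_counts):
    """L2 penalty restricted to embedding rows whose ids occur in this mini-batch.

    Row j is scaled by 1 / n_j, where n_j is how often id j occurs in the
    whole training set, so frequent ids are regularized less per batch.
    """
    penalty = 0.0
    for j in set(batch_ids):
        penalty += sum(w * w for w in embedding_table[j]) / global_counts[j]
    return penalty


print(dice([-2.0, 0.0, 2.0]))

table = {0: [1.0, 1.0], 1: [2.0, 0.0], 2: [3.0, 4.0]}  # made-up embeddings
counts = {0: 10, 1: 2, 2: 5}                            # made-up id frequencies
print(mini_batch_l2(table, batch_ids=[1, 1, 0], global_counts=counts))
```

Row 2 never appears in the batch, so it contributes nothing to the penalty. That is what makes the regularizer affordable when the embedding table has very many rows.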

Code implementation

Using the library directly

# Check the torch installation and whether a GPU is available
import torch
print(torch.__version__, torch.cuda.is_available())

import torch_rechub
import pandas as pd
import numpy as np
import tqdm
import sklearn

torch.manual_seed(2026)  # fix the random seed for reproducibility

# Load the sample data file
file_path = '../examples/ranking/data/amazon-electronics/amazon_electronics_sample.csv'
data = pd.read_csv(file_path)
# data

from torch_rechub.utils.data import create_seq_features
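# Build the user's behavior-history sequence features. The built-in create_seq_features only needs the data
# and the columns to turn into sequences; drop_short controls whether users with short sequences are dropped
train, val, test = create_seq_features(data, seq_feature_col=['item_id', 'cate_id'], drop_short=0)
# Inspect the generated sequences: this example builds a clicked-item history and a clicked-category history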
# train

from torch_rechub.basic.features import DenseFeature, SparseFeature, SequenceFeature

n_users, n_items, n_cates = data["user_id"].max(), data["item_id"].max(), data["cate_id"].max()
# Declare how each column is handled. A SparseFeature goes through an embedding layer, so it needs the vocabulary size and the embedding dimension
features = [SparseFeature("target_item", vocab_size=n_items + 2, embed_dim=64),
            SparseFeature("target_cate", vocab_size=n_cates + 2, embed_dim=64),
            SparseFeature("user_id", vocab_size=n_users + 2, embed_dim=64)]
target_features = features
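# Sequence features are handled like categorical ones, but the item sequence and the candidate item live in the same id space,
# so we want the model to share their embeddings; the shared_with argument specifies that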
history_features = [
    SequenceFeature("history_item", vocab_size=n_items + 2, embed_dim=64, pooling="concat", shared_with="target_item"),
    SequenceFeature("history_cate", vocab_size=n_cates + 2, embed_dim=64, pooling="concat", shared_with="target_cate")
]

from torch_rechub.utils.data import df_to_dict, DataGenerator
# Convert the data to dict form, then split off the label to get the model inputs
train = df_to_dict(train)
val = df_to_dict(val)
test = df_to_dict(test)

train_y, val_y, test_y = train["label"], val["label"], test["label"]

del train["label"]
del val["label"]
del test["label"]
train_x, val_x, test_x = train, val, test


# Build the dataloaders: this sets how the model reads data, separates the validation and test sets, and fixes the batch size
dg = DataGenerator(train_x, train_y)
train_dataloader, val_dataloader, test_dataloader = dg.generate_dataloader(x_val=val_x, y_val=val_y, x_test=test_x, y_test=test_y, batch_size=16)

# Take a final look at the format of the data fed to the model
# train_x

from torch_rechub.models.ranking import DIN
from torch_rechub.trainers import CTRTrainer

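# Define the model. It takes the feature objects declared above to build its input layer;
# mlp_params sets the structure of the final DNN and attention_mlp_params sets the structure of the attention layer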
model = DIN(features=features, history_features=history_features, target_features=target_features, mlp_params={"dims": [256, 128]}, attention_mlp_params={"dims": [256, 128]})

# Train the model. Besides the usual settings (learning rate, device, ...), the trainer supports early stopping to catch overfitting in time
ctr_trainer = CTRTrainer(model, optimizer_params={"lr": 1e-3, "weight_decay": 1e-3}, n_epoch=3, earlystop_patience=4, device='cpu', model_path='./')
ctr_trainer.fit(train_dataloader, val_dataloader)

# Evaluate on the test set
auc = ctr_trainer.evaluate(ctr_trainer.model, test_dataloader)
print(f'test auc: {auc}')

Custom model

import torch
import torch.nn as nn
from torch_rechub.basic.layers import MLP, EmbeddingLayer


class DIN(nn.Module):
    def __init__(self, features, history_features, target_features, mlp_params, attention_mlp_params):
        super().__init__()
        self.features = features
        self.history_features = history_features
        self.target_features = target_features
        self.num_history_features = len(history_features)
        self.all_dims = sum([fea.embed_dim for fea in features + history_features + target_features])

        self.embedding = EmbeddingLayer(features + history_features + target_features)
        self.attention_layers = nn.ModuleList([ActivationUnit(fea.embed_dim, **attention_mlp_params) for fea in self.history_features])
        self.mlp = MLP(self.all_dims, activation="dice", **mlp_params)

    def forward(self, x):
        # (batch_size, num_features, emb_dim)
        embed_x_features = self.embedding(x, self.features)
        # (batch_size, num_history_features, seq_length, emb_dim)
        embed_x_history = self.embedding(x, self.history_features)
        # (batch_size, num_target_features, emb_dim)
        embed_x_target = self.embedding(x, self.target_features)
        attention_pooling = []
        for i in range(self.num_history_features):
            attention_seq = self.attention_layers[i](embed_x_history[:, i, :, :], embed_x_target[:, i, :])
            attention_pooling.append(attention_seq.unsqueeze(1))  # (batch_size, 1, emb_dim)
        # (batch_size, num_history_features, emb_dim)
        attention_pooling = torch.cat(attention_pooling, dim=1)

        mlp_in = torch.cat([attention_pooling.flatten(start_dim=1), embed_x_target.flatten(start_dim=1), embed_x_features.flatten(start_dim=1)], dim=1)  # (batch_size, N)

        y = self.mlp(mlp_in)
        return torch.sigmoid(y.squeeze(1))


class ActivationUnit(nn.Module):
    def __init__(self, emb_dim, dims=None, activation="dice", use_softmax=False):
        super(ActivationUnit, self).__init__()
        if dims is None:
            dims = [36]
        self.emb_dim = emb_dim
        self.use_softmax = use_softmax
        self.attention = MLP(4 * self.emb_dim, dims=dims, activation=activation)

    def forward(self, history, target):
        seq_length = history.size(1)
        # (batch_size,seq_length,emb_dim)
        target = target.unsqueeze(1).expand(-1, seq_length, -1)
        att_input = torch.cat([target, history, target - history, target * history], dim=-1)  # batch_size,seq_length,4*emb_dim
        # (batch_size*seq_length,4*emb_dim)
        att_weight = self.attention(att_input.view(-1, 4 * self.emb_dim))
        # (batch_size*seq_length, 1) -> (batch_size,seq_length)
        att_weight = att_weight.view(-1, seq_length)
        if self.use_softmax:
            att_weight = att_weight.softmax(dim=-1)

        # (batch_size, seq_length, 1) * (batch_size, seq_length, emb_dim)
        # (batch_size,emb_dim)
        output = (att_weight.unsqueeze(-1) * history).sum(dim=1)
        return output