天猫重复购买预测-03特征工程

1 工具导入

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import gc
from collections import Counter
import copy

import warnings
warnings.filterwarnings("ignore")
 
%matplotlib inline

2 数据读取

#读取数据集
test_data = pd.read_csv('./data/data_format1/test_format1.csv')
train_data = pd.read_csv('./data/data_format1/train_format1.csv')
user_info = pd.read_csv('./data/data_format1/user_info_format1.csv')
user_log = pd.read_csv('./data/data_format1/user_log_format1.csv')

数据资源查看

1	`train_data.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260864 entries, 0 to 260863
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   user_id      260864 non-null  int64
 1   merchant_id  260864 non-null  int64
 2   label        260864 non-null  int64
dtypes: int64(3)
memory usage: 6.0 MB

1	`test_data.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261477 entries, 0 to 261476
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      261477 non-null  int64  
 1   merchant_id  261477 non-null  int64  
 2   prob         0 non-null       float64
dtypes: float64(1), int64(2)
memory usage: 6.0 MB

1	`user_info.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    424170 non-null  int64  
 1   age_range  421953 non-null  float64
 2   gender     417734 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 9.7 MB

1	`user_log.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54925330 entries, 0 to 54925329
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   user_id      int64  
 1   item_id      int64  
 2   cat_id       int64  
 3   seller_id    int64  
 4   brand_id     float64
 5   time_stamp   int64  
 6   action_type  int64  
dtypes: float64(1), int64(6)
memory usage: 2.9 GB

数据资源非常大，甚至达到2.9GB，需要进行数据压缩

3 对数据进行内存压缩

（1）定义内存压缩方法

# reduce memory
def reduce_mem_usage(df, verbose=True):
    # 开始时的内存使用
    start_mem = df.memory_usage().sum() / 1024**2
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    
    # 遍历每一列
    for col in df.columns:
        # 获取列的数据类型
        col_type = df[col].dtypes
        # 如果列的数据类型在数值类型中
        if col_type in numerics:
            # 获取列的最小值和最大值
            c_min = df[col].min()
            c_max = df[col].max()
            # 如果数据类型为int
            if str(col_type)[:3] == 'int':
                # 如果最小值大于int8的最小值，最大值小于int8的最大值
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    # 将列的数据类型转换为int8
                    df[col] = df[col].astype(np.int8)
                # 如果最小值大于int16的最小值，最大值小于int16的最大值
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    # 将列的数据类型转换为int16
                    df[col] = df[col].astype(np.int16)
                # 如果最小值大于int32的最小值，最大值小于int32的最大值
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    # 将列的数据类型转换为int32
                    df[col] = df[col].astype(np.int32)
                # 如果最小值大于int64的最小值，最大值小于int64的最大值
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    # 将列的数据类型转换为int64
                    df[col] = df[col].astype(np.int64)
            # 如果数据类型为float
            else:
                # 如果最小值大于float16的最小值，最大值小于float16的最大值
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    # 将列的数据类型转换为float16
                    df[col] = df[col].astype(np.float16)
                # 如果最小值大于float32的最小值，最大值小于float32的最大值
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    # 将列的数据类型转换为float32
                    df[col] = df[col].astype(np.float32)
                # 如果最小值大于float64的最小值，最大值小于float64的最大值
                else:
                    df[col] = df[col].astype(np.float64)
                    
    # 结束时的内存使用
    end_mem = df.memory_usage().sum() / 1024**2
    # 打印优化后的内存使用
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    # 打印优化率
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    # 返回优化后的DataFrame
    return df

代码解释

用于优化数据帧（dataframe）的内存使用，主要目的是将数据帧中的整数和浮点数类型转换为更小的数据类型，从而减少内存占用。

以下是代码的详细解释：

定义一个名为reduce_mem_usage的函数，接受一个数据帧df和一个布尔值verbose作为参数。
计算原始数据帧的内存使用情况，并将结果除以1024^2以转换为MB。
定义一个包含整数和浮点数类型的列表numerics。
遍历数据帧的列（for col in df.columns:）。
获取当前列的数据类型（col_type = df[col].dtypes）。
如果当前列的数据类型在numerics列表中（if col_type in numerics:），则进行以下操作：

a. 计算当前列的最小值和最大值（c_min = df[col].min()和c_max = df[col].max()）。

b. 如果当前列的数据类型是整数类型（if str(col_type)[:3] == 'int'），则检查当前列的最小值和最大值是否在整数类型的范围中。如果是，则将当前列转换为更小的整数类型（int8、int16、int32或int64）。

c. 如果当前列的数据类型是浮点数类型，则检查当前列的最小值和最大值是否在浮点数类型的范围中。如果是，则将当前列转换为更小的浮点数类型（float16、float32或64）。
计算优化后的数据帧的内存使用情况，并将结果除以1024^2以转换为MB。
打印优化后的数据帧的内存使用情况，以及与原始数据帧的内存使用情况的百分比差异。
返回优化后的数据帧。

（2）对数据进行内存压缩

1
2
3

# 数据读取函数
def read_csv(file_name, num_rows):
    return pd.read_csv(file_name, nrows=num_rows)

num_rows = None
num_rows = 200 * 10000 

train_file = './data/data_format1/train_format1.csv'
test_file = './data/data_format1/test_format1.csv'

user_info_file = './data/data_format1/user_info_format1.csv'
user_log_file = './data/data_format1/user_log_format1.csv'

train_data = reduce_mem_usage(read_csv(train_file, num_rows))
test_data = reduce_mem_usage(read_csv(test_file, num_rows))

user_info = reduce_mem_usage(read_csv(user_info_file, num_rows))
user_log = reduce_mem_usage(read_csv(user_log_file, num_rows))

Memory usage after optimization is: 1.74 MB
Decreased by 70.8%
Memory usage after optimization is: 3.49 MB
Decreased by 41.7%
Memory usage after optimization is: 3.24 MB
Decreased by 66.7%
Memory usage after optimization is: 32.43 MB
Decreased by 69.6%

（3）查看压缩后的数据信息

1	`train_data.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260864 entries, 0 to 260863
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   user_id      260864 non-null  int32
 1   merchant_id  260864 non-null  int16
 2   label        260864 non-null  int8 
dtypes: int16(1), int32(1), int8(1)
memory usage: 1.7 MB

1	`test_data.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261477 entries, 0 to 261476
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      261477 non-null  int32  
 1   merchant_id  261477 non-null  int16  
 2   prob         0 non-null       float64
dtypes: float64(1), int16(1), int32(1)
memory usage: 3.5 MB

1	`user_info.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    424170 non-null  int32  
 1   age_range  421953 non-null  float16
 2   gender     417734 non-null  float16
dtypes: float16(2), int32(1)
memory usage: 3.2 MB

1	`user_log.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   user_id      int32  
 1   item_id      int32  
 2   cat_id       int16  
 3   seller_id    int16  
 4   brand_id     float16
 5   time_stamp   int16  
 6   action_type  int8   
dtypes: float16(1), int16(3), int32(2), int8(1)
memory usage: 32.4 MB

4 数据处理

4.1 合并用户信息

# 合并训练集、测试集、用户信息表
del test_data['prob']
all_data = train_data.append(test_data)
all_data = all_data.merge(user_info,on=['user_id'],how='left')
del train_data, test_data, user_info
gc.collect()

1	`all_data.head()`

	user_id	merchant_id	label	age_range
0	34176	3906	0.0	6.0
1	34176	121	0.0	6.0
2	34176	4356	1.0	6.0
3	34176	2217	0.0	6.0
4	230784	4818	0.0	0.0

代码解释

这段代码的主要目的是合并训练集、测试集和用户信息表，并将它们合并为一个新的数据集all_data。

首先，代码删除了测试集中名为’prob’的列，因为它在训练集中不存在。
然后，将训练集和测试集合并为一个新的数据集all_data。这里使用了append()方法将测试集添加到训练集的末尾。。
接着，使用merge()方法将用户信息表和合并后的数据集all_data合并，使用’on’参数指定连接的列，使用’how’参数指定连接方式为左连接（即保留左表中的所有行，即使右表中没有匹配的行）。
最后，删除了训练集、测试集和用户信息表，使用del关键字。同时，调用gc.collect()函数释放内存。

del关键字用于删除变量或对象。当在Python中使用del关键字时，它会从内存中删除变量或对象的引用，从而释放该对象所占用的内存空间。
gc.collect()函数是Python的垃圾回收器（Garbage Collector）的接口，用于手动触发垃圾回收操作。在某些情况下，Python可能会无法及时回收垃圾，这时可以使用gc.collect()函数手动触发垃圾回收，以释放被占用的内存空间。

4.2 用户行为日志信息按时间进行排序

"""
按时间排序
"""
user_log = user_log.sort_values(['user_id','time_stamp'])

1	`user_log.head()`

	user_id	item_id	cat_id	seller_id	brand_id	time_stamp
61975	16	980982	437	650	4276.0	914
61976	16	980982	437	650	4276.0	914
61977	16	980982	437	650	4276.0	914
61978	16	962763	19	650	4276.0	914
61979	16	391126	437	650	4276.0	914

代码解释

对user_id列和time_stamp列进行排序的具体步骤如下：

首先，使用sort_values()方法对user_id列进行升序排序。升序排序意味着从小到大排列，即按user_id的顺序排列。
然后，使用sort_values()方法对time_stamp列进行升序排序。升序排序意味着从小到大排列，即按时间戳的顺序排列。

因此，经过这两步排序后，user_log数据框中的数据将按照用户ID和时间戳的顺序进行排列。

4.3 对每个用户逐个合并所有的字段

合并字段为item_id, cat_id,seller_id,brand_id,time_stamp, action_type

"""
合并数据
"""
list_join_func = lambda x: " ".join([str(i) for i in x])


agg_dict = {
            'item_id' : list_join_func,	
            'cat_id' : list_join_func,
            'seller_id' : list_join_func,
            'brand_id' : list_join_func,
            'time_stamp' : list_join_func,
            'action_type' : list_join_func
        }

rename_dict = {
            'item_id' : 'item_path',
            'cat_id' : 'cat_path',
            'seller_id' : 'seller_path',
            'brand_id' : 'brand_path',
            'time_stamp' : 'time_stamp_path',
            'action_type' : 'action_type_path'
        }

# def merge_list(df_ID, join_columns, df_data, agg_dict, rename_dict):
#     # 对df_data按照join_columns进行分组，并使用agg_dict进行聚合操作，然后重命名列
#     df_data = df_data.\
#                 groupby(join_columns).\
#                 agg(agg_dict).\
#                 reset_index().\
#                 rename(columns=rename_dict)
    
#     # 将df_ID和df_data按照join_columns进行左连接
#     df_ID = df_ID.merge(df_data, on=join_columns, how="left") 
#     return df_data,df_ID
# all_data = merge_list(all_data, 'user_id', user_log, agg_dict, rename_dict)

1	`#all_data`

代码解释

这段代码的主要目的是合并数据，将多个列组合成一个字符串，并将结果合并到现有的数据框中。实现原理如下：

使用lambda函数定义一个名为list_join_func的函数，该函数接受一个列表参数，并将列表中的元素以空格分隔拼接成一个字符串。
定义一个名为agg_dict的字典，该字典包含要进行聚合操作的列及其对应的聚合函数。例如，'item_id' : list_join_func表示要将item_id列的值拼接成一个字符串。
定义一个名为rename_dict的字典，该字典包含要重命名的列。例如，'item_id' : 'item_path'表示要将item_id列重命名为item_path。
定义一个名为merge_list的函数，该函数接受四个参数：df_ID、join_columns、df_data和agg_dict。函数首先对df_data按照join_columns进行分组，并使用agg_dict进行聚合操作。然后，将结果重命名列，并返回重命名后的数据框。
在主程序中，首先对all_data和user_log按照user_id进行左连接。然后，将合并后的数据帧传递给merge_list函数，并将结果更新到all_data。

1 2	`user_log_path = user_log.groupby('user_id').agg(agg_dict).reset_index().rename(columns=rename_dict) user_log_path.head()`

	user_id	item_path	cat_path	seller_path	brand_path	time_stamp_path	action_type_path
0	16	980982 980982 980982 962763 391126 827174 6731...	437 437 437 19 437 437 437 437 895 19 437 437 ...	650 650 650 650 650 650 650 650 3948 650 650 6...	4276.0 4276.0 4276.0 4276.0 4276.0 4276.0 4276...	914 914 914 914 914 914 914 914 914 914 914 91...	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 ...
1	19	388018 388018 88673 88673 88673 88673 846066 5...	949 949 614 614 614 614 420 1401 948 948 513 1...	2772 2772 4066 4066 4066 4066 4951 4951 2872 2...	2112.0 2112.0 1552.0 1552.0 1552.0 1552.0 5200...	710 710 711 711 711 711 908 908 1105 1105 1105...	0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
2	41	60215 1004605 60215 60215 60215 60215 628525 5...	1308 1308 1308 1308 1308 1308 1271 656 656 656...	2128 3207 2128 2128 2128 2128 3142 4618 4618 4...	3848.0 3848.0 3848.0 3848.0 3848.0 3848.0 1014...	521 521 521 521 521 522 529 828 828 828 828 82...	0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 ...
3	56	889499 528459 765746 553259 889499 22435 40047...	662 1075 662 1577 662 11 184 1604 11 11 177 11...	4048 601 3104 3828 4048 4766 2419 2768 2565 26...	5360.0 1040.0 8240.0 1446.0 5360.0 4360.0 3428...	517 520 525 528 602 602 610 610 610 610 610 61...	3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 ...
4	155	979639 890128 981780 211366 211366 797946 4567...	267 1271 1505 267 267 1075 1075 407 407 1075 4...	2429 4785 3784 800 800 1595 1418 2662 2662 315...	2276.0 1422.0 5692.0 6328.0 6328.0 5800.0 7140...	529 529 602 604 604 607 607 607 607 607 607 60...	0 0 0 2 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 2 ...

1 2	`all_data_path = all_data.merge(user_log_path,on='user_id') all_data_path.head()`

	user_id	merchant_id	age_range	gender	item_path	cat_path	seller_path	brand_path	time_stamp_path	action_type_path
0	105600	1487	6.0	1.0	986160 681407 681407 910680 681407 592698 3693...	35 1554 1554 119 1554 662 1095 662 35 833 833 ...	4811 4811 4811 1897 4811 3315 2925 1340 1875 4...	127.0 127.0 127.0 4704.0 127.0 1605.0 6000.0 1...	518 518 518 520 520 524 524 524 525 525 525 52...	2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
1	110976	159	5.0	0.0	396970 961553 627712 926681 1012423 825576 149...	1023 420 407 1505 962 602 184 1606 351 1505 11...	1435 1648 223 3178 2418 1614 3004 2511 2285 78...	5504.0 7780.0 1751.0 7540.0 6652.0 8116.0 5328...	517 520 522 522 527 530 530 530 601 601 602 60...	2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 0 ...
2	374400	302	5.0	1.0	256546 202393 927572 2587 10956 549283 270303 ...	1188 646 1175 1188 1414 681 1175 681 681 115 1...	805 390 4252 3979 1228 2029 2029 2029 4252 923...	1842.0 5920.0 133.0 6304.0 7584.0 133.0 133.0 ...	517 604 604 604 607 609 609 609 609 615 621 62...	2 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
3	189312	1760	4.0	0.0	290583 166235 556025 217894 166235 556025 5589...	601 601 601 601 601 601 601 601 601 601 601 60...	3139 3139 3524 3139 3139 3524 3139 3139 3139 3...	549.0 549.0 549.0 549.0 549.0 549.0 549.0 549....	924 924 924 924 924 924 924 924 924 924 924 92...	0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
4	189312	2511	4.0	0.0	290583 166235 556025 217894 166235 556025 5589...	601 601 601 601 601 601 601 601 601 601 601 60...	3139 3139 3524 3139 3139 3524 3139 3139 3139 3...	549.0 549.0 549.0 549.0 549.0 549.0 549.0 549....	924 924 924 924 924 924 924 924 924 924 924 92...	0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...

4.4 删除数据并回收内存

"""
删除不需要的数据
"""
del user_log
gc.collect()

5 定义数据统计函数

5.1 定义统计函数

(1)定义统计数据总数的函数

def cnt_(x):
    try:
        return len(x.split(' '))
    except:
        return -1

(2)定义统计数据唯一值总数的函数

def nunique_(x):
    try:
        return len(set(x.split(' ')))
    except:
        return -1

(3)定义统计数据最大值的函数

def max_(x):
    try:
        return np.max([int(i) for i in x.split(' ')])
    except:
        return -1

(4)定义统计数据最小值的函数

def min_(x):
    try:
        return np.min([int(i) for i in x.split(' ')])
    except:
        return -1

(5)定义统计数据标准差的函数

def std_(x):
    try:
        return np.std([float(i) for i in x.split(' ')])
    except:
        return -1

(6)定义统计数据中topN数据的函数

def most_n(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][0]
    except:
        return -1

(7)定义统计数据中topN数据总数的函数

def most_n_cnt(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][1]
    except:
        return -1

5.2 调用定义的统计函数

调用数据集的特征统计函数

def user_cnt(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(cnt_)
    return df_data

def user_nunique(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(nunique_)
    return df_data
    
def user_max(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(max_)
    return df_data

def user_min(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(min_)
    return df_data
    
def user_std(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(std_)
    return df_data

def user_most_n(df_data, single_col, name, n=1):
    func = lambda x: most_n(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

def user_most_n_cnt(df_data, single_col, name, n=1):
    func = lambda x: most_n_cnt(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

6 提取统计特征

6.1 特征统计

(1)店铺特征统计

统计与店铺特点有关的特征，如店铺、商品、品牌等。

"""
    提取基本统计特征
"""
#all_data_test = all_data
all_data_test = all_data_path
# 统计用户 点击、浏览、加购、购买行为
# 总次数
all_data_test = user_cnt(all_data_test,  'seller_path', 'user_cnt')
# 不同店铺个数
all_data_test = user_nunique(all_data_test,  'seller_path', 'seller_nunique')
# 不同品类个数
all_data_test = user_nunique(all_data_test,  'cat_path', 'cat_nunique')
# 不同品牌个数
all_data_test = user_nunique(all_data_test,  'brand_path', 'brand_nunique')
# 不同商品个数
all_data_test = user_nunique(all_data_test,  'item_path', 'item_nunique')
# 活跃天数
all_data_test = user_nunique(all_data_test,  'time_stamp_path', 'time_stamp_nunique')
# 不用行为种数
all_data_test = user_nunique(all_data_test,  'action_type_path', 'action_type_nunique')
# ....

1	`all_data_test.head()`

	user_id	merchant_id	age_range	gender	item_path	cat_path	seller_path	brand_path	time_stamp_path	action_type_path	user_cnt	seller_nunique	cat_nunique	brand_nunique	item_nunique	time_stamp_nunique	action_type_nunique
0	105600	1487	6.0	1.0	986160 681407 681407 910680 681407 592698 3693...	35 1554 1554 119 1554 662 1095 662 35 833 833 ...	4811 4811 4811 1897 4811 3315 2925 1340 1875 4...	127.0 127.0 127.0 4704.0 127.0 1605.0 6000.0 1...	518 518 518 520 520 524 524 524 525 525 525 52...	2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...	310	96	37	88	217	29	2
1	110976	159	5.0	0.0	396970 961553 627712 926681 1012423 825576 149...	1023 420 407 1505 962 602 184 1606 351 1505 11...	1435 1648 223 3178 2418 1614 3004 2511 2285 78...	5504.0 7780.0 1751.0 7540.0 6652.0 8116.0 5328...	517 520 522 522 527 530 530 530 601 601 602 60...	2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 0 ...	274	181	70	159	233	52	3
2	374400	302	5.0	1.0	256546 202393 927572 2587 10956 549283 270303 ...	1188 646 1175 1188 1414 681 1175 681 681 115 1...	805 390 4252 3979 1228 2029 2029 2029 4252 923...	1842.0 5920.0 133.0 6304.0 7584.0 133.0 133.0 ...	517 604 604 604 607 609 609 609 609 615 621 62...	2 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...	278	57	59	62	148	35	3
3	189312	1760	4.0	0.0	290583 166235 556025 217894 166235 556025 5589...	601 601 601 601 601 601 601 601 601 601 601 60...	3139 3139 3524 3139 3139 3524 3139 3139 3139 3...	549.0 549.0 549.0 549.0 549.0 549.0 549.0 549....	924 924 924 924 924 924 924 924 924 924 924 92...	0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...	237	49	35	45	170	9	2
4	189312	2511	4.0	0.0	290583 166235 556025 217894 166235 556025 5589...	601 601 601 601 601 601 601 601 601 601 601 60...	3139 3139 3524 3139 3139 3524 3139 3139 3139 3...	549.0 549.0 549.0 549.0 549.0 549.0 549.0 549....	924 924 924 924 924 924 924 924 924 924 924 92...	0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...	237	49	35	45	170	9	2

# 最晚时间
all_data_test = user_max(all_data_test,  'action_type_path', 'time_stamp_max')
# 最早时间
all_data_test = user_min(all_data_test,  'action_type_path', 'time_stamp_min')
# 活跃天数方差
all_data_test = user_std(all_data_test,  'action_type_path', 'time_stamp_std')
# 最早和最晚相差天数
all_data_test['time_stamp_range'] = all_data_test['time_stamp_max'] - all_data_test['time_stamp_min']

# 用户最喜欢的店铺
all_data_test = user_most_n(all_data_test, 'seller_path', 'seller_most_1', n=1)
# 最喜欢的类目
all_data_test = user_most_n(all_data_test, 'cat_path', 'cat_most_1', n=1)
# 最喜欢的品牌
all_data_test = user_most_n(all_data_test, 'brand_path', 'brand_most_1', n=1)
# 最常见的行为动作
all_data_test = user_most_n(all_data_test, 'action_type_path', 'action_type_1', n=1)
# .....

# 用户最喜欢的店铺 行为次数
all_data_test = user_most_n_cnt(all_data_test, 'seller_path', 'seller_most_1_cnt', n=1)
# 最喜欢的类目 行为次数
all_data_test = user_most_n_cnt(all_data_test, 'cat_path', 'cat_most_1_cnt', n=1)
# 最喜欢的品牌 行为次数
all_data_test = user_most_n_cnt(all_data_test, 'brand_path', 'brand_most_1_cnt', n=1)
# 最常见的行为动作 行为次数
all_data_test = user_most_n_cnt(all_data_test, 'action_type_path', 'action_type_1_cnt', n=1)
# .....

(2)用户特征统计

对用户的点击、加购、购买、收藏等特征进行统计。

# 对点击、加购、购买、收藏 分开统计
"""
统计基本特征函数  
-- 知识点二
-- 根据不同行为的业务函数
-- 提取不同特征
"""
def col_cnt_(df_data, columns_list, action_type):
    # 定义一个名为col_cnt_的函数，参数分别为df_data（数据框），columns_list（列名列表），action_type（行为类型）
    try:
        # 定义一个名为data_dict的字典
        data_dict = {}

        # 将columns_list复制一份，命名为col_list
        col_list = copy.deepcopy(columns_list)
        # 如果action_type不为空，将action_type_path添加到col_list中
        if action_type != None:
            col_list += ['action_type_path']

        # 将df_data中的每一列按照空格分割，存入data_dict中
        for col in col_list:
            data_dict[col] = df_data[col].split(' ')

        # 获取data_dict中每一列的长度
        path_len = len(data_dict[col])

        # 定义一个名为data_out的空列表
        data_out = []
        # 遍历data_dict中每一列的长度
        for i_ in range(path_len):
            # 定义一个名为data_txt的空字符串
            data_txt = ''
            # 遍历columns_list中的每一列
            for col_ in columns_list:
                # 如果action_type_path中当前行的值为action_type，将data_dict中当前列的值添加到data_txt中
                if data_dict['action_type_path'][i_] == action_type:
                    data_txt += '_' + data_dict[col_][i_]
            # 将data_txt添加到data_out中
            data_out.append(data_txt)

        # 返回data_out的长度
        return len(data_out)  
    except:
        # 如果发生异常，返回-1
        return -1

def col_nuique_(df_data, columns_list, action_type):
    try:
        data_dict = {}

        col_list = copy.deepcopy(columns_list)
        if action_type != None:
            col_list += ['action_type_path']

        for col in col_list:
            data_dict[col] = df_data[col].split(' ')

        path_len = len(data_dict[col])

        data_out = []
        for i_ in range(path_len):
            data_txt = ''
            for col_ in columns_list:
                if data_dict['action_type_path'][i_] == action_type:
                    data_txt += '_' + data_dict[col_][i_]
            data_out.append(data_txt)

        return len(set(data_out))
    except:
        return -1
    

def user_col_cnt(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_cnt_(x, columns_list, action_type), axis=1)
    return df_data

def user_col_nunique(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_nuique_(x, columns_list, action_type), axis=1)
    return df_data

代码解释

这段代码用于对用户在购物网站上的点击、加购、购买和收藏行为进行统计。实现原理是使用DataFrame的apply方法对数据进行处理，根据不同的行为类型和业务逻辑对数据进行提取和统计。

功能：

col_cnt_函数：统计指定列（columns_list）中每个路径的重复次数。
col_nuique_函数：统计指定列（columns_list）中每个路径去重后的次数。
user_col_cnt函数：将统计结果存储在DataFrame的指定列（name）中。
user_col_nunique函数：将统计结果存储在DataFrame的指定列（name）中。

(3)统计用户和店铺的关系

统计店铺被用户点击次数，加购次数，购买次数，收藏次数

# 点击次数
all_data_test = user_col_cnt(all_data_test,  ['seller_path'], '0', 'user_cnt_0')
# 加购次数
all_data_test = user_col_cnt(all_data_test,  ['seller_path'], '1', 'user_cnt_1')
# 购买次数
all_data_test = user_col_cnt(all_data_test,  ['seller_path'], '2', 'user_cnt_2')
# 收藏次数
all_data_test = user_col_cnt(all_data_test,  ['seller_path'], '3', 'user_cnt_3')


# 不同店铺个数
all_data_test = user_col_nunique(all_data_test,  ['seller_path'], '0', 'seller_nunique_0')
# ....

1	`all_data_test`

	user_id	merchant_id	label	age_range	gender	item_path	cat_path	seller_path	brand_path	time_stamp_path	...	action_type_1	seller_most_1_cnt	cat_most_1_cnt	brand_most_1_cnt	action_type_1_cnt	user_cnt_0	user_cnt_1	user_cnt_2	user_cnt_3	seller_nunique_0
0	105600	1487	0.0	6.0	1.0	986160 681407 681407 910680 681407 592698 3693...	35 1554 1554 119 1554 662 1095 662 35 833 833 ...	4811 4811 4811 1897 4811 3315 2925 1340 1875 4...	127.0 127.0 127.0 4704.0 127.0 1605.0 6000.0 1...	518 518 518 520 520 524 524 524 525 525 525 52...	...	0	35	43	35	299	310	310	310	310	97
1	110976	159	0.0	5.0	0.0	396970 961553 627712 926681 1012423 825576 149...	1023 420 407 1505 962 602 184 1606 351 1505 11...	1435 1648 223 3178 2418 1614 3004 2511 2285 78...	5504.0 7780.0 1751.0 7540.0 6652.0 8116.0 5328...	517 520 522 522 527 530 530 530 601 601 602 60...	...	0	9	56	11	259	274	274	274	274	181
2	374400	302	0.0	5.0	1.0	256546 202393 927572 2587 10956 549283 270303 ...	1188 646 1175 1188 1414 681 1175 681 681 115 1...	805 390 4252 3979 1228 2029 2029 2029 4252 923...	1842.0 5920.0 133.0 6304.0 7584.0 133.0 133.0 ...	517 604 604 604 607 609 609 609 609 615 621 62...	...	0	93	29	48	241	278	278	278	278	56
3	189312	1760	0.0	4.0	0.0	290583 166235 556025 217894 166235 556025 5589...	601 601 601 601 601 601 601 601 601 601 601 60...	3139 3139 3524 3139 3139 3524 3139 3139 3139 3...	549.0 549.0 549.0 549.0 549.0 549.0 549.0 549....	924 924 924 924 924 924 924 924 924 924 924 92...	...	0	45	68	45	228	237	237	237	237	50
4	189312	2511	0.0	4.0	0.0	290583 166235 556025 217894 166235 556025 5589...	601 601 601 601 601 601 601 601 601 601 601 60...	3139 3139 3524 3139 3139 3524 3139 3139 3139 3...	549.0 549.0 549.0 549.0 549.0 549.0 549.0 549....	924 924 924 924 924 924 924 924 924 924 924 92...	...	0	45	68	45	228	237	237	237	237	50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
16859	120191	1899	NaN	4.0	0.0	793882 288225 288225 288225 195714 195714 1957...	387 35 35 35 1213 1213 1213 1213 1075 447 1213...	1146 696 696 696 1200 1200 1200 1200 2702 4279...	8064.0 3600.0 3600.0 3600.0 2276.0 2276.0 2276...	512 516 516 516 606 606 606 606 606 606 606 60...	...	0	11	14	11	69	96	96	96	96	29
16860	121727	4044	NaN	4.0	0.0	544029 562170 544029 562170 544029 544029 5621...	1505 662 1505 662 1505 1505 662 1505 1505 1505...	795 1910 795 1910 795 795 1910 795 795 795 411...	3608.0 950.0 3608.0 950.0 3608.0 3608.0 950.0 ...	628 628 628 628 628 628 628 628 628 628 710 71...	...	0	12	12	12	43	49	49	49	49	15
16861	385919	3912	NaN	0.0	0.0	187936 187936 657875 969054 462255 985073 1602...	602 602 1389 1505 1228 1604 1228 662 1228 662 ...	661 661 643 643 4738 643 3740 643 4738 643 643...	1484.0 1484.0 968.0 968.0 6220.0 968.0 4072.0 ...	512 512 513 513 524 524 524 524 524 524 526 60...	...	0	33	19	33	44	54	54	54	54	12
16862	215423	4356	NaN	5.0	0.0	885364 938282 966141 174392 885364 821661 3473...	1389 662 1095 1095 1389 662 1577 821 662 1389 ...	2602 2602 2602 2602 2602 2602 3128 2602 2602 2...	1900.0 1900.0 1900.0 1900.0 1900.0 1900.0 8392...	521 521 521 521 521 521 521 521 521 521 521 52...	...	0	63	58	63	118	152	152	152	152	25
16863	215423	1840	NaN	5.0	0.0	885364 938282 966141 174392 885364 821661 3473...	1389 662 1095 1095 1389 662 1577 821 662 1389 ...	2602 2602 2602 2602 2602 2602 3128 2602 2602 2...	1900.0 1900.0 1900.0 1900.0 1900.0 1900.0 8392...	521 521 521 521 521 521 521 521 521 521 521 52...	...	0	63	58	63	118	152	152	152	152	25

16864 rows × 35 columns

6.2 特征组合

（1）特征组合进行业务特征提取

# 点击次数
all_data_test = user_col_cnt(all_data_test,  ['seller_path', 'item_path'], '0', 'user_cnt_0')

# 不同店铺个数
all_data_test = user_col_nunique(all_data_test,  ['seller_path', 'item_path'], '0', 'seller_nunique_0')
# ....

（2）查看提取的特征

1	`all_data_test.columns`

Index(['user_id', 'merchant_id', 'label', 'age_range', 'gender', 'item_path',
       'cat_path', 'seller_path', 'brand_path', 'time_stamp_path',
       'action_type_path', 'user_cnt', 'seller_nunique', 'cat_nunique',
       'brand_nunique', 'item_nunique', 'time_stamp_nunique',
       'action_type_nunique', 'time_stamp_max', 'time_stamp_min',
       'time_stamp_std', 'time_stamp_range', 'seller_most_1', 'cat_most_1',
       'brand_most_1', 'action_type_1', 'seller_most_1_cnt', 'cat_most_1_cnt',
       'brand_most_1_cnt', 'action_type_1_cnt', 'user_cnt_0', 'user_cnt_1',
       'user_cnt_2', 'user_cnt_3', 'seller_nunique_0'],
      dtype='object')

1	`list(all_data_test.columns)`

['user_id',
 'merchant_id',
 'label',
 'age_range',
 'gender',
 'item_path',
 'cat_path',
 'seller_path',
 'brand_path',
 'time_stamp_path',
 'action_type_path',
 'user_cnt',
 'seller_nunique',
 'cat_nunique',
 'brand_nunique',
 'item_nunique',
 'time_stamp_nunique',
 'action_type_nunique',
 'time_stamp_max',
 'time_stamp_min',
 'time_stamp_std',
 'time_stamp_range',
 'seller_most_1',
 'cat_most_1',
 'brand_most_1',
 'action_type_1',
 'seller_most_1_cnt',
 'cat_most_1_cnt',
 'brand_most_1_cnt',
 'action_type_1_cnt',
 'user_cnt_0',
 'user_cnt_1',
 'user_cnt_2',
 'user_cnt_3',
 'seller_nunique_0']

7 利用Countvector，TF-IDF提取特征

(1)利用Countvector和TF-IDF提取特征，代码如下：

"""
-- 知识点四
-- 利用countvector，tfidf提取特征
"""
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from scipy import sparse
# cntVec = CountVectorizer(stop_words=ENGLISH_STOP_WORDS, ngram_range=(1, 1), max_features=100)
tfidfVec = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, ngram_range=(1, 1), max_features=100)


# columns_list = ['seller_path', 'cat_path', 'brand_path', 'action_type_path', 'item_path', 'time_stamp_path']
columns_list = ['seller_path']
for i, col in enumerate(columns_list):
    all_data_test[col] = all_data_test[col].astype(str)
    tfidfVec.fit(all_data_test[col])
    data_ = tfidfVec.transform(all_data_test[col])
    if i == 0:
        data_cat = data_
    else:
        data_cat = sparse.hstack((data_cat, data_))

代码解释

这段代码是针对文本数据的向量化处理。代码主要使用了TfidfVectorizer进行文本特征的提取。以下是对代码逐行的解释：

首先，导入必要的库和方法。CountVectorizer和TfidfVectorizer用于将文本数据转换为向量形式。ENGLISH_STOP_WORDS提供英文停用词列表，这些词通常在文本处理中被过滤掉，因为它们通常不含有相关信息。sparse来自scipy库，用于处理稀疏矩阵。
注释掉了一个创建CountVectorizer的实例的行，这意味着作者可能原本计划使用它，但最后选择了TfidfVectorizer。停用词被设为英语停用词，ngram_range为(1, 1)，表示只考虑单个词语（一元模型），max_features=100表示仅保留词频最高的100个词。
TfidfVectorizer同样创建了一个实例，配置和CountVectorizer几乎相同，但是TfidfVectorizer计算的是词的TF-IDF值，这是词频（TF）和逆文档频率（IDF）的乘积，能够更好地表示词的重要性。
columns_list定义了需要进行特征提取的列名称列表，在这段代码中，作者只选择了'seller_path'一列进行特征提取。
通过遍历columns_list，对每个列采取如下操作：
- 首先将列的数据类型转换为字符串类型。
- 使用tfidfVec对象的fit方法来“学习”这一列的词汇和IDF值。
- 使用transform方法将这列的文本转换为TF-IDF的稀疏矩阵形式。
如果这是第一个列（由i == 0检查），则将转换后得到的稀疏矩阵赋值给data_cat。如果不是第一个列，则使用scipy库中的sparse.hstack方法将新的稀疏矩阵与前面的矩阵横向拼接。
结果是data_cat变量现在包含了'seller_path'列经过TF-IDF转换后的特征矩阵。若columns_list中有更多列，则data_cat会依据上述逻辑横向拼接它们的TF-IDF矩阵。

这段代码的目的是生成用于后续机器学习模型训练的特征，这种特征提取方法在文本分类、情感分析等NLP任务中非常常见。

(2)特征重命名特征合并

1
2
3

df_tfidf = pd.DataFrame(data_cat.toarray())
df_tfidf.columns = ['tfidf_' + str(i) for i in df_tfidf.columns]
all_data_test = pd.concat([all_data_test, df_tfidf],axis=1)

8 嵌入(embeeding)特征

import gensim

# Train Word2Vec model

model = gensim.models.Word2Vec(all_data_test['seller_path'].apply(lambda x: x.split(' ')), vector_size=100, window=5, min_count=5, workers=4)
# model.save("product2vec.model")
# model = gensim.models.Word2Vec.load("product2vec.model")

def mean_w2v_(x, model, vector_size=100):
    try:
        i = 0
        for word in x.split(' '):
            if word in model.wv.vocab:
                i += 1
                if i == 1:
                    vec = np.zeros(vector_size)
                vec += model.wv[word]
        return vec / i 
    except:
        return  np.zeros(vector_size)


def get_mean_w2v(df_data, columns, model, vector_size):
    data_array = []
    for index, row in df_data.iterrows():
        w2v = mean_w2v_(row[columns], model, vector_size)
        data_array.append(w2v)
    return pd.DataFrame(data_array)

df_embeeding = get_mean_w2v(all_data_test, 'seller_path', model, 100)
df_embeeding.columns = ['embeeding_' + str(i) for i in df_embeeding.columns]

embeeding特征和原始特征合并

1	`all_data_test = pd.concat([all_data_test, df_embeeding],axis=1)`

9 stacking特征

"""
-- 知识点六
-- stacking特征
"""
# from sklearn.cross_validation import KFold
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor,ExtraTreesRegressor
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.svm import LinearSVC,SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss,mean_absolute_error,mean_squared_error
from sklearn.naive_bayes import MultinomialNB,GaussianNB

(1) stacking 回归特征

"""
-- 回归
-- stacking 回归特征
"""
def stacking_reg(clf,train_x,train_y,test_x,clf_name,kf,label_split=None):
    train=np.zeros((train_x.shape[0],1))
    test=np.zeros((test_x.shape[0],1))
    test_pre=np.empty((folds,test_x.shape[0],1))
    cv_scores=[]
    for i,(train_index,test_index) in enumerate(kf.split(train_x,label_split)):       
        tr_x=train_x[train_index]
        tr_y=train_y[train_index]
        te_x=train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf","ada","gb","et","lr"]:
            clf.fit(tr_x,tr_y)
            pre=clf.predict(te_x).reshape(-1,1)
            train[test_index]=pre
            test_pre[i,:]=clf.predict(test_x).reshape(-1,1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x, label=te_y, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12
                      }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')
                         ]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round,evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(test_matrix,ntree_limit=model.best_ntree_limit).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))

        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {
                      'boosting_type': 'gbdt',
                      'objective': 'regression_l2',
                      'metric': 'mse',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'learning_rate': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12,
                      'silent': True,
                      }
            num_round = 10000
            #early_stopping_rounds = 100
            #callbacks=[lightgbm.log_evaluation(period=100), lightgbm.early_stopping(stopping_rounds=100)]
            callbacks=[lightgbm.early_stopping(stopping_rounds=100)]
            if test_matrix:
                model = clf.train(params, train_matrix,num_round,valid_sets=test_matrix,
                                  #early_stopping_rounds=early_stopping_rounds
                                  callbacks=callbacks
                                  )
                pre= model.predict(te_x,num_iteration=model.best_iteration).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(test_x, num_iteration=model.best_iteration).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:"%clf_name,cv_scores)
    test[:]=test_pre.mean(axis=0)
    print("%s_score_list:"%clf_name,cv_scores)
    print("%s_score_mean:"%clf_name,np.mean(cv_scores))
    return train.reshape(-1,1),test.reshape(-1,1)

def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2017, max_features="auto",verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test,"rf_reg"

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test,"ada_reg"

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017,max_depth=5,verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test,"gb_reg"

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features="auto", n_jobs=-1, random_state=2017,verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test,"et_reg"

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr_reg=LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr_reg, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test,"xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test,"lgb_reg"

(2) stacking 分类特征

"""
-- 分类
-- stacking 分类特征
"""
def stacking_clf(clf,train_x,train_y,test_x,clf_name,kf,label_split=None):
    train=np.zeros((train_x.shape[0],1))
    test=np.zeros((test_x.shape[0],1))
    test_pre=np.empty((folds,test_x.shape[0],1))
    cv_scores=[]
    for i,(train_index,test_index) in enumerate(kf.split(train_x,label_split)):       
        tr_x=train_x[train_index]
        tr_y=train_y[train_index]
        te_x=train_x[test_index]
        te_y = train_y[test_index]

        if clf_name in ["rf","ada","gb","et","lr","knn","gnb"]:
            clf.fit(tr_x,tr_y)
            pre=clf.predict_proba(te_x)
            
            train[test_index]=pre[:,0].reshape(-1,1)
            test_pre[i,:]=clf.predict_proba(test_x)[:,0].reshape(-1,1)
            
            cv_scores.append(log_loss(te_y, pre[:,0].reshape(-1,1)))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'multi:softprob',
                      'eval_metric': 'mlogloss',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      "num_class": 2
                      }

            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')
                         ]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round,evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(test_matrix,ntree_limit=model.best_ntree_limit)
                train[test_index]=pre[:,0].reshape(-1,1)
                test_pre[i, :]= model.predict(z, ntree_limit=model.best_ntree_limit)[:,0].reshape(-1,1)
                cv_scores.append(log_loss(te_y, pre[:,0].reshape(-1,1)))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {
                      'boosting_type': 'gbdt',
                      #'boosting_type': 'dart',
                      'objective': 'multiclass',
                      'metric': 'multi_logloss',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'learning_rate': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      "num_class": 2,
                      'silent': True,
                      }
            num_round = 10000
            #early_stopping_rounds = 100
            #callbacks=[lightgbm.log_evaluation(period=100), lightgbm.early_stopping(stopping_rounds=100)]
            callbacks=[lightgbm.early_stopping(stopping_rounds=100)]
            if test_matrix:
                model = clf.train(params, train_matrix,num_round,valid_sets=test_matrix,
                                  #early_stopping_rounds=early_stopping_rounds
                                  callbacks=callbacks
                                  )
                pre= model.predict(te_x,num_iteration=model.best_iteration)
                train[test_index]=pre[:,0].reshape(-1,1)
                test_pre[i, :]= model.predict(test_x, num_iteration=model.best_iteration)[:,0].reshape(-1,1)
                cv_scores.append(log_loss(te_y, pre[:,0].reshape(-1,1)))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:"%clf_name,cv_scores)
    test[:]=test_pre.mean(axis=0)
    print("%s_score_list:"%clf_name,cv_scores)
    print("%s_score_mean:"%clf_name,np.mean(cv_scores))
    return train.reshape(-1,1),test.reshape(-1,1)

def rf_clf(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestClassifier(n_estimators=1200, max_depth=20, n_jobs=-1, random_state=2017, max_features="auto",verbose=1)
    rf_train, rf_test = stacking_clf(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test,"rf"

def ada_clf(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostClassifier(n_estimators=50, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_clf(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test,"ada"

def gb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingClassifier(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017,max_depth=5,verbose=1)
    gbdt_train, gbdt_test = stacking_clf(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test,"gb"

def et_clf(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesClassifier(n_estimators=1200, max_depth=35, max_features="auto", n_jobs=-1, random_state=2017,verbose=1)
    et_train, et_test = stacking_clf(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test,"et"

def xgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_clf(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test,"xgb"

def lgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_clf(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return xgb_train, xgb_test,"lgb"

def gnb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gnb=GaussianNB()
    gnb_train, gnb_test = stacking_clf(gnb, x_train, y_train, x_valid, "gnb", kf, label_split=label_split)
    return gnb_train, gnb_test,"gnb"

def lr_clf(x_train, y_train, x_valid, kf, label_split=None):
    logisticregression=LogisticRegression(n_jobs=-1,random_state=2017,C=0.1,max_iter=200)
    lr_train, lr_test = stacking_clf(logisticregression, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr"

def knn_clf(x_train, y_train, x_valid, kf, label_split=None):
    kneighbors=KNeighborsClassifier(n_neighbors=200,n_jobs=-1)
    knn_train, knn_test = stacking_clf(kneighbors, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return knn_train, knn_test, "knn"

(3) 获取训练和验证数据

features_columns = [c for c in all_data_test.columns if c not in ['label', 'prob', 'seller_path', 'cat_path', 'brand_path', 'action_type_path', 'item_path', 'time_stamp_path']]
x_train = all_data_test[~all_data_test['label'].isna()][features_columns].values
y_train = all_data_test[~all_data_test['label'].isna()]['label'].values
x_valid = all_data_test[all_data_test['label'].isna()][features_columns].values

处理函数值inf以及nan，为特征工程做准备

def get_matrix(data):
    where_are_nan = np.isnan(data)
    where_are_inf = np.isinf(data)
    data[where_are_nan] = 0
    data[where_are_inf] = 0
    return data

1
2
3

x_train = np.float_(get_matrix(np.float_(x_train)))
y_train = np.int_(y_train)
x_valid = x_train

(4) 使用lgb和xgb分类模型构造stacking特征

1）使用5折交叉验证

from sklearn.model_selection import StratifiedKFold, KFold
folds = 5
seed = 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)

2）选择1gb和xgb分类模型作为基模型。

# clf_list = [lgb_clf, xgb_clf, lgb_reg, xgb_reg]
# clf_list_col = ['lgb_clf', 'xgb_clf', 'lgb_reg', 'xgb_reg']

clf_list = [lgb_clf, xgb_clf]
clf_list_col = ['lgb_clf', 'xgb_clf']

3）训练模型，获取stacking特征

clf_list = clf_list
column_list = []
train_data_list=[]
test_data_list=[]
for clf in clf_list:
    train_data,test_data,clf_name=clf(x_train, y_train, x_valid, kf, label_split=None)
    train_data_list.append(train_data)
    test_data_list.append(test_data)
train_stacking = np.concatenate(train_data_list, axis=1)
test_stacking = np.concatenate(test_data_list, axis=1)

xgb now score is: [2.627641465068789, 2.5421163955938, 2.4962673833575, 2.393391320956645, 2.488317792597411]
xgb_score_list: [2.627641465068789, 2.5421163955938, 2.4962673833575, 2.393391320956645, 2.488317792597411]
xgb_score_mean: 2.5095468715148295

（5）原始特征和stacking特征合并

1
2
3

# 合并所有特征
train = pd.DataFrame(np.concatenate([x_train, train_stacking], axis=1))
test = np.concatenate([x_valid, test_stacking], axis=1)

# 特征重命名
df_train_all = pd.DataFrame(train)
df_train_all.columns = features_columns + clf_list_col
df_test_all = pd.DataFrame(test)
df_test_all.columns = features_columns + clf_list_col

1 2	`# 获取数据ID以及特征标签label df_train_all['label'] = all_data_test['label']`

10 保存特征数据

1 2	`df_train_all.to_csv('./data/train_all.csv',header=True,index=False) df_test_all.to_csv('./data/test_all.csv',header=True,index=False)`

机器学习实战

#机器学习 #特征工程 #数据压缩 #用户行为分析

天猫重复购买预测-03特征工程

https://blog.966677.xyz/2023/12/07/天猫重复购买预测-03特征工程/

作者

Zhou1317fe5

发布于

2023年12月7日

许可协议

天猫重复购买预测-04模型训练、验证和评测上一篇

天猫重复购买预测-02数据探索下一篇