〆 推荐算法实战 (Recommendation Algorithms in Practice) by 项亮 [Study Notes II]

⚠ Just one step beyond fear lies more courage.

Step Two: environment – Python 3.7

Hands-on practice

Download the dataset from the URL given in the book:

Choose the MovieLens 1M Dataset under the older datasets section.

After the download, the next step was the classifier from the implementation design. But as a complete beginner I was lost and could not continue: I could follow the theory well enough, but where does the classifier get its data from?

Getting the data from a database

So I used the Navicat tool to import the data into MySQL: I chose the txt files to import (manually adding a header line at the top of each file, matching the database fields), selected each file, and imported them in bulk.

Once that was done, I suddenly realized that if I already have structured data, I do not need a classifier at all; I could just do everything directly in SQL. That seems inconsistent with the starting point of the book's code, so I decided the direction was wrong.

Using pandas for the data

After that wrong turn, I googled "MovieLens data analysis"

and found an article that analyzes the MovieLens data with pandas,

so I started using pandas.

Installing the libraries

pip install pandas
pip install numpy
numpy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

NumPy is licensed under the BSD license, enabling reuse with few restrictions.

The text above is the official site's description of numpy. From it we can tell that numpy is what we use for scientific computing, down to something as simple as generating random numbers.
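For instance, a minimal sketch of that kind of use (the array shapes and values here are arbitrary):

import numpy as np

# a 3x3 array of uniform random numbers in [0, 1)
r = np.random.rand(3, 3)

# broadcasting: add a scalar to every element without writing a loop
shifted = r + 10

# basic linear algebra: matrix product with the transpose
product = shifted @ shifted.T

print(r, shifted, product, sep='\n')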

pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as a world-class open-source project, and makes it possible to donate to the project.

pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better) than other open source implementations (like base::merge.data.frame in R). The reason for this is careful algorithmic design and the internal layout of the data in DataFrame.

See the cookbook for some advanced strategies.

Users who are familiar with SQL but new to pandas might be interested in a comparison with SQL.

pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame objects:

Converting the data structures with pandas

Collecting the raw data
import pandas as pd

# the prefix of the data file paths
url_prefix = 'E:/math/demoOne/util/dataset/rawData/'

# read a '::'-separated .dat file into a DataFrame
def readCsv(path, sep, cols, coding):
    if coding:
        return pd.read_csv(path, sep=sep, names=cols,
                           encoding=coding, engine='python')
    return pd.read_csv(path, sep=sep, names=cols,
                       encoding='latin-1', engine='python')

# read data from users.dat
user_cols = ['user_id', 'gender', 'age', 'occupation', 'zip_code']
users = readCsv(url_prefix + 'users.dat', '::', user_cols, None)

# read data from movies.dat
movie_cols = ['movie_id', 'title', 'genre']
movies = readCsv(url_prefix + 'movies.dat', '::', movie_cols, None)

# read data from ratings.dat
rating_cols = ['user_id', 'movie_id', 'rating', 'time_stamp']
ratings = readCsv(url_prefix + 'ratings.dat', '::', rating_cols, None)

print("*******users*******", "\n", users, "\n", "*******movies*******", "\n",
      movies, "\n", "*******ratings*******", "\n", ratings)

After the print you can see the basic structure of the raw data.

Note 1: engine='python' in readCsv can be omitted. The default is then the C engine, which is faster but supports fewer features than the python engine; for example, parsing the separator "::" may produce an error message.

Note 2: url_prefix is the absolute path of the folder that holds my data files.
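To make Note 1 concrete, here is a small sketch (assuming users.dat sits in the working directory). As far as I know, with a multi-character separator like '::', pandas falls back to the python engine and emits a ParserWarning when engine is omitted; passing engine='python' explicitly gives the same result without the warning:

import pandas as pd

cols = ['user_id', 'gender', 'age', 'occupation', 'zip_code']

# engine omitted: the C engine cannot handle the multi-character
# separator '::', so pandas falls back to the python engine and
# emits a ParserWarning
users = pd.read_csv('users.dat', sep='::', names=cols, encoding='latin-1')

# engine given explicitly: same result, no warning
users = pd.read_csv('users.dat', sep='::', names=cols,
                    encoding='latin-1', engine='python')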

Merging the data with merge()
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
  • left: A DataFrame object.
  • right: Another DataFrame object.
  • on: Column or index level names to join on. Must be found in both the left and right DataFrame objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.
  • left_on: Columns or index levels from the left DataFrame to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame.
  • right_on: Columns or index levels from the right DataFrame to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame.
  • left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
  • right_index: Same usage as left_index for the right DataFrame
  • how: One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method.
  • sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases.
  • suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to (‘_x’, ‘_y’).
  • copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.
  • indicator: Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame, right_only for observations whose merge key only appears in ‘right’ DataFrame, and both if the observation’s merge key is found in both.
  • validate : string, default None. If specified, checks if merge is of specified type.
    • “one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets.
    • “one_to_many” or “1:m”: checks if merge keys are unique in left dataset.
    • “many_to_one” or “m:1”: checks if merge keys are unique in right dataset.
    • “many_to_many” or “m:m”: allowed, but does not result in checks.

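As a concrete example, in place of the screenshot in the original post, here is a minimal sketch of merge() on two tiny hand-made DataFrames (the column names and values are purely illustrative):

import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': [1, 2, 3]})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'B': [4, 5, 6]})

# inner join (the default): keep only keys present in both frames
inner = pd.merge(left, right, on='key')

# outer join: keep all keys, fill missing values with NaN, and mark
# the source of each row in a '_merge' column
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

print(inner, outer, sep='\n')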
Once merge() makes sense, we join the earlier movies and ratings, and then the result with users:

# movies and ratings
movie_ratings = pd.merge(movies, ratings)
# movie_ratings and users
lens = pd.merge(movie_ratings, users)
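A quick sanity check of the merged frame: if the joins matched on movie_id and then user_id as expected, lens should carry the columns of all three source frames.

# one row per rating, with the movie and user attributes attached
print(lens.shape)
print(lens.columns.tolist())
# expected: movie_id, title, genre, user_id, rating, time_stamp,
# gender, age, occupation, zip_code
print(lens.head())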

Data analysis

Top 25 most rated movies

most_rated = lens.groupby('title').size().sort_values(ascending=False)[:25]
print(most_rated)

What this line does: group the rows by title, count how many ratings each title has, sort the counts from largest to smallest, and take the first 25.

A simpler way to get the same result:

lens.title.value_counts()[:25]

With that done, back to the main topic: classifying the data with a classifier.

Classifiers are a long story in themselves. Without further ado, first install a machine-learning library for Python:

pip install scikit-learn
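As a quick check that the install works, here is a toy sketch only, not the book's classifier: it fits a simple scikit-learn model on the lens frame, with features chosen purely for illustration.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy features and label, chosen only to exercise the library
X = lens[['age', 'rating']]
y = lens['gender']

# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))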

Then back to the main thread, Xiang Liang's book: finish reading the whole chapter first, then pull everything together into one complete piece of code.

The true one now takes up his charge (真子今将命)


Respectfully yours!

Seeing Off Daoist Master Zhao on His Return to Shu, Offering Ritual Tablets at the Famous Mountain (送赵法师还蜀因名山奠简)

Author: Li Longji (李隆基)

From: Complete Tang Poems (《全唐诗》)

The Daoists lay out their sacred tablets; since days of old we have looked up to the immortals.

The true one now takes up his charge; blessings may pass to all the people.

Among rivers and hills he seeks his old homeland; the city walls indeed stand as before.

The two peaks gaze at each other from afar; clouds wheel back to the heaven within the grotto.

Motto: Evolution is the major force that defines our bodily shape as well as our built-in instincts and reflexes. We also learn to change our behavior during our lifetime. This helps us cope with changes in the environment that cannot be predicted by evolution. Organisms that have a short life in a well-defined environment may have all their behavior built-in, but instead of hardwiring into us all sorts of behavior for any circumstance that we could encounter in our life, evolution gave us a large brain and a mechanism to learn, such that we could update ourselves with experience and adapt to different environments.