This article is updated on an ongoing basis…

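Environment

System: Windows 10 Home (Chinese edition)
Hardware: 16 GB RAM, 8-core CPU
Python version: 3.7.6
Pandas version: 1.3.5
Matplotlib version: 3.3.4
Highly recommended video course: Python自动化办公社区
Companion materials for the videos: 语雀-Python自动化办公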

Environment setup

Install Pandas

# Use the Tsinghua mirror to speed up the download
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas

Install openpyxl

# Use the Tsinghua mirror to speed up the download
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple openpyxl

Install matplotlib

# Use the Tsinghua mirror to speed up the download
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib

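Basic operations

Creating a file

Syntax

Close the target file (for example in Excel) before writing to it, otherwise the write can fail with a permission error
Passing index=False to to_excel() keeps the index out of the sheet

import pandas as pd

df = pd.DataFrame({
    'column_name_1': [value, value, value],
    'column_name_2': [value, value, value]
})
df = df.set_index('column_name_1')		# use one of the columns as the index
df.to_excel('output_path.xlsx')

Without the auto-generated index

After the script succeeds, open the output file (1.xlsx in the example below) to see what was written
When a column is set as the index, that column is written as the first column and the auto-generated 0, 1, 2… index does not appear in the sheet
If you neither set a column as the index nor want the index exported, pass index=False to to_excel()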

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Tim', 'Tom', 'Jack']
})
df = df.set_index('ID')		# use the ID column as the index
df.to_excel('1.xlsx')       # export to the given path
print(df)
print("Success!")


# Result
"""
    Name
ID      
1    Tim
2    Tom
3   Jack
Success!
"""

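Writing the index

When no index column is specified, pandas automatically creates a numeric index
That index is written to the sheet unless index=False is passed to to_excel()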

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Tim', 'Tom', 'Jack']
})
df.to_excel('1.xlsx')       # export to the given path
print(df)
print("Success!")


# Result
"""
   ID  Name
0   1   Tim
1   2   Tom
2   3  Jack
Success!
"""

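Reading a file

Syntax

The header parameter of read_excel() selects the header row; the default is row 0 (row numbers start at 0)
When row 0 holds junk data, set header to another row number to use that row as the header
When the sheet has no header row, set header=None; the column names can then be assigned through the columns attribute of the DataFrame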

import pandas as pd

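# Read a file, with commonly used parameters
users = pd.read_excel(
    "xx.xlsx",  # file to read
    header=0,  # row to use as the header
    index_col=column,  # column to use as the index
    skiprows=rows,  # number of rows to skip
    usecols="col:col",  # column range to read, e.g. "B:E"
    dtype={"column": data_type})  # data type per column, e.g. {"a": int}


# Attributes and methods
users.shape    		# number of rows and columns, as (rows, columns)
users.columns  		# column names (taken from the first row by default)
users.head(n)  		# first n rows, 5 by default
users.tail(n)  		# last n rows, 5 by default
users.columns = [name, name]		# assign column names; the index column is not part of columns, so leave it out of the list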

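Basic read

import pandas as pd

users = pd.read_excel('1.xlsx')
users.set_index('ID', inplace=True)  # set the index column in place, without creating a new DataFrame

print(users.shape)  # print the number of rows and columns
print(users.columns)  # print the column names (the field names in the first row)
print(users.head(2))  # print the first 2 rows
print("-----------------------------")
print(users.tail(2))  # print the last 2 rows
print("-----------------------------")
print(users)  # print all the data

# Result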
(3, 1)
Index(['Name'], dtype='object')
   Name
ID     
1   Tim
2   Tom
-----------------------------
    Name
ID      
2    Tom
3   Jack
-----------------------------
    Name
ID      
1    Tim
2    Tom
3   Jack

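[Figure: sample sheet data]

Skipping empty rows and columns

When the data does not start in the top-left corner, use skiprows to skip N rows and usecols to read only a given range of columns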

import pandas as pd

users = pd.read_excel('2.xlsx', skiprows=3, usecols="C:E")
print(users)


# Result
"""
   id   user  pass
0   1  admin   123
1   2  guest   321
2   3  user1   132
"""

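[Figure: sample sheet data]

Skipping junk data

The first row is the header by default; when the first row is not the header, another row can be designated

import pandas as pd

users = pd.read_excel('4.xlsx', header=1)	# use the second row as the header (row numbers start at 0)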
print(users)


# Result
"""
   id   user  pass
0   1  admin   123
1   2  guest   321
2   3  user1   132
"""

# When read directly (without header=1)
"""
  sadasd gfdgdf   dsd
0     id   user  pass
1      1  admin   123
2      2  guest   321
3      3  user1   132
"""

No header

With header=None the columns are labelled with integers starting at 0

import pandas as pd

users = pd.read_excel('4.xlsx', header=None)
print(users)


# Result
"""
   0      1    2
0  1  admin  123
1  2  guest  321
2  3  user1  132
"""

Custom header

import pandas as pd

users = pd.read_excel('4.xlsx', header=None)
users.columns = ['id', 'user', 'pass']      # assign the column names
print(users)


# Result
"""
   id   user  pass
0   1  admin   123
1   2  guest   321
2   3  user1   132
"""

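Series

Syntax

When several Series with different indexes are combined, the cells with no matching index are set to NaN

import pandas as pd

s1 = pd.Series([value, value], index=[label, label], name='column_name')	# create a Series; name is a single label
s1[label]		# get the value at the given index label

Creating a Series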

import pandas as pd

l1 = [100, 200, 300]
l2 = ['x', 'y', 'z']
l3 = {'x': 100, 'y': 200, 'z': 300}

s1 = pd.Series(l1, index=l2)        # create a Series with the values of l2 as the index
s2 = pd.Series(l3)                  # create from a dict: keys become the index, values become the data
print(s1)
print("-------------")
print(s2)
print("-------------")
print(s2['x'])


# Result (the dtype is int64)
"""
x    100
y    200
z    300
dtype: int64
-------------
x    100
y    200
z    300
dtype: int64
-------------
100
"""

Example 2

index works like the row labels, and name works like the column name
Create the Series first, then put them into a DataFrame

import pandas as pd

s1 = pd.Series([1, 2, 3], index=[1, 2, 3], name='A')
s2 = pd.Series([10, 20, 30], index=[1, 2, 3], name='B')
s3 = pd.Series([100, 200, 300], index=[1, 2, 3], name='C')
df = pd.DataFrame({s1.name: s1, s2.name: s2, s3.name: s3})
print(df)

# Result
"""
   A   B    C
1  1  10  100
2  2  20  200
3  3  30  300
"""

Example 3

Where the index of one Series differs from the others, the missing cells are NaN

import pandas as pd

s1 = pd.Series([1, 2, 3], index=[1, 2, 3], name='A')
s2 = pd.Series([10, 20, 30], index=[1, 2, 3], name='B')
s3 = pd.Series([100, 200, 300], index=[2, 3, 4], name='C')
df = pd.DataFrame({s1.name: s1, s2.name: s2, s3.name: s3})
print(df)
# df.to_excel('out.xlsx')		# export

# Result
"""
     A     B      C
1  1.0  10.0    NaN
2  2.0  20.0  100.0
3  3.0  30.0  200.0
4  NaN   NaN  300.0
"""

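Modifying data

Syntax

import pandas as pd

users = pd.read_excel('file_to_read.xlsx')
users[column].at[label] = value		# method 1: get the Series first, then modify it (chained assignment; with Copy-on-Write in newer pandas the change does not reach the DataFrame)
users.at[label, column] = value		# method 2: address the cell through the DataFrame (preferred)

Method 1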

import pandas as pd

users = pd.read_excel('3.xlsx')
print("---------原数据---------")
print(users)

users['id'].at[0] = 100
users['id'].at[1] = 200
users['id'].at[2] = 300

print("\n---------修改后---------")
print(users)

# Result
"""
---------原数据---------
   id   user  pass  birthday
0 NaN  admin   123       NaN
1 NaN  guest   321       NaN
2 NaN  user1   132       NaN

---------修改后---------
      id   user  pass  birthday
0  100.0  admin   123       NaN
1  200.0  guest   321       NaN
2  300.0  user1   132       NaN
"""

Method 2

import pandas as pd

users = pd.read_excel('3.xlsx')
print("---------原数据---------")
print(users)

users.at[0, 'id'] = 100
users.at[1, 'id'] = 200
users.at[2, 'id'] = 300

print("\n---------修改后---------")
print(users)


# Result
"""
---------原数据---------
   id   user  pass  birthday
0 NaN  admin   123       NaN
1 NaN  guest   321       NaN
2 NaN  user1   132       NaN

---------修改后---------
      id   user  pass  birthday
0  100.0  admin   123       NaN
1  200.0  guest   321       NaN
2  300.0  user1   132       NaN
"""

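Auto-fill

Filling in numbers

A for loop can be used to fill in the data automatically
The id column is read as float by default and needs to become int
Because the cells are empty (NaN), the column cannot be converted to int, so it is read as str first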

import pandas as pd

users = pd.read_excel('3.xlsx', dtype={'id': str})
print("---------原数据---------")
print(users)

for i in users.index:
    users['id'].at[i] = i+1

print("\n---------修改后---------")
print(users)


# Result
"""
---------原数据---------
    id   user  pass  birthday
0  NaN  admin   123       NaN
1  NaN  guest   321       NaN
2  NaN  user1   132       NaN

---------修改后---------
  id   user  pass  birthday
0  1  admin   123       NaN
1  2  guest   321       NaN
2  3  user1   132       NaN
"""

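Filling in dates

The approach is the same as above; the month arithmetic in add_month() still has a bug that could be fixed (it raises ValueError when the target month is shorter than the starting day, for example January 31 plus one month)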

import pandas as pd
from datetime import date, timedelta

users = pd.read_excel('3.xlsx', dtype={'id': str})
print("---------原数据---------")
print(users)


def add_month(d, md):
    """运算时间，添加月份"""
    yd = md // 12
    m = d.month + md % 12
    if m != 12:
        yd += m // 12
        m = m % 12
    return date(d.year + yd, m, d.day)


# Auto-fill the data
start_time = date(2022, 1, 1)  # start date
for i in users.index:
    users['id'].at[i] = i + 1
    # users['birthday'].at[i] = start_time + timedelta(days=i)      # increment by day
    # users['birthday'].at[i] = add_month(start_time, i)            # increment by month
    users['birthday'].at[i] = date(start_time.year+i, start_time.month, start_time.day)     # increment by year

print("\n---------修改后---------")
print(users)


# Result
"""
---------原数据---------
    id   user  pass  birthday
0  NaN  admin   123       NaN
1  NaN  guest   321       NaN
2  NaN  user1   132       NaN

---------修改后---------
  id   user  pass    birthday
0  1  admin   123  2022-01-01
1  2  guest   321  2023-01-01
2  3  user1   132  2024-01-01
"""

Filling with a formula

To compute TotalPrice, take the UnitPrice and Number columns, multiply them, and assign the product to TotalPrice
Where a cell is empty, the result is empty as well

import pandas as pd

books = pd.read_excel('4.xlsx', index_col='ID')
print(books)
print("-------------------------")
books['TotalPrice'] = books['UnitPrice'] * books['Number']      # total price column = unit price column * quantity column
print(books)


# Result
"""
      Name  UnitPrice  Number  TotalPrice
ID                                       
1   book01       10.0       1         NaN
2   book02       20.0       2         NaN
3   book03       30.0       3         NaN
4   book04       35.0       4         NaN
5   book05       10.5       5         NaN
-------------------------
      Name  UnitPrice  Number  TotalPrice
ID                                       
1   book01       10.0       1        10.0
2   book02       20.0       2        40.0
3   book03       30.0       3        90.0
4   book04       35.0       4       140.0
5   book05       10.5       5        52.5
"""

Filling only some cells

To compute only some of the cells, iterate over them with a for loop

import pandas as pd

books = pd.read_excel('4.xlsx', index_col='ID')
print(books)
print("-------------------------")
for i in range(2, 4):
    books['TotalPrice'].at[i] = books['UnitPrice'].at[i] * books['Number'].at[i]
print(books)


# Result (only the rows with ID 2 and 3 are filled)
"""
      Name  UnitPrice  Number  TotalPrice
ID                                       
1   book01       10.0       1         NaN
2   book02       20.0       2         NaN
3   book03       30.0       3         NaN
4   book04       35.0       4         NaN
5   book05       10.5       5         NaN
-------------------------
      Name  UnitPrice  Number  TotalPrice
ID                                       
1   book01       10.0       1         NaN
2   book02       20.0       2        40.0
3   book03       30.0       3        90.0
4   book04       35.0       4         NaN
5   book05       10.5       5         NaN
"""

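Sorting data

Syntax

inplace=True makes sort_values() modify the current DataFrame instead of creating a new one
ascending=True sorts in ascending order and ascending=False in descending order; the default is True
A multi-column sort can use a different direction per column: each name in by is paired with the entry at the same position in ascending

import pandas as pd

books = pd.read_excel('file_to_read.xlsx')
books.sort_values(by='sort_column', inplace=True, ascending=True)		# single-column sort
books.sort_values(by=['primary_column', 'secondary_column'], inplace=True, ascending=[True, False])		# multi-column sort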

Single-column sort

Sort by UnitPrice in ascending order

import pandas as pd

books = pd.read_excel('4.xlsx', index_col='ID')
books.sort_values(by='UnitPrice', inplace=True)     # sort by unit price, ascending
print(books)


# Result
"""
      Name  UnitPrice Worthy
ID                          
1   book01       10.0    YES
6   book06       10.0     NO
5   book05       10.5    YES
7   book07       13.0     NO
10  book10       15.0    YES
8   book08       18.0    YES
2   book02       20.0    YES
9   book09       20.0     NO
3   book03       30.0     NO
4   book04       35.0     NO
"""

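Multi-column sort

Find the row with the highest UnitPrice among those whose Worthy is NO
Sort the Worthy column in ascending order first, then the UnitPrice column in descending order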

import pandas as pd

books = pd.read_excel('4.xlsx', index_col='ID')
books.sort_values(by=['Worthy', 'UnitPrice'], inplace=True, ascending=[True, False])
print(books)


# Result
"""
      Name  UnitPrice Worthy
ID                          
4   book04       35.0     NO
3   book03       30.0     NO
9   book09       20.0     NO
7   book07       13.0     NO
6   book06       10.0     NO
2   book02       20.0    YES
8   book08       18.0    YES
10  book10       15.0    YES
5   book05       10.5    YES
1   book01       10.0    YES
"""

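Filtering

Syntax

Filter on one column of the DataFrame
Define a function that tests each value passed in, then call it through apply()
Note that apply() takes the function itself (its name, without parentheses); a lambda also works

import pandas as pd

students = pd.read_excel('file_to_read.xlsx')
students = students.loc[students['column'].apply(func)]		# option 1
students = students.loc[students.column.apply(func)]			# option 2

# Chained filters
students.loc[students['column'].apply(func)].loc[students['column2'].apply(func)]

Single filter

Select the students aged 18 to 29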

import pandas as pd

def filter_age(a):
    """Keep ages of at least 18 and under 30"""
    return 18 <= a < 30


students = pd.read_excel('4.xlsx', index_col='ID')
students = students.loc[students['Age'].apply(filter_age)]
# students = students.loc[students['Age'].apply(lambda a: 18 <= a < 30)]		# the same with a lambda

print(students)


# Result
"""
   Name  Age  Score
ID                 
1    张三   18     89
5    陈七   20     80
6    老八   19     85
"""

Chained filters

Select the students aged 18 to 29
Then select those scoring between 85 and 100
Lambdas are convenient here

import pandas as pd

students = pd.read_excel('4.xlsx', index_col='ID')
students = students.loc[students['Age'].apply(lambda a: 18 <= a < 30)]\
    .loc[students['Score'].apply(lambda s: 85 <= s <= 100)]

print(students)


# Result
"""
   Name  Age  Score
ID                 
1    张三   18     89
6    老八   19     85
"""

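Data visualization

Bar charts

Syntax

Install the matplotlib library before drawing charts
plot.bar() draws a vertical bar chart and plot.barh() a horizontal one

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

# Use the Microsoft YaHei font; otherwise Chinese text is not rendered correctly (it typically shows as empty boxes)
rc("font", family='Microsoft YaHei')

subject = pd.read_excel('4.xlsx')
subject.sort_values(by='column', inplace=True)	# optionally sort by a column before plotting
subject.plot.bar(x='x_column', y='y_column', color='color', title='title')
subject.plot.bar(x='x_column', y=['y1', 'y2'], color=['y1_color', 'y2_color'], title='title')		# several y columns
plt.show()	# display the chart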

Plotting with Pandas

To order the bars, sort the data first and then draw the chart
Pass inplace=True when sorting; otherwise sort_values() returns a new DataFrame and the original stays unsorted

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

rc("font", family='Microsoft YaHei')

subject = pd.read_excel('4.xlsx', index_col='ID')
print(subject)
subject.sort_values(by='Number', inplace=True, ascending=False)     # sort in descending order
subject.plot.bar(x='Subject', y='Number', color='orange', title='学生最爱的科目')
# subject.plot.barh(x='Subject', y='Number', color='orange', title='学生最爱的科目')	# horizontal
plt.show()


# Data
"""
ID	Subject	Number
1	语文	13
2	数学	16
3	英语	19
4	历史	13
5	化学	10
6	地理	15
7	生物	18

"""

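[Figure: vertical bar chart]

[Figure: horizontal bar chart]

Plotting with matplotlib

Drawing with matplotlib directly is more flexible; the Pandas plotting methods only cover fairly standard charts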

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

rc("font", family='Microsoft YaHei')

subject = pd.read_excel('4.xlsx', index_col='ID')
print(subject)
subject.sort_values(by='Number', inplace=True, ascending=False)     # descending order
plt.bar(subject.Subject, subject.Number)    # x-axis and y-axis data
plt.xlabel('Subject')       # x-axis label
plt.ylabel('Number')        # y-axis label
plt.title('学生最爱的科目')    # chart title
plt.show()

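Polishing the bar chart

Syntax

matplotlib can be used to refine the chart

from matplotlib import pylab as plt

plt.title('title', fontsize=font_size, fontweight='font_weight')	# set the title and its style

plt.xlabel('x_axis_label', fontweight='weight')      # set the x-axis label and its font weight
plt.ylabel('y_axis_label', fontweight='weight')      # set the y-axis label and its font weight

# Rotate the tick labels; ha is one of left, right, center
ax = plt.gca()
ax.set_xticklabels(df['column'], rotation=angle, ha='anchor')

# Margins of the chart (the blank area); the values are fractions of the figure size, top must be greater than bottom and right greater than left
f = plt.gcf()
f.subplots_adjust(top=value, bottom=value, left=value, right=value)     # set the top, bottom, left and right margins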

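Bar chart with several columns

When a column name is a number, do not put quotes around it when sorting or plotting by that column!
Pandas can draw multi-column charts as well, as in the syntax above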

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

rc("font", family='Microsoft YaHei')

subject = pd.read_excel('4.xlsx', index_col='ID')
subject.sort_values(by=2022, inplace=True, ascending=False)     # sort in descending order
subject.plot.bar(x='Subject', y=[2021, 2022], color=['orange', 'red'])

plt.title('2021年与2022年对比图', fontweight='bold')  # chart title, in bold
plt.xlabel('科目', fontweight='bold')    # x-axis label, in bold
plt.ylabel('数值')   # y-axis label

ax = plt.gca()
ax.set_xticklabels(subject['Subject'], rotation=360, ha='center')   # rotate the tick labels (360 degrees leaves them horizontal)

f = plt.gcf()
f.subplots_adjust(left=0.1, bottom=0.15, top=0.9, right=0.95)     # margins as fractions of the figure size (0 to 1)
plt.show()


# Data
"""
ID	Subject	2021	2022
1	语文	13	16
2	数学	16	13
3	英语	19	15
4	历史	13	18
5	化学	10	12
6	地理	15	14
7	生物	18	16

"""

Stacked horizontal bar chart

Setting the stacked parameter of plot.barh() or plot.bar() to True produces a stacked bar chart

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

rc("font", family='Microsoft YaHei')

users = pd.read_excel('4.xlsx', index_col='ID')
users['all_number'] = users['1月'] + users['2月'] + users['3月']       # new column: each user's total usage over all months
users.sort_values(by='all_number', inplace=True)     # sort by the total usage, ascending
users.plot.barh(x='Name', y=['1月', '2月', '3月'], stacked=True)       # horizontal chart with stacked bars
plt.title('用户使用软件频率表')
plt.xlabel('数值')

plt.show()


# Data
"""
ID	Name	1月	2月	3月
1	张三	10	8	3
2	李四	12	9	13
3	王五	6	8	10
4	老六	13	8	15
5	陈七	9	9	17
6	老八	8	10	15

"""

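Pie charts

Syntax

For a pie chart, set the column holding the category labels as the index, then plot the column holding the values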

import pandas as pd
from matplotlib import pylab as plt

subject = pd.read_excel('file_to_read.xlsx', index_col='label_column')
subject['value_column'].plot.pie()
plt.show()

Example

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

rc("font", family='Microsoft YaHei')

subject = pd.read_excel('4.xlsx', index_col='Subject')      # use Subject as the index
subject['Number'].sort_values(ascending=False).plot.pie()   # sort, then draw the pie chart
plt.title('学生喜爱的科目比例')
plt.ylabel('数据')
plt.show()


# Data
"""
ID	Subject	Number
1	语文	9
2	数学	11
3	英语	2
4	历史	7
5	化学	6
6	地理	13
7	生物	18

"""

Line charts

Syntax

Set the index column when reading the file; it becomes the x-axis

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

rc("font", family='Microsoft YaHei')
app_number = pd.read_excel('file_to_read.xlsx', index_col='index_column')

# Line chart
app_number.plot(y=['column', 'column'])		# several columns
app_number.plot(y='column')		# one column

# Stacked area chart
app_number.plot.area(y=['column', 'column'])	# several columns
app_number.plot.area(y='column')		# one column

plt.show()

Line chart example

import pandas as pd
from matplotlib import pylab as plt
from matplotlib import rc

rc("font", family='Microsoft YaHei')

app_number = pd.read_excel('4.xlsx', index_col='月份')      # use the month column as the index
app_number.plot(y=['张三', '李四', '王五'], title="用户软件使用频率")     # line chart
# app_number.plot.area(y=['张三', '李四', '王五'], title="用户软件使用频率")      # stacked area chart
plt.ylabel('频率')
plt.xlabel('月份')
plt.show()


# Data (copy and paste straight into Excel)
"""
月份	张三	李四	王五
1	10	3	11
2	15	5	17
3	13	2	9
4	18	7	15
5	9	13	13
6	7	10	16
7	13	20	9
8	15	15	7
9	6	7	11
10	18	12	13
11	14	9	8
12	11	5	7

"""

Scatter plots

Syntax

import pandas as pd
from matplotlib import pylab as plt

homes = pd.read_excel('file_to_read.xlsx')
homes.plot.scatter(x='x_column', y='y_column')
plt.show()

Drawing a scatter plot

--> Download the sample data <--

import pandas as pd
from matplotlib import pylab as plt

# pd.options.display.max_columns = 777    # raise the column display limit so that all columns are shown
homes = pd.read_excel('home_data.xlsx')
homes.plot.scatter(x='price', y='sqft_living')		# scatter plot of price against living area
# print(homes.head())
plt.show()

Histograms

Syntax

import pandas as pd
from matplotlib import pylab as plt

homes = pd.read_excel('file_to_read.xlsx')
homes['column'].plot.hist(bins=number_of_bins)
plt.show()

Drawing a histogram

The sample data is the same as above

import pandas as pd
from matplotlib import pylab as plt

homes = pd.read_excel('home_data.xlsx')
homes['sqft_living'].plot.hist(bins=100)        # use 100 bins (gives a finer-grained picture)
plt.show()
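
To see what bins controls without drawing anything, the counts can be computed directly with numpy.histogram(), which splits the value range into that many equal-width intervals. The living-area numbers below are made up for illustration and are not taken from home_data.xlsx.

```python
import numpy as np
import pandas as pd

# Hypothetical living areas, in square feet
sqft = pd.Series([500, 800, 1200, 1500, 2100, 2500, 3000, 4500])

# bins=4 splits the range 500..4500 into 4 equal-width intervals
counts, edges = np.histogram(sqft, bins=4)
print(edges)    # interval boundaries
print(counts)   # number of values falling in each interval
```

A larger bins value means narrower intervals and therefore a more detailed, but noisier, histogram.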

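Joining multiple sheets

Syntax

"Multiple sheets" here means several sheets inside one xlsx file, not several xlsx files
An inner join keeps only the rows whose key appears in both sheets
A left join keeps every row of the left sheet, plus the matching data from the right sheet
A right join keeps every row of the right sheet, plus the matching data from the left sheet

import pandas as pd

a = pd.read_excel('file_to_read.xlsx', sheet_name='sheet_name')
b = pd.read_excel('file_to_read.xlsx', sheet_name='sheet_name')

# Join types
table = left.merge(right, on='key_column')	# inner join
table = left.merge(right, how='left', on='key_column')	# left join
table = left.merge(right, how='right', on='key_column')	# right join
table = left.merge(right, left_on='left_key', right_on='right_key')	# when the key columns have different names

Sample data

# Students sheet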
ID	Name
1	张三
3	李四
5	王五
7	老六
9	陈七
10	老八
11	李雷
12	韩梅梅

# Scores sheet
ID	Score
1	80
2	75
3	78
4	83
5	92
6	81
7	79
8	81
9	88
11	73
13	85
14	84
15	77

Inner join

Keep only the rows whose ID appears in both the Students sheet and the Scores sheet

import pandas as pd

students = pd.read_excel('4.xlsx', sheet_name='Students')		# sheet 1
scores = pd.read_excel('4.xlsx', sheet_name='Scores')		# sheet 2
table = students.merge(scores, on='ID')		# inner join
print(table)


# Result
"""
   ID Name  Score
0   1   张三     80
1   3   李四     78
2   5   王五     92
3   7   老六     79
4   9   陈七     88
5  11   李雷     73
"""

Left join

Keep every row of the Students sheet, plus the Scores data for the IDs that appear in both sheets

import pandas as pd

students = pd.read_excel('4.xlsx', sheet_name='Students')
scores = pd.read_excel('4.xlsx', sheet_name='Scores')

table = students.merge(scores, how='left', on='ID').fillna(0)   # left join, with missing values set to 0
table.Score = table['Score'].astype(int)    # cast Score to int; the missing values had turned it into float
print(table)


# Result
"""
   ID Name  Score
0   1   张三     80
1   3   李四     78
2   5   王五     92
3   7   老六     79
4   9   陈七     88
5  10   老八      0
6  11   李雷     73
7  12  韩梅梅      0
"""

Right join

Keep every row of the Scores sheet, plus the Students data for the IDs that appear in both sheets

import pandas as pd

students = pd.read_excel('4.xlsx', sheet_name='Students')
scores = pd.read_excel('4.xlsx', sheet_name='Scores')

table = students.merge(scores, how='right', on='ID').fillna('NULL')   # right join, with missing values set to NULL
table.Score = table['Score'].astype(int)    # make sure Score is int (a right join keeps every Scores row, so this column has no missing values)
print(table)


# Result
"""
    ID  Name  Score
0    1    张三     80
1    2  NULL     75
2    3    李四     78
3    4  NULL     83
4    5    王五     92
5    6  NULL     81
6    7    老六     79
7    8  NULL     81
8    9    陈七     88
9   11    李雷     73
10  13  NULL     85
11  14  NULL     84
12  15  NULL     77
"""

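Data validation

Syntax

Define a function that validates the data, then call it through apply()
With axis=0, apply() passes the data top to bottom (one column at a time); with axis=1, left to right (one row at a time)

import pandas as pd

students = pd.read_excel('file_to_read.xlsx')
students.apply(func, axis=value)

Catching invalid data

Valid scores lie between 0 and 100; report every row whose score falls outside that range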

import pandas as pd

def score_validation(row):
    """Validate the score"""

    # Validate with assert
    try:
        assert 0 <= row['Score'] <= 100
    except AssertionError:
        print(f"Row with ID {row['ID']}: student {row['Name']} has an invalid score --> {row['Score']}")

    # Validate with if
    # if not 0 <= row['Score'] <= 100:
    #     print(f"Row with ID {row['ID']}: student {row['Name']} has an invalid score --> {row['Score']}")


students = pd.read_excel('5.xlsx')
students.apply(score_validation, axis=1)        # axis=1 goes left to right (row by row)



# Result
"""
Row with ID 1: student 张三 has an invalid score --> -5.0
Row with ID 4: student 老六 has an invalid score --> 111.0
Row with ID 6: student 老八 has an invalid score --> 120.0
"""


# Data
"""
ID	Name	Score
1	张三	-5
2	李四	75
3	王五	78
4	老六	111
5	陈七	92
6	老八	120
7	李雷	79.5
8	韩梅梅	0
"""

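Splitting data

Syntax

When no separator is given, the text is split on whitespace (spaces, tabs, newlines); a specific separator can be passed instead
expand=True splits the pieces into separate columns; the default is False

import pandas as pd

students = pd.read_excel('file_to_read.xlsx')
temp_df = students['column'].str.split('separator', expand=True)

Split example

import pandas as pd

students = pd.read_excel('5.xlsx', index_col='ID')
temp_df = students['Name'].str.split(expand=True)   # DataFrame holding the names split on whitespace
students['姓氏'] = temp_df[0]     # new columns for the pieces (姓氏 = family name, 名字 = given name)
students['名字'] = temp_df[1]
print(students)


# Result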
"""
    Name 姓氏  名字
ID             
1    张 三  张   三
2    李 四  李   四
3    王 五  王   五
4    老 六  老   六
5    陈 七  陈   七
6    老 八  老   八
7    李 雷  李   雷
8   韩 梅梅  韩  梅梅
"""


# Data (note the space between the family name and the given name)
"""
ID	Name
1	张 三
2	李 四
3	王 五
4	老 六
5	陈 七
6	老 八
7	李 雷
8	韩 梅梅
"""

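Statistical functions

Syntax

First create a temporary DataFrame by selecting the columns to compute on with a list
Then call sum() or mean() on it, column by column or row by row, to get the totals or the averages
Finally write the results back into the original DataFrame

import pandas as pd

students = pd.read_excel('file_to_read.xlsx')
temp_row = students[['column1', 'column2', 'column3']]     # temporary DataFrame holding the working data
row_sum = temp_row.sum(axis=1)      # each person's total (row by row)
row_mean = temp_row.mean(axis=1)    # each person's average

Example

For each person, compute the total and the average over the three tests
Compute the average of each test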

import pandas as pd

students = pd.read_excel('5.xlsx', index_col='ID')
temp_row = students[['Test_1', 'Test_2', 'Test_3']]     # temporary DataFrame holding only the scores
row_sum = temp_row.sum(axis=1)      # each person's total (row by row)
row_mean = temp_row.mean(axis=1)    # each person's average

# Write the data back to students
students['总分'] = row_sum
students['平均分'] = row_mean
# print(students)

# Compute the class averages
col_mean = students[['Test_1', 'Test_2', 'Test_3', '总分', '平均分']].mean(axis=0)   # average column by column
col_mean['Name'] = '平均分'
students = students.append(col_mean, ignore_index=True)     # add the result to students as the last row
# DataFrame.append() was removed in pandas 2.0; on newer versions use pd.concat() with a one-row DataFrame instead
print(students)


# Result
  Name  Test_1  Test_2  Test_3     总分        平均分
0   张三  80.000    88.0  85.000  253.0  84.333333
1   李四  75.000    73.0  80.000  228.0  76.000000
2   王五  78.000    85.0  88.000  251.0  83.666667
3   老六  83.000    84.0  79.000  246.0  82.000000
4   陈七  92.000    77.0  85.000  254.0  84.666667
5   老八  81.000    79.0  89.000  249.0  83.000000
6   李雷  79.000    82.0  88.000  249.0  83.000000
7  韩梅梅  81.000    80.0  85.000  246.0  82.000000
8  平均分  81.125    81.0  84.875  247.0  82.333333


# Data
"""
ID	Name	Test_1	Test_2	Test_3
1	张三	80	88	85
2	李四	75	73	80
3	王五	78	85	88
4	老六	83	84	79
5	陈七	92	77	85
6	老八	81	79	89
7	李雷	79	82	88
8	韩梅梅	81	80	85
"""

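Data cleaning

Syntax

To remove duplicates judged on several columns, pass a list as the subset parameter of drop_duplicates()
keep='first' (the default) keeps the first occurrence and keep='last' keeps the last one
duplicated() returns False for a row that is not a duplicate and True for a row that repeats an earlier one

import pandas as pd

students = pd.read_excel('file_to_read.xlsx')
students.drop_duplicates(subset='column', keep='first')		# drop rows that repeat a value in the given column; keep is 'first' or 'last'
students.duplicated(subset='column')		# Series mapping each index label to whether the row is a duplicate

students.isnull()        # whether each cell is empty
students.dropna(inplace=True)    # drop every row that contains an empty cell
students.dropna(subset=['column', 'column'])	# drop rows with an empty cell in the given columns

students.fillna(value)        		 # replace all empty cells with the given value
students['column'].fillna(value)        # replace the empty cells of one column with the given value

Removing duplicates

import pandas as pd

students = pd.read_excel('5.xlsx', index_col='ID')
students.drop_duplicates(subset='Name', inplace=True)	# drop the rows whose Name repeats an earlier row
print(students)


# Result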
"""
   Name  Test_1  Test_2  Test_3
ID                             
1    张三      80      88      85
2    李四      75      73      80
3    王五      78      85      88
4    老六      83      84      79
5    陈七      92      77      85
6    老八      81      79      89
7    李雷      79      82      88
8   韩梅梅      81      80      85
"""

# Data (the last rows are the duplicates)
"""
ID	Name	Test_1	Test_2	Test_3
1	张三	80	88	85
2	李四	75	73	80
3	王五	78	85	88
4	老六	83	84	79
5	陈七	92	77	85
6	老八	81	79	89
7	李雷	79	82	88
8	韩梅梅	81	80	85
9	老八	81	79	89
10	张三	80	88	85
11	老六	83	84	79
"""

Extracting the duplicated rows

import pandas as pd

students = pd.read_excel('5.xlsx', index_col='ID')
temp = students.duplicated(subset='Name')

if not temp.any():
    print("数据无重复")
else:
    print("数据有重复")
    temp = temp[temp]  # keep only the duplicates, i.e. the entries where temp is True
    # temp = temp[temp == True]		# same as above
    print(students.loc[temp.index])


# Result
"""
数据有重复
   Name  Test_1  Test_2  Test_3
ID                             
9    老八      81      79      89
10   张三      80      88      85
11   老六      83      84      79
"""

Handling empty cells

import pandas as pd

students = pd.read_excel('5.xlsx', index_col='ID')
# print(students.isnull())        # whether each cell is empty
students.dropna(inplace=True)    # drop every row that contains an empty cell
# students.dropna(subset=['Test_1'], inplace=True)    # drop rows with an empty cell in the given column
# students.fillna(0, inplace=True)        # replace all empty cells with 0

print(students)


# Result
"""
   Name  Test_1  Test_2  Test_3
ID                             
1    张三    80.0    88.0    85.0
2    李四    75.0    73.0    80.0
4    老六    83.0    84.0    79.0
6    老八    81.0    79.0    89.0
"""

# Data
"""
ID	Name	Test_1	Test_2	Test_3
1	张三	80	88	85
2	李四	75	73	80
3	王五	78	85	
4	老六	83	84	79
5	陈七	92		85
6	老八	81	79	89
7	李雷		82	88
8	韩梅梅	81	80	
"""

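Reading data

Syntax

A csv or txt file can be produced from Excel by choosing that format in Save As

import pandas as pd

pd.read_csv('csv_or_txt_file', sep='separator')

Reading a csv file

Create a txt file, paste in the data below, save it, and change the extension to .csv

import pandas as pd

students = pd.read_csv('1111.csv')
print(students)


# Result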
"""
   ID Name  Test_1  Test_2  Test_3
0   1   张三      80      88      85
1   2   李四      75      73      80
2   3   王五      78      85      88
3   4   老六      83      84      79
4   5   陈七      92      77      85
5   6   老八      81      79      89
6   7   李雷      79      82      88
7   8  韩梅梅      81      80      85
"""


# Data
"""
ID,Name,Test_1,Test_2,Test_3
1,张三,80,88,85
2,李四,75,73,80
3,王五,78,85,88
4,老六,83,84,79
5,陈七,92,77,85
6,老八,81,79,89
7,李雷,79,82,88
8,韩梅梅,81,80,85
"""

Reading a txt file

Create a txt file and paste in the data below

import pandas as pd

students = pd.read_csv('5.txt', sep='|')	# use | as the separator
print(students)


# Result
Same as above

# Data
"""
ID|Name|Test_1|Test_2|Test_3
1|张三|80|88|85
2|李四|75|73|80
3|王五|78|85|88
4|老六|83|84|79
5|陈七|92|77|85
6|老八|81|79|89
7|李雷|79|82|88
8|韩梅梅|81|80|85
"""

Coloring data

Syntax

import pandas as pd

students = pd.read_excel('file_to_read.xlsx')
students.style.applymap(func, subset=['column1', 'column2'])	# element by element
students.style.apply(func, subset=['column1', 'column2'])		# a whole column (or row) at a time

Example

Note that PyCharm does not display the colors; write the code in Jupyter instead
Scores below 80 are shown in red text, and the highest score of each test gets a green background

import pandas as pd


def low_score_red(s):
    """Show scores below 80 in red"""
    color = 'red' if s < 80 else 'black'
    return f"color:{color}"


def high_score_green(col):
    """Give the highest score in the column a green background"""
    return ['background-color:lime' if s == col.max() else 'background-color:white' for s in col]

students = pd.read_excel('D:\\Project\\python\\pandas\\5.xlsx', index_col='ID')
students.style.applymap(low_score_red, subset=['Test_1', 'Test_2', 'Test_3'])\
    .apply(high_score_green, subset=['Test_1', 'Test_2', 'Test_3'])


# Data
"""
ID	Name	Test_1	Test_2	Test_3
1	张三	80	88	85
2	李四	75	73	80
3	王五	78	85	88
4	老六	83	84	79
5	陈七	92	77	85
6	老八	81	79	89
7	李雷	79	82	88
8	韩梅梅	81	80	85
"""