05 | Python Scientific Computing: Pandas

陈旸




In the previous chapter we covered NumPy, an important third-party library for Python. Today I will introduce another Python tool: Pandas.
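Pandas is used very frequently in data analysis work. One reason is that DataFrame, the basic data structure Pandas provides, maps closely onto JSON, so converting between the two is easy. Another is that when your day-to-day data cleaning is not too complicated, a few lines of Pandas code are usually enough to tidy the data.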
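To make the DataFrame and JSON fit concrete, here is a minimal sketch of a round trip. The sample records and the variable names (`records_json`, `round_trip`) are made up for illustration; `orient="records"` is just one of the layouts `to_json` supports, chosen here because it mirrors a JSON array of objects.

```python
import json

import pandas as pd

# Hypothetical sample data: a JSON array with one object per student.
records_json = (
    '[{"name": "ZhangFei", "Chinese": 66, "English": 65},'
    ' {"name": "GuanYu", "Chinese": 95, "English": 85}]'
)

# JSON -> Python list of dicts -> DataFrame (one row per object).
records = json.loads(records_json)
df = pd.DataFrame(records)
print(df)

# DataFrame -> JSON text, again as an array of row objects.
json_text = df.to_json(orient="records")
round_trip = json.loads(json_text)
print(round_trip)
```

Each JSON object becomes a row and each key becomes a column, which is why the conversion needs so little code.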
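Pandas can be described as a toolkit built on top of NumPy that adds higher-level data structures and analysis capabilities. In NumPy, everything revolves around the ndarray. So what are the core data structures in Pandas?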
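Below I will focus on the two core data structures, Series and DataFrame, which represent a one-dimensional sequence and a two-dimensional table respectively. Built on these two structures, Pandas can import, clean, process, summarize, and export data.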
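Data structures: Series and DataFrame

A Series is a fixed-length dictionary sequence. It is called fixed-length because it is stored as what amounts to two ndarrays (one for the index, one for the values). This is also its biggest difference from a dictionary, where the number of elements is not fixed.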
A Series has two basic attributes: index and values. By default the index is an increasing integer sequence 0, 1, 2, ..., but we can also specify the index ourselves, for example index=['a', 'b', 'c', 'd'].
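The two attributes can be seen in a short sketch. The values 1 to 4 and the names `s1`, `s2`, `s3` are arbitrary choices for illustration, not taken from the lesson.

```python
import numpy as np
import pandas as pd

# Default index: 0, 1, 2, 3
s1 = pd.Series([1, 2, 3, 4])
# Custom index labels
s2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
# A dict can also be turned into a Series; its keys become the index
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

print(s1.index)    # the default integer index
print(s1.values)   # the underlying NumPy array
print(s2.index)    # the custom labels
print(s2['b'])     # label-based lookup, much like a dictionary
print(s3)
```

Here `s2` and `s3` hold the same labels and values; building from a dict simply takes the keys as the index.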

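Pandas, an important tool for scientific computing in Python, provides two core data structures, Series and DataFrame, which resemble a fixed-length dictionary sequence and a database table respectively. Pandas makes it easy to import, clean, process, summarize, and export data, and it offers a rich set of data-cleaning tools and statistical functions, so users can quickly get to know their data and then process and analyze it. Pandas can also import data directly from xlsx, csv, and similar files, and write back to them. The article further covers merging tables in Pandas and working with Pandas in a SQL style. In short, Pandas is highly practical for data analysis and processing and is one of the indispensable tools for scientific computing in Python.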

Comments (201)

何楚
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pandas as pd

data = {'Chinese': [66, 95, 93, 90, 80, 80],
        'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['English', 'Math', 'Chinese'])
# Remove duplicate rows
df = df.drop_duplicates()
# Reorder the columns
cols = ['Chinese', 'English', 'Math']
df = df.filter(cols, axis=1)
# Rename the columns to Chinese
df.rename(columns={'Chinese': '语文', 'English': '英语', 'Math': '数学'}, inplace=True)

def total_score(df):
    df['总分'] = df['语文'] + df['英语'] + df['数学']
    return df

# Compute the total score with the apply method from the lesson
df = df.apply(total_score, axis=1)
# Alternatively, sum the columns directly
# df['总分'] = df['语文'] + df['英语'] + df['数学']
# Sort by total score, highest first; the missing value is still present here
df.sort_values(['总分'], ascending=[False], inplace=True)
# Print the report; 张飞 has a missing value
print(df.isnull().sum())
print(df.describe())
print(df)
# Fill 张飞's missing value with the mean Math score
# (assigned back rather than filled in place on a column selection)
df['数学'] = df['数学'].fillna(df['数学'].mean())
# Recompute the totals and print the report again
df = df.apply(total_score, axis=1)
print(df.isnull().sum())
print(df.describe())
print(df)
Author reply: Nicely organized. Good to finally see 张飞's Math score filled in with the mean.
2018-12-24
6
88
daydreamer
"""
Pandas has two important data structures: Series and DataFrame.
Series: a fixed-length dictionary sequence with two basic attributes, index and values.
DataFrame: a data structure similar to a database table. We can even join, merge,
and query DataFrame data the way we would operate on database tables.
Common methods for cleaning data with a DataFrame:
Dropping, deduplicating, renaming: drop(), drop_duplicates(), rename()
Stripping whitespace: strip(), lstrip(), rstrip()
Changing case: upper(), lower(), title()
Changing the data type: astype()
Finding missing values: isnull
apply
"""
from pandas import DataFrame

# Scores of students
scores = {'Chinese': [66, 95, 95, 90, 80, 80],
          'English': [65, 85, 92, 80, 90, 90],
          'Math': [None, 98, 96, 77, 90, 90],
          'Total': [None, None, None, None, None, None]}
df = DataFrame(scores, index=['Zhang Fei', 'Guan Yu', 'Zhao Yun', 'Huang Zhong', 'Dian Wei', 'Dian Wei'],)
# Data cleaning.
# Remove the duplicated record.
df = df.drop_duplicates()
# print(df)
# Calculate the total scores.
df['Total'] = df.sum(axis=1)
print(df.describe())
Author reply: df['Total'] = df.sum(axis=1) is a nicely concise way to compute the sum.
2018-12-24
2
21
知悉者也
import numpy as np
import pandas as pd

stu_score = pd.DataFrame([['张飞', 66, 65, np.nan],
                          ['关羽', 95, 85, 98],
                          ['赵云', 95, 92, 96],
                          ['黄忠', 90, 88, 77],
                          ['典韦', 80, 90, 90],
                          ['典韦', 80, 90, 90]],
                         columns=['姓名', '语文', '英语', '数学'])
stu_score = stu_score.set_index('姓名')  # use one column as the index
stu_score = stu_score.ffill(axis=1)  # fill missing values from the column to the left
stu_score['总分'] = stu_score.apply(sum, axis=1)
stu_score
Author reply: Correct, and you use it very fluently.
2019-11-07
2
7
董大琳儿
I didn't understand any of it, and I feel a faint sadness~~~
Author reply: Take your time, 董大琳.
2019-06-20

6
Answer Liu
import numpy as np
import pandas as pd

df6 = pd.DataFrame(
    {"语文": [66, 95, 95, 90, 80, 80],
     "数学": [65, 85, 92, 88, 90, 90],
     "英语": [np.nan, 98, 96, 77, 90, 90]},
    index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦']
)
# Remove duplicates
df7 = df6.drop_duplicates()
# Replace NaN values
df8 = df7.fillna(df7['英语'].mean())
# Add a column with each student's total
df8['sum'] = [df8.loc[name].sum() for name in df8.index]
# Sort by total, highest first
df9 = df8.sort_values(by="sum", ascending=False)
print(df9)
Author reply: Good Job
2019-10-22

5
qinggeouye
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    '姓名': ['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
    '语文': [66, 95, 95, 90, 80, 80],
    '英语': [65, 85, 92, 88, 90, 90],
    '数学': [np.nan, 98, 96, 77, 90, 90],
})
print(scores)

# Find the columns that contain missing values
isNaN = scores.isna().any()  # isnull(), isnull().any()
isNaN = isNaN[isNaN == True]
print(scores[isNaN.index])

# Fill missing values with the column mean
for col in isNaN.index:
    scores[col] = scores[col].fillna(scores[col].mean())
print(scores)

# Remove unnecessary rows (missing values)
# scores = scores.drop(index=[0])
# scores = scores.dropna()

# Remove duplicate rows
scores = scores.drop_duplicates()
print(scores)

# Add a new column '总和' (total); numeric_only skips the '姓名' text column
# scores['总和'] = scores['语文'] + scores['数学'] + scores['英语']
scores['总和'] = scores.sum(axis=1, numeric_only=True)
print(scores)
Author reply: Good Job
2019-11-03
2
4
龟仙人
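Hello. You don't seem to have stated anywhere that your environment is Python 2.7, while most of us are on 3.0, which causes some problems here and there. Also, how do I join the WeChat group?
Author reply: The later examples use Python 3. For the WeChat group, please contact the operations staff and ask them to add you.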
2019-01-27
3
4
Grandia_Z
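Following the lesson, I wrote df2 = df2.drop(columns=['Chinese']), and this is what came back:
TypeError                                 Traceback (most recent call last)
<ipython-input-25-8116650c61ac> in <module>()
----> 1 df2 = df2.drop(columns=['Chinese'])
TypeError: drop() got an unexpected keyword argument 'columns'
What does this mean?
Author reply: It runs correctly for me with no problems. I am using Python 2.7. Also, did you import DataFrame and the pandas package at the top? You can contact the editor to join the WeChat group, and I will take a look for you.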
2018-12-24

3
窝窝头
import pandas as pd

data = {'语文': [66, 95, 93, 90, 80, 80],
        '英语': [65, 85, 92, 88, 90, 90],
        '数学': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=[u'英语', u'数学', u'语文'])
df = df.dropna()
df = df.drop_duplicates()
df[u'总和'] = df[u'语文'] + df[u'英语'] + df[u'数学']
df.head()
Author reply: Good Job
2019-06-24

2
青石
from pandas import DataFrame

def score(df):
    df['score'] = df[u'Chinese'] + df[u'English'] + df[u'Math']
    return df

data = {'Chinese': [66, 95, 95, 90, 80, 80],
        'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei', 'DianWei'],
               columns=['Chinese', 'English', 'Math'])
df = df.drop_duplicates().fillna(0)
df = df.apply(score, axis=1)
print(df)
Author reply: Good Job
2019-04-11

2
