Python Programming Series 7 - The Pandas Library

pandas is built on top of NumPy and makes NumPy-centric applications easier to work with.

pandas provides three main data structures:

Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both are also close to Python's built-in list; the difference is that a list may hold elements of different data types, while an array or Series stores a single data type, which uses memory more efficiently and speeds up computation.

DataFrame: a two-dimensional, tabular data structure, with much functionality similar to R's data.frame. A DataFrame can be thought of as a container of Series. The material below focuses mainly on DataFrame.

Panel: a three-dimensional array, which can be thought of as a container of DataFrames.

For more features, see the official pandas documentation: http://pandas-docs.github.io/pandas-docs-travis/index.html

In [1]:
# First, import the libraries
import pandas as pd
import numpy as np

Series

A Series consists of a set of data (of any NumPy data type) together with an associated set of data labels (the index). The simplest Series can be created from data alone, for example by passing a list object:

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[2]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Get the index of the Series:

In [3]:
s.index
Out[3]:
RangeIndex(start=0, stop=6, step=1)
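Beyond the default RangeIndex, a Series can carry an explicit label index; elements are then reachable by label or by position. A minimal sketch (the labels here are illustrative):

```python
import pandas as pd

# A Series with an explicit label index
s = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
print(s['b'])     # access by label
print(s.iloc[0])  # access by integer position
```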

DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns. Within each column all values share one data type, while different columns may hold different types (numeric, string, boolean, etc.). By analogy with a database table: each row of a DataFrame is a record, identified by an element of the Index, and each column is a field, i.e. an attribute of that record. A DataFrame has both a row index and a column index, and can be viewed as a dict of Series sharing the same index.
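The "dict of Series sharing one index" view can be made concrete (a minimal sketch; names and values are illustrative):

```python
import pandas as pd

# Two Series with partially overlapping indexes; missing labels become NaN
d = {'one': pd.Series([1.0, 2.0], index=['a', 'b']),
     'two': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df)  # 'one' has NaN at label 'c', since that Series lacks it
```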

Create a DataFrame from a numpy array, with a datetime index and labeled columns:

In [4]:
dates = pd.date_range('20130101', periods=6)
dates
Out[4]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [5]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Out[5]:
                   A         B         C         D
2013-01-01 -0.023872 -0.662149  0.035905 -0.673801
2013-01-02 -1.294255  0.338153  0.073737 -0.158299
2013-01-03  0.298498  1.069045  1.249769  0.734961
2013-01-04 -0.883381 -0.115720 -0.878370  0.265243
2013-01-05 -0.090449 -0.084664 -1.489024 -0.838460
2013-01-06  1.255759  1.121589  1.469664 -1.234349

Create a DataFrame by passing a dict of objects that can be converted to something series-like:

In [6]:
df2 = pd.DataFrame({'A':1.,
                    'B':pd.Timestamp('20130102'),
                    'C':pd.Series(1, index=list(range(4)),dtype='float32'),
                    'D':np.array([3] * 4, dtype='int32'),
                    'E':pd.Categorical(['test','train', 'test','train']),
                    'F':'foo'
                   })
df2
Out[6]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

View the data types of the columns:

In [7]:
df2.dtypes
Out[7]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Tab auto-completion picks up all attributes as well as user-defined columns.

Viewing data

We use data fetched from the platform as the running example:

In [55]:
# Fetch the open, high, low and close prices of Ping An Bank for recent trading days.
df = get_price('000001.XSHE','2017-06-01','2017-06-14','1d',['open','high','low','close'])
df
Out[55]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08

View the first few rows:

In [10]:
df.head()
Out[10]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13

View the last few rows:

In [11]:
df.tail()
Out[11]:
            open  high   low  close
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08

Display the index, the columns, and the underlying numpy data:

In [12]:
df.index
Out[12]:
DatetimeIndex(['2017-06-01', '2017-06-02', '2017-06-05', '2017-06-06',
               '2017-06-07', '2017-06-08', '2017-06-09', '2017-06-12',
               '2017-06-13', '2017-06-14'],
              dtype='datetime64[ns]', freq=None)

View the column names of the DataFrame:

In [13]:
df.columns
Out[13]:
Index(['open', 'high', 'low', 'close'], dtype='object')

View the values of the DataFrame:

In [15]:
df.values
Out[15]:
array([[ 9.2 ,  9.23,  9.12,  9.19],
       [ 9.18,  9.29,  9.14,  9.17],
       [ 9.13,  9.17,  8.99,  9.03],
       [ 9.01,  9.06,  8.99,  9.04],
       [ 9.02,  9.15,  9.01,  9.13],
       [ 9.11,  9.15,  9.08,  9.13],
       [ 9.15,  9.22,  9.12,  9.15],
       [ 9.15,  9.19,  9.1 ,  9.11],
       [ 9.11,  9.14,  9.05,  9.12],
       [ 9.12,  9.13,  9.04,  9.08]])

describe() gives a quick statistical summary of the data:

In [16]:
df.describe()
Out[16]:
            open      high        low      close
count  10.000000  10.00000  10.000000  10.000000
mean    9.118000   9.17300   9.064000   9.115000
std     0.061608   0.06343   0.056016   0.052122
min     9.010000   9.06000   8.990000   9.030000
25%     9.110000   9.14250   9.017500   9.087500
50%     9.125000   9.16000   9.065000   9.125000
75%     9.150000   9.21250   9.115000   9.145000
max     9.200000   9.29000   9.140000   9.190000

Transpose the data:

In [17]:
df.T
Out[17]:
       2017-06-01  2017-06-02  2017-06-05  2017-06-06  2017-06-07  2017-06-08  2017-06-09  2017-06-12  2017-06-13  2017-06-14
open         9.20        9.18        9.13        9.01        9.02        9.11        9.15        9.15        9.11        9.12
high         9.23        9.29        9.17        9.06        9.15        9.15        9.22        9.19        9.14        9.13
low          9.12        9.14        8.99        8.99        9.01        9.08        9.12        9.10        9.05        9.04
close        9.19        9.17        9.03        9.04        9.13        9.13        9.15        9.11        9.12        9.08

Sort by an axis:

In [18]:
df.sort_index(axis=1,ascending=False)
Out[18]:
            open   low  high  close
2017-06-01  9.20  9.12  9.23   9.19
2017-06-02  9.18  9.14  9.29   9.17
2017-06-05  9.13  8.99  9.17   9.03
2017-06-06  9.01  8.99  9.06   9.04
2017-06-07  9.02  9.01  9.15   9.13
2017-06-08  9.11  9.08  9.15   9.13
2017-06-09  9.15  9.12  9.22   9.15
2017-06-12  9.15  9.10  9.19   9.11
2017-06-13  9.11  9.05  9.14   9.12
2017-06-14  9.12  9.04  9.13   9.08

Sort by values:

In [19]:
df.sort_values(by= 'open',ascending = False)
Out[19]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-05  9.13  9.17  8.99   9.03
2017-06-14  9.12  9.13  9.04   9.08
2017-06-08  9.11  9.15  9.08   9.13
2017-06-13  9.11  9.14  9.05   9.12
2017-06-07  9.02  9.15  9.01   9.13
2017-06-06  9.01  9.06  8.99   9.04

Selecting data

Selecting data by subscript:

df['open'], df.open
The two statements above are equivalent: both return the column of df named open, as a Series.
df[0:3], df['2017-06-01':'2017-06-05']
Subscript indexing selects records of the DataFrame. As with a Python list, DataFrame positions start at 0, and a positional range is half-open, so [0:3] selects the three records at positions 0-2.
Equivalently, records can be selected with a starting and ending index label, e.g. df['a':'b']. Note that, unlike positional ranges, a label range also includes the record at the ending label; see the examples below.
Both forms return a DataFrame.
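Since the platform's get_price is not needed to see this behavior, the half-open vs. inclusive slicing rules can be verified on a small stand-in DataFrame (names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# Positional slicing is half-open: rows at positions 0 and 1 only
print(df[0:2])
# Label slicing includes BOTH endpoints: rows 'a' through 'c'
print(df['a':'c'])
```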

Select a single column:

In [20]:
df['open']
Out[20]:
2017-06-01    9.20
2017-06-02    9.18
2017-06-05    9.13
2017-06-06    9.01
2017-06-07    9.02
2017-06-08    9.11
2017-06-09    9.15
2017-06-12    9.15
2017-06-13    9.11
2017-06-14    9.12
Name: open, dtype: float64

Select multiple columns:

In [21]:
df[['open','close']]
Out[21]:
            open  close
2017-06-01  9.20   9.19
2017-06-02  9.18   9.17
2017-06-05  9.13   9.03
2017-06-06  9.01   9.04
2017-06-07  9.02   9.13
2017-06-08  9.11   9.13
2017-06-09  9.15   9.15
2017-06-12  9.15   9.11
2017-06-13  9.11   9.12
2017-06-14  9.12   9.08

Select multiple rows:

In [22]:
df[0:3]
Out[22]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
In [23]:
df['2017-06-01':'2017-06-05']
Out[23]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03

Selecting data by label:

df.loc[row label, column label]

df.loc['a':'b']  # select rows a through b

df.loc[:, 'open']  # select the open column

The first argument of df.loc is the row label; the second is the column label (optional, defaulting to all columns). Each argument can be either a list or a single value: if both are lists the result is a DataFrame, otherwise a Series.

PS: loc is short for location.

In [27]:
df.loc['2017-06-01','open']
Out[27]:
9.1999999999999993
In [30]:
df.loc['2017-06-01':'2017-06-06']
Out[30]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
In [31]:
df.loc[:, 'open']
Out[31]:
2017-06-01    9.20
2017-06-02    9.18
2017-06-05    9.13
2017-06-06    9.01
2017-06-07    9.02
2017-06-08    9.11
2017-06-09    9.15
2017-06-12    9.15
2017-06-13    9.11
2017-06-14    9.12
Name: open, dtype: float64
In [32]:
df.loc['2017-06-01':'2017-06-06','open']
Out[32]:
2017-06-01    9.20
2017-06-02    9.18
2017-06-05    9.13
2017-06-06    9.01
Name: open, dtype: float64

Selecting data by position:

df.iloc[row position, column position]

df.iloc[1,1]  # select the value in the second row, second column; returns a scalar

df.iloc[[0,2],:]  # select the first and third rows

df.iloc[0:2,:]  # select rows from the first up to (not including) the third

df.iloc[:,1]  # select the second column of all records; returns a Series

df.iloc[1,:]  # select the second row; returns a Series

PS: iloc is short for integer location.

In [33]:
df.iloc[1,1] # select the value in the second row, second column; returns a scalar
Out[33]:
9.2899999999999991
In [56]:
df.iloc[[0,2],:] # select the first and third rows
Out[56]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-05  9.13  9.17  8.99   9.03
In [57]:
df.iloc[0:2,:] # select rows from the first up to (not including) the third
Out[57]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
In [58]:
df.iloc[:,1] # select the second column of all records; returns a Series
Out[58]:
2017-06-01    9.23
2017-06-02    9.29
2017-06-05    9.17
2017-06-06    9.06
2017-06-07    9.15
2017-06-08    9.15
2017-06-09    9.22
2017-06-12    9.19
2017-06-13    9.14
2017-06-14    9.13
Name: high, dtype: float64
In [60]:
df.iloc[1,:] # select the second row; returns a Series
Out[60]:
open     9.18
high     9.29
low      9.14
close    9.17
Name: 2017-06-02 00:00:00, dtype: float64

A more general slicing method is .ix, which decides automatically, from the type of the given index, whether to slice by position or by label.

In [62]:
df.ix[1,1]
Out[62]:
9.2899999999999991
In [63]:
df.ix['2017-06-01':'2017-06-05']
Out[63]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
In [65]:
df.ix['2017-06-05','open']
Out[65]:
9.1300000000000008
In [66]:
df.ix[1,'open']
Out[66]:
9.1799999999999997
In [68]:
df.ix['2017-06-01',0]
Out[68]:
9.1999999999999993
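Note that .ix has since been deprecated (pandas 0.20) and removed (pandas 1.0); in current pandas the same selections are written explicitly with .iloc and .loc. A minimal sketch with illustrative values:

```python
import pandas as pd

# A small stand-in for the platform DataFrame above
df = pd.DataFrame({'open': [9.20, 9.18], 'high': [9.23, 9.29]},
                  index=pd.to_datetime(['2017-06-01', '2017-06-02']))

# Positional selection, equivalent to the old df.ix[1, 1]
print(df.iloc[1, 1])
# Label-based selection, equivalent to the old df.ix['2017-06-01', 'open']
print(df.loc['2017-06-01', 'open'])
```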

Slicing data with boolean conditions:

df[boolean condition]

df[df.one >= 2]  # a single condition

df[(df.one >= 1) & (df.one < 3)]  # multiple conditions combined
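These snippets can be tried on a toy DataFrame (self-contained, since get_price is platform-specific):

```python
import pandas as pd

# Toy DataFrame matching the 'one' column used in the snippets above
df = pd.DataFrame({'one': [0, 1, 2, 3]})

# A single condition
print(df[df.one >= 2])
# Combined conditions: each comparison needs its own parentheses,
# and & / | are used instead of Python's `and` / `or`
print(df[(df.one >= 1) & (df.one < 3)])
```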

In [69]:
# Filter rows where open is greater than 9.12.
df[df.open > 9.12]
Out[69]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
In [71]:
# Filter rows where open is greater than 9.12 and close is less than 9.17.
df[(df.open > 9.12) & (df.close < 9.17)]
Out[71]:
            open  high   low  close
2017-06-05  9.13  9.17  8.99   9.03
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11

Using conditional filtering to modify data:

In [72]:
df[df>10]
Out[72]:
            open  high  low  close
2017-06-01   NaN   NaN  NaN    NaN
2017-06-02   NaN   NaN  NaN    NaN
2017-06-05   NaN   NaN  NaN    NaN
2017-06-06   NaN   NaN  NaN    NaN
2017-06-07   NaN   NaN  NaN    NaN
2017-06-08   NaN   NaN  NaN    NaN
2017-06-09   NaN   NaN  NaN    NaN
2017-06-12   NaN   NaN  NaN    NaN
2017-06-13   NaN   NaN  NaN    NaN
2017-06-14   NaN   NaN  NaN    NaN

Observe that every value for which the condition is false has become NaN; since no value in df exceeds 10, the whole frame is NaN.

Next, let's set every value greater than 9.12 to 0:

In [73]:
df[df > 9.12] = 0
In [74]:
df
Out[74]:
            open  high   low  close
2017-06-01  0.00  0.00  9.12   0.00
2017-06-02  0.00  0.00  0.00   0.00
2017-06-05  0.00  0.00  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  0.00  9.01   0.00
2017-06-08  9.11  0.00  9.08   0.00
2017-06-09  0.00  0.00  9.12   0.00
2017-06-12  0.00  0.00  9.10   9.11
2017-06-13  9.11  0.00  9.05   9.12
2017-06-14  9.12  0.00  9.04   9.08

All values greater than 9.12 have been replaced with 0.

Use the isin() method to filter rows on the values in a given column:

In [75]:
# Select rows whose high value is 0.00 or 9.00.
df[df['high'].isin([0.00,9.00])]
Out[75]:
            open  high   low  close
2017-06-01  0.00   0.0  9.12   0.00
2017-06-02  0.00   0.0  0.00   0.00
2017-06-05  0.00   0.0  8.99   9.03
2017-06-07  9.02   0.0  9.01   0.00
2017-06-08  9.11   0.0  9.08   0.00
2017-06-09  0.00   0.0  9.12   0.00
2017-06-12  0.00   0.0  9.10   9.11
2017-06-13  9.11   0.0  9.05   9.12
2017-06-14  9.12   0.0  9.04   9.08

Panel

If the platform's get_price is called with multiple stocks, it returns a pandas.Panel object.

Data can be accessed as panel[field, date, stock code].

In [77]:
panel = get_price(['000001.XSHE','000002.XSHE'],'2017-06-01','2017-06-14', '1d', fields=['open','high','low','close'])
panel
Out[77]:
<class 'rqcommons.pandas_patch.HybridDataPanel'>
Dimensions: 4 (items) x 10 (major_axis) x 2 (minor_axis)
Items axis: open to close
Major_axis axis: 2017-06-01 00:00:00 to 2017-06-14 00:00:00
Minor_axis axis: 000001.XSHE to 000002.XSHE

From the printed result we can see:

the fields (Items axis: open to close),

the dates (Major_axis axis: 2017-06-01 00:00:00 to 2017-06-14 00:00:00),

and the stock codes (Minor_axis axis: 000001.XSHE to 000002.XSHE).

In [78]:
# Extract the 'open' data
panel['open',:,:]
Out[78]:
            000001.XSHE  000002.XSHE
2017-06-01         9.20        21.10
2017-06-02         9.18        21.08
2017-06-05         9.13        20.75
2017-06-06         9.01        21.05
2017-06-07         9.02        20.87
2017-06-08         9.11        20.87
2017-06-09         9.15        20.87
2017-06-12         9.15        21.48
2017-06-13         9.11        21.91
2017-06-14         9.12        21.90
In [79]:
# Extract the data for '2017-06-01'
panel[:,'2017-06-01',:]
Out[79]:
             open   high    low  close
000001.XSHE   9.2   9.23   9.12   9.19
000002.XSHE  21.1  21.37  20.93  21.09
In [80]:
# Extract the DataFrame for 000001.XSHE
panel[:,:,'000001.XSHE']
Out[80]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08

Panel operations are largely the same as for DataFrame; the next installment covers DataFrame data processing and wrangling in more detail.
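Note that Panel was deprecated in pandas 0.20 and removed in 0.25; current pandas represents such three-dimensional data as a DataFrame with a MultiIndex. A minimal sketch (the values are illustrative):

```python
import pandas as pd

# Panel-like data as a (date, code) MultiIndex DataFrame
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-06-01', '2017-06-02']),
     ['000001.XSHE', '000002.XSHE']],
    names=['date', 'code'])
df = pd.DataFrame({'open': [9.20, 21.10, 9.18, 21.08]}, index=idx)

# One field for all codes, analogous to panel['open', :, :]
wide = df['open'].unstack('code')
print(wide)
```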

Handling missing data

Drop rows that contain missing values:

In [85]:
# Fetch the open, high, low and close prices of Ping An Bank for recent trading days.
df = get_price('000001.XSHE','2017-06-01','2017-06-14', '1d', fields=['open','high','low','close'])
df = df[df > 9.0] 
df
Out[85]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17   NaN   9.03
2017-06-06  9.01  9.06   NaN   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08
In [86]:
df.dropna()
Out[86]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08

Fill in missing values:

In [87]:
df.fillna(value=0)
Out[87]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  0.00   9.03
2017-06-06  9.01  9.06  0.00   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08

Test whether each value is NaN, returning a boolean mask:

In [88]:
pd.isnull(df)
Out[88]:
             open   high    low  close
2017-06-01  False  False  False  False
2017-06-02  False  False  False  False
2017-06-05  False  False   True  False
2017-06-06  False  False   True  False
2017-06-07  False  False  False  False
2017-06-08  False  False  False  False
2017-06-09  False  False  False  False
2017-06-12  False  False  False  False
2017-06-13  False  False  False  False
2017-06-14  False  False  False  False

Function application and mapping

In [89]:
df.mean()  # column-wise means
Out[89]:
open     9.1180
high     9.1730
low      9.0825
close    9.1150
dtype: float64
In [90]:
df.mean(1)  # row-wise means
Out[90]:
2017-06-01    9.185000
2017-06-02    9.195000
2017-06-05    9.110000
2017-06-06    9.036667
2017-06-07    9.077500
2017-06-08    9.117500
2017-06-09    9.160000
2017-06-12    9.137500
2017-06-13    9.105000
2017-06-14    9.092500
dtype: float64
In [91]:
df.mean(axis=1, skipna=False)  # skipna defaults to True, i.e. missing values are excluded
Out[91]:
2017-06-01    9.1850
2017-06-02    9.1950
2017-06-05       NaN
2017-06-06       NaN
2017-06-07    9.0775
2017-06-08    9.1175
2017-06-09    9.1600
2017-06-12    9.1375
2017-06-13    9.1050
2017-06-14    9.0925
dtype: float64

The methods above are the most commonly used ones; there are many more, which you can explore on your own. A selection is listed below for reference:

  • count — number of non-NA values
  • describe — summary statistics for a Series or for each DataFrame column
  • min, max — minimum and maximum values
  • argmin, argmax — integer positions of the minimum and maximum values
  • idxmin, idxmax — index labels of the minimum and maximum values
  • quantile — sample quantile (0 to 1)
  • sum — sum of values
  • mean — mean of values
  • median — arithmetic median (50% quantile)
  • mad — mean absolute deviation from the mean
  • var — sample variance
  • std — sample standard deviation
  • skew — sample skewness (third moment)
  • kurt — sample kurtosis (fourth moment)
  • cumsum — cumulative sum
  • cummin, cummax — cumulative minimum and maximum
  • cumprod — cumulative product
  • diff — first difference (useful for time series)
  • pct_change — percentage change
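A few of these in action on a short Series (values borrowed from the open column above):

```python
import pandas as pd

s = pd.Series([9.20, 9.18, 9.13, 9.01])

print(s.idxmax())      # index label of the maximum value
print(s.cumsum())      # running total; the last entry equals s.sum()
print(s.pct_change())  # element-wise percentage change; the first entry is NaN
```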

Data wrangling

Pandas provides many methods for easily combining Series, DataFrame and Panel objects under various logical relationships:

  • concat — stack multiple objects together along an axis.
  • append — append rows to a DataFrame.
  • duplicated — mark duplicate rows (they can then be removed with drop_duplicates).

concat

In [92]:
df1 = get_price('000001.XSHE','2017-06-01', '2017-06-14','1d',['open','high','low','close'])
df1
Out[92]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08
In [94]:
df2 = get_price('000001.XSHE','2017-05-15','2017-05-31','1d',['open','high','low','close'])
df2
Out[94]:
            open  high   low  close
2017-05-15  8.89  8.95  8.81   8.86
2017-05-16  8.84  8.85  8.73   8.84
2017-05-17  8.81  8.81  8.75   8.77
2017-05-18  8.72  8.77  8.71   8.73
2017-05-19  8.74  8.76  8.68   8.69
2017-05-22  8.68  8.73  8.61   8.68
2017-05-23  8.67  8.84  8.64   8.79
2017-05-24  8.78  8.83  8.68   8.81
2017-05-25  8.79  9.14  8.78   9.10
2017-05-26  9.08  9.13  9.04   9.10
2017-05-31  9.10  9.23  9.06   9.20
In [95]:
pd.concat([df1,df2],axis=0)
Out[95]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08
2017-05-15  8.89  8.95  8.81   8.86
2017-05-16  8.84  8.85  8.73   8.84
2017-05-17  8.81  8.81  8.75   8.77
2017-05-18  8.72  8.77  8.71   8.73
2017-05-19  8.74  8.76  8.68   8.69
2017-05-22  8.68  8.73  8.61   8.68
2017-05-23  8.67  8.84  8.64   8.79
2017-05-24  8.78  8.83  8.68   8.81
2017-05-25  8.79  9.14  8.78   9.10
2017-05-26  9.08  9.13  9.04   9.10
2017-05-31  9.10  9.23  9.06   9.20

Horizontal concatenation; where the indexes do not align, values are filled with NaN:

In [97]:
pd.concat([df1,df2],axis=1)
Out[97]:
            open  high   low  close  open  high   low  close
2017-05-15   NaN   NaN   NaN    NaN  8.89  8.95  8.81   8.86
2017-05-16   NaN   NaN   NaN    NaN  8.84  8.85  8.73   8.84
2017-05-17   NaN   NaN   NaN    NaN  8.81  8.81  8.75   8.77
2017-05-18   NaN   NaN   NaN    NaN  8.72  8.77  8.71   8.73
2017-05-19   NaN   NaN   NaN    NaN  8.74  8.76  8.68   8.69
2017-05-22   NaN   NaN   NaN    NaN  8.68  8.73  8.61   8.68
2017-05-23   NaN   NaN   NaN    NaN  8.67  8.84  8.64   8.79
2017-05-24   NaN   NaN   NaN    NaN  8.78  8.83  8.68   8.81
2017-05-25   NaN   NaN   NaN    NaN  8.79  9.14  8.78   9.10
2017-05-26   NaN   NaN   NaN    NaN  9.08  9.13  9.04   9.10
2017-05-31   NaN   NaN   NaN    NaN  9.10  9.23  9.06   9.20
2017-06-01  9.20  9.23  9.12   9.19   NaN   NaN   NaN    NaN
2017-06-02  9.18  9.29  9.14   9.17   NaN   NaN   NaN    NaN
2017-06-05  9.13  9.17  8.99   9.03   NaN   NaN   NaN    NaN
2017-06-06  9.01  9.06  8.99   9.04   NaN   NaN   NaN    NaN
2017-06-07  9.02  9.15  9.01   9.13   NaN   NaN   NaN    NaN
2017-06-08  9.11  9.15  9.08   9.13   NaN   NaN   NaN    NaN
2017-06-09  9.15  9.22  9.12   9.15   NaN   NaN   NaN    NaN
2017-06-12  9.15  9.19  9.10   9.11   NaN   NaN   NaN    NaN
2017-06-13  9.11  9.14  9.05   9.12   NaN   NaN   NaN    NaN
2017-06-14  9.12  9.13  9.04   9.08   NaN   NaN   NaN    NaN

append

In [98]:
df1
Out[98]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08
In [99]:
s = df1.iloc[0]
s
Out[99]:
open     9.20
high     9.23
low      9.12
close    9.19
Name: 2017-06-01 00:00:00, dtype: float64
In [100]:
df1.append(s, ignore_index=False) # ignore_index=False keeps the original index
Out[100]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08
2017-06-01  9.20  9.23  9.12   9.19
In [101]:
df1.append(s, ignore_index=True) # ignore_index=True resets the index
Out[101]:
    open  high   low  close
0   9.20  9.23  9.12   9.19
1   9.18  9.29  9.14   9.17
2   9.13  9.17  8.99   9.03
3   9.01  9.06  8.99   9.04
4   9.02  9.15  9.01   9.13
5   9.11  9.15  9.08   9.13
6   9.15  9.22  9.12   9.15
7   9.15  9.19  9.10   9.11
8   9.11  9.14  9.05   9.12
9   9.12  9.13  9.04   9.08
10  9.20  9.23  9.12   9.19

Marking duplicate rows with duplicated

In [102]:
z = df1.append(s, ignore_index=False)
z
Out[102]:
            open  high   low  close
2017-06-01  9.20  9.23  9.12   9.19
2017-06-02  9.18  9.29  9.14   9.17
2017-06-05  9.13  9.17  8.99   9.03
2017-06-06  9.01  9.06  8.99   9.04
2017-06-07  9.02  9.15  9.01   9.13
2017-06-08  9.11  9.15  9.08   9.13
2017-06-09  9.15  9.22  9.12   9.15
2017-06-12  9.15  9.19  9.10   9.11
2017-06-13  9.11  9.14  9.05   9.12
2017-06-14  9.12  9.13  9.04   9.08
2017-06-01  9.20  9.23  9.12   9.19

View which rows are duplicates:

In [104]:
z.duplicated()
Out[104]:
2017-06-01    False
2017-06-02    False
2017-06-05    False
2017-06-06    False
2017-06-07    False
2017-06-08    False
2017-06-09    False
2017-06-12    False
2017-06-13    False
2017-06-14    False
2017-06-01     True
dtype: bool
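duplicated() only marks the repeated rows; to actually remove them, drop_duplicates() can be used. A minimal sketch with illustrative values:

```python
import pandas as pd

# A frame with one repeated row
df = pd.DataFrame({'open': [9.20, 9.18, 9.20],
                   'close': [9.19, 9.17, 9.19]})

print(df.duplicated())       # marks the third row as a repeat of the first
print(df.drop_duplicates())  # drops it, keeping the first occurrence
```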

 

Python Programming Series contents:
1. Python Programming Series 1 - Variables, Strings and Lists
2. Python Programming Series 2 - Dictionaries, Tuples and Sets
3. Python Programming Series 3 - Conditionals and Loops
4. Python Programming Series 4 - Functions
5. Python Programming Series 5 - The NumPy Library
6. Python Programming Series 6 - The SciPy Library
Currently reading> 7. Python Programming Series 7 - The Pandas Library
8. Python Programming Series 8 - The Matplotlib Library