【Let’s Rock】Kalman Filter Applied to Pair Trading

The community has already had posts on pair trading, so I won't repeat the underlying theory here.
The first post is linked below:
Pair Trading (Paper Version)

There is also an earlier, fairly buggy pair-trading post of my own, whose logic has some problems:
Pair Trading

Both of those posts, however, design the strategy around a constant hedge ratio. My main goal here is to turn the static hedge ratio into a dynamic one, to avoid in-sample overfitting.
A popular approach abroad is to borrow the idea of the Kalman filter to hedge dynamically.

Wikipedia has a detailed explanation of the Kalman filter:
https://en.wikipedia.org/wiki/Kalman_filter

A brief explanation in plain terms:
Kalman filtering is an algorithm that uses a linear state-space model, together with the system's observed inputs and outputs, to produce an optimal estimate of the system's state. Because the observations contain noise and disturbances, this optimal estimation can also be viewed as a filtering process.
Filtering is a data-processing technique for removing noise and recovering the underlying signal; when the measurement variance is known, a Kalman filter can estimate the state of a dynamic system from a series of noisy measurements.
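Since the recursion only needs the known measurement variance, it fits in a few lines. Below is a minimal scalar sketch (deliberately not the pykalman API used later in this post) that recovers a slowly drifting level from noisy measurements; all names and parameter values are illustrative:

```python
import numpy as np

def kalman_1d(measurements, meas_var, proc_var=1e-5, x0=0.0, p0=1.0):
    """Scalar Kalman filter: track a slowly drifting level in noisy data.

    meas_var : known measurement-noise variance R
    proc_var : process-noise variance Q (how fast the state may drift)
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + proc_var              # predict: uncertainty grows by Q
        k = p / (p + meas_var)        # Kalman gain
        x = x + k * (z - x)           # update: blend prediction and measurement
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

rng = np.random.default_rng(0)
noisy = 5.0 + rng.normal(0, 1.0, 500)     # true level 5.0, noise variance 1.0
est = kalman_1d(noisy, meas_var=1.0)      # filtered estimate converges toward the true level
```

The gain k decides how much each new observation moves the estimate; the dynamic-beta filter used later in this post is the same recursion with a two-dimensional state (slope, intercept).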

A brief overview of the pair-trading strategy:

The basic idea of pair trading has already been explained well by others in the community, but one limitation remains: the hedge ratio is still a fixed value, and strategies that look great in-sample often disappoint out-of-sample. Since the optimal hedge ratio should evolve over time, making the hedge ratio dynamic is the main goal of this study.
According to material on overseas quant sites, using a Kalman filter to adjust the hedge ratio dynamically is a popular approach at the moment, and that is the route followed here.

Before building the strategy, the first step is to find stock pairs with a cointegration relationship. One point worth stressing: a high correlation coefficient has no necessary connection to the strength of cointegration. This is a common misconception, and concrete examples are given below.

Once cointegrated pairs are found, the next step is to use the Kalman filter to estimate state_means, a two-column array whose first column is the slope and whose second is the intercept. Readers interested in the inner workings of the Kalman filter can dig deeper on their own; no detailed derivation is given here, since the focus is on applying the method to pair trading.
To make the comparison concrete, both static and dynamic hedging are applied below; how the two actually perform will need backtests in the strategy section, which will be added later.

Now for the demo.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.tsa.stattools as sts
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import coint
from pykalman import KalmanFilter

Before pairing, we first need to find pairs with a cointegration relationship, and then build the strategy on top of them.
Many readers may have assumed that a highly correlated pair automatically makes a strongly cointegrated trading candidate. That is wrong, as the following two examples show.

Instance 1 : Correlation without Cointegration

In [2]:
np.random.seed(100)
In [3]:
X = np.random.normal(1,1,100)
Y = np.random.normal(2,1,100)

X_cum = pd.Series(np.cumsum(X), name='X')
Y_cum = pd.Series(np.cumsum(Y), name='Y')

pd.concat([X_cum, Y_cum], axis=1).plot();
In [4]:
_, pvalue, _ = coint(X_cum, Y_cum)
corr = np.corrcoef(X_cum, Y_cum)[0,1]
print ('cointegration pvalue = ',pvalue)
print ('correlation coefficient = ',corr)
cointegration pvalue =  0.16104204814
correlation coefficient =  0.990499430003

From the results above, the cointegration p-value (0.16) is well above conventional significance levels, so we cannot reject the null hypothesis.
The null hypothesis is that there is no cointegration; the alternative hypothesis is that a cointegrating relationship exists. If the p-value is small, below a critical size, then we can reject the hypothesis that there is no cointegrating relationship.

Yet the correlation coefficient is 0.99, essentially 1, so the two series are highly correlated.

Instance 2 : Cointegration without Correlation

In [5]:
X = pd.Series(2*np.random.normal(0,1,1000)) + 50
X_wave = X.copy()
In [6]:
X_wave[:100] = 60
X_wave[100:200] = 40
X_wave[200:300] = 60
X_wave[300:400] = 40
X_wave[400:500] = 60
X_wave[500:600] = 40
X_wave[600:700] = 60
X_wave[700:800] = 40
X_wave[800:900] = 60
X_wave[900:1000] = 40

X.plot()
X_wave.plot()
plt.ylim([20,80])
Out[6]:
(20, 80)
In [7]:
_, pvalue, _ = coint(X, X_wave)
corr = np.corrcoef(X, X_wave)[0,1]
print ('cointegration pvalue = ',pvalue)
print ('correlation coefficient = ',corr)
cointegration pvalue =  0.0
correlation coefficient =  0.00875576208136

The cointegration p-value is zero, so we reject the null of no cointegration; yet the correlation coefficient is close to zero, meaning the two series are almost uncorrelated.

These two artificial examples show that cointegration and correlation are not the same thing. Keep clear of this trap: the strength of a cointegration relationship cannot be judged simply by how high the correlation is.

Next we design the pair-trading strategy on real market prices.

Because many stocks were suspended during the market crash, the candidate pool needs to be fairly large; here we select 50.

In [8]:
# pick stocks from a single sector, here Financials, since most financial stocks are liquid blue chips, well suited to pair trading
selected_stocks = sector('Financials')
In [9]:
# fetch stock prices
start = '2014-01-01'
end = '2015-12-31'
stock_df = get_price(selected_stocks, start_date=start, end_date=end, fields='ClosingPx').dropna(axis=1)
stock_df.tail()
Out[9]:
(flattened in extraction: closing prices for the last five trading days of 2015, 2015-12-25 through 2015-12-31, one column per ticker from 000001.XSHE to 601998.XSHG)
5 rows × 185 columns

Since what matters when stocks serve as pair-trading candidates is whether their returns are cointegrated, we switch from prices to returns.

In [10]:
stock_ret_df = np.log(stock_df).diff().dropna()
stock_ret_df.head()
Out[10]:
(flattened in extraction: daily log returns for 2014-01-03 through 2014-01-09, one column per ticker)
5 rows × 185 columns

If a stock's price is unchanged on more than 5 days in total, i.e. zero returns, we treat it as having been suspended during the period; since suspended stocks tend to behave abnormally after resuming trading, such stocks are excluded.

In [11]:
# drop stocks with too many zero-return (suspension) days

selected_stocks = list(stock_ret_df.columns.values)
for stock in selected_stocks[:]:              # iterate over a copy, since items are removed inside the loop
    s = stock_ret_df[stock_ret_df[stock] != 0]
    if len(s) < len(stock_ret_df) - 5:        # more than 5 zero-return days
        selected_stocks.remove(stock)
print ('selected stocks :', selected_stocks)
selected stocks : ['000728.XSHE', '600773.XSHG', '601601.XSHG', '601628.XSHG']
In [12]:
# function that finds cointegrated pairs among the stocks above
def find_cointegrated_pairs(stock_list, stock_ret_df):    # stock pool and the returns DataFrame
    n = len(stock_list)
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))    # note the initial values: p-values start at 1
    pairs = []

    for i in range(n):
        for j in range(i+1, n):
            S1 = stock_ret_df[stock_list[i]]
            S2 = stock_ret_df[stock_list[j]]
            result = coint(S1, S2)
            score = result[0]         # t-statistic of the cointegration test (optional)
            pvalue = result[1]        # p-value of the cointegration test
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((stock_list[i], stock_list[j]))

    return score_matrix, pvalue_matrix, pairs
In [13]:
# find the stock pairs whose returns qualify for pairing
# call the function above
scores, pvalues, pairs = find_cointegrated_pairs(selected_stocks, stock_ret_df)
print ('cointegrated pairs : ',pairs)
cointegrated pairs :  [('000728.XSHE', '600773.XSHG'), ('000728.XSHE', '601601.XSHG'), ('000728.XSHE', '601628.XSHG'), ('600773.XSHG', '601601.XSHG'), ('600773.XSHG', '601628.XSHG'), ('601601.XSHG', '601628.XSHG')]
It is easy to see that many pairs within the same sector are cointegrated; a plot makes this more intuitive.

We pick one of these pairs at random for verification and look at their return paths.

In [14]:
# take the first pair (the four selected stocks are pairwise cointegrated, so any pair would do)
stock_1 = pairs[0][0]    # first stock of the pair
stock_2 = pairs[0][1]    # second stock of the pair
_, pvalue, _ = coint(stock_ret_df[stock_1], stock_ret_df[stock_2])
print (pvalue)
0.0

The p-value is (numerically) zero, indicating strong cointegration.

In [15]:
# plot the price paths
plt.plot(stock_df.index, stock_df[stock_1], label=pairs[0][0])
plt.plot(stock_df.index, stock_df[stock_2], label=pairs[0][1])
plt.legend(loc='best')
plt.xlabel('Date')
plt.ylabel('Prices');
In [16]:
# plot the return paths of the two stocks
plt.plot(stock_ret_df.index, stock_ret_df[pairs[0][0]],label=pairs[0][0])
plt.plot(stock_ret_df.index, stock_ret_df[pairs[0][1]],label=pairs[0][1])
plt.legend(loc='best')
plt.xlabel('Date')
plt.ylabel('Log Return Rate');

Since cointegration is not easy to see from the raw return plot, we add a price or cumulative-return chart.

In [17]:
stock_ret_df.iloc[0, :] = 1    # rebase: overwrite the first row so the cumulative series starts at 1
stock_ret_df.head()
Out[17]:
(flattened in extraction: same returns table as above, except the 2014-01-03 row is now all 1.000000)

5 rows × 185 columns

In [18]:
# plot the cumulative return paths of the two stocks

cum_ret = stock_ret_df[list(pairs[0])].cumsum()
plt.plot(stock_ret_df.index, cum_ret.iloc[:, 0], 'k-', label=pairs[0][0])
plt.plot(stock_ret_df.index, cum_ret.iloc[:, 1], 'r-', label=pairs[0][1])
plt.legend(loc='best')
plt.xlabel('Date')
plt.ylabel('Cumulative Return Rate');
In [19]:
# spread between the two stocks' cumulative returns
abs_spread = cum_ret.diff(axis=1)
del abs_spread[stock_1]                    # diff leaves the first column as NaN; drop it
abs_spread.rename(columns={stock_2: 'spread'}, inplace=True)
print (abs_spread.head())
plt.plot(abs_spread.index, abs_spread, label='spread')
plt.xlabel('Date')
plt.ylabel('Absolute Spread')
plt.legend(loc='best');
               spread
MDEntryDate          
2014-01-03   0.000000
2014-01-06  -0.031024
2014-01-07  -0.032629
2014-01-08  -0.071846
2014-01-09  -0.073031
In [20]:
# the raw spread level has little statistical meaning by itself; convert it to a z-score for easier interpretation
def zscores(series):
    return (series - series.mean())/np.std(series)
In [21]:
print (zscores(abs_spread).head())
spread_series = zscores(abs_spread)

spread_series.plot(lw=2,color='k')
plt.axhline(np.mean(spread_series.values), color='k')
plt.axhline(1, color='r', linestyle='--')          # thresholds at one standard deviation
plt.axhline(-1, color='r', linestyle='--');
               spread
MDEntryDate          
2014-01-03   0.791311
2014-01-06   0.701482
2014-01-07   0.696833
2014-01-08   0.583283
2014-01-09   0.579851

Static Hedging

First let's look at the traditional static-hedge version of the pair trade, i.e. one that uses a constant beta.
Before hedging statically, we estimate the regression coefficients (alpha and beta) between the two stocks.
Since the relationship of interest is between returns, we again regress at the return level.

In [22]:
# define a linear regression function
from statsmodels import regression
import statsmodels.api as sm

stock_ret_df = stock_ret_df.iloc[1:, :]    # drop the first row (it was overwritten with 1 above)

ret_1 = stock_ret_df[stock_1].values
ret_2 = stock_ret_df[stock_2].values

def linreg(x, y):
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y, x).fit()
    return model.params[0], model.params[1]    # (alpha, beta)

alpha, beta = linreg(ret_1, ret_2)
print ('alpha: ',alpha)
print ('beta: ',beta)
alpha:  0.000592403871854
beta:  0.425879354528
In [23]:
stock_ret_df.head()
Out[23]:
(flattened in extraction: first 5 rows of the returns DataFrame after dropping the rebased first row, 2014-01-06 through 2014-01-10)

5 rows × 185 columns

In [24]:
# plot the fitted line implied by the estimated alpha and beta
new_ret_1 = np.linspace(ret_1.min(), ret_1.max(), 100)
ret_2_hat = new_ret_1 * beta + alpha

plt.scatter(ret_1, ret_2, alpha=0.3)
plt.plot(new_ret_1, ret_2_hat, alpha=0.9,color='r')
plt.xlabel(pairs[0][0])
plt.ylabel(pairs[0][1]);

Dynamic Hedging

Next, let's see what happens when beta is adjusted dynamically with a Kalman filter.

In [25]:
from pykalman import KalmanFilter

# first, a scatter plot of the two assets' prices through time (color encodes the date)
cm = plt.cm.get_cmap('jet')
dates = [str(p.date()) for p in stock_df[::len(stock_df)//10].index]
colors = np.linspace(0.1, 1, len(stock_df.index))
sc = plt.scatter(stock_df[stock_1],stock_df[stock_2], s=30, c=colors, cmap=cm, edgecolor='k', alpha=0.7)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in stock_df[::len(stock_df)//9].index])
plt.xlabel(stock_1)
plt.ylabel(stock_2);
In [26]:
# the analogous scatter plot of the two assets' returns through time
cm = plt.cm.get_cmap('jet')
dates = [str(p.date()) for p in stock_ret_df[::len(stock_ret_df)//10].index]
colors = np.linspace(0.1, 1, len(stock_ret_df.index))
sc = plt.scatter(stock_ret_df[stock_1],stock_ret_df[stock_2], s=30, c=colors, cmap=cm, edgecolor='k', alpha=0.7)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in stock_ret_df[::len(stock_ret_df)//9].index])
plt.xlabel(stock_1)
plt.ylabel(stock_2);
In [27]:
# now estimate beta dynamically with the Kalman filter
delta = 1e-3                                  # controls how fast (slope, intercept) may drift per step
trans_cov = delta / (1 - delta) * np.eye(2)   # transition (process) covariance
obs_mat = np.vstack([stock_ret_df[stock_1], np.ones(stock_ret_df[stock_1].shape)]).T[:, np.newaxis]   # each row is [ret_1_t, 1]
In [28]:
kf = KalmanFilter(n_dim_obs=1, n_dim_state=2,       # ret_2 is 1-dimensional, (alpha, beta) is 2-dimensional
                 initial_state_mean=np.zeros(2),
                 initial_state_covariance=np.ones((2,2)),
                 transition_matrices=np.eye(2),
                 observation_matrices=obs_mat,
                 observation_covariance=0.01,
                 transition_covariance=trans_cov)
In [29]:
state_means, state_covs = kf.filter(stock_ret_df[stock_2])
pd.DataFrame(dict(slope=state_means[:, 0], intercept=state_means[:, 1]), index=stock_ret_df.index).plot(subplots=True)
plt.tight_layout()
print ('alpha :',np.mean(state_means[:,1]))
print ('beta :',np.mean(state_means[:,0]));
alpha : 0.00137472851765
beta : 0.178758327491

Compared with the static hedge, the dynamic alpha (intercept) also stays near 0, but the beta differs considerably. The reason is that beta is updated continuously, so recent observations carry more weight; and if the initial value is set too far off (here it is 0), the average of the path is noticeably distorted, while the beta at the final update is quite close to the static value.
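For readers who want to see what pykalman is doing under the hood: the model above is just a regression whose coefficients follow a random walk, and the same filter can be written in plain numpy, which also makes it easy to form the dynamic spread, i.e. the residual under the current filtered state. A self-contained sketch on synthetic data with a drifting true beta (all scales and parameters here are illustrative, not the actual stock pair):

```python
import numpy as np

def kalman_dynamic_beta(x, y, delta=1e-3, obs_var=0.01):
    """Recursive regression y_t = beta_t * x_t + alpha_t + noise,
    with a random-walk state (beta_t, alpha_t), as in the pykalman setup."""
    n = len(x)
    Q = delta / (1 - delta) * np.eye(2)   # transition (process) covariance
    state = np.zeros(2)                   # initial mean of (beta, alpha)
    P = np.ones((2, 2))                   # initial state covariance
    means = np.zeros((n, 2))
    for t in range(n):
        H = np.array([x[t], 1.0])         # observation row: y_t = [x_t, 1] @ state
        P = P + Q                         # predict: identity transition, add Q
        innov = y[t] - H @ state          # innovation (one-step residual)
        S = H @ P @ H + obs_var           # innovation variance
        K = P @ H / S                     # Kalman gain
        state = state + K * innov         # update the state estimate
        P = P - np.outer(K, H) @ P        # update the state covariance
        means[t] = state
    return means

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 800)
true_beta = np.linspace(0.2, 0.6, 800)    # hedge ratio drifts upward over time
y = true_beta * x + 0.05 + rng.normal(0, 0.1, 800)

means = kalman_dynamic_beta(x, y)
# dynamic spread: residual under the *current* filtered (beta, alpha)
spread = y - (means[:, 0] * x + means[:, 1])
```

Unlike the static OLS beta, the filtered slope tracks the drift, which is exactly why its full-sample average need not match the static estimate while its final value does.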

To explain the method and principles of the Kalman filter more thoroughly, a separate research post will follow.

Finally, the performance of the two hedging approaches will be posted in the strategy section. Stay tuned :P