Sorsogon. Step 1.b Micro-level Income model

In [1]:

import datetime; print(datetime.datetime.now())

2017-10-25 14:37:54.542583

Notebook Abstract:

This notebook defines a simple micro-level income model for the simulation.

The model computes an income level for each household in the proxy sample. The income drivers and their coefficients are estimated from the Philippines Family Income and Expenditure Survey 2009.

Prior income model

In [19]:

import statsmodels.api as sm
import pandas as pd
import numpy as np
from urbanmetabolism._scripts.micro import compute_categories, change_index

In [20]:

income_data = pd.read_csv('data/income.csv', index_col=0)
formula = (
    "Total_Family_Income ~ "
    "Family_Size + C(HH_head_Sex) + HH_head_Age + C(Education) + C(Urbanity)"
)
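
Written out, the formula above corresponds to a linear model in which each `C(...)` term is expanded into dummy variables. The notation below (index h for households, the β symbols, the indicator function) is introduced here for illustration and is not part of the original notebook; the non-reference levels are the ones listed in the regression summary of this notebook.

```latex
\begin{aligned}
y_h ={}& \beta_0 + \beta_F\,\mathrm{FamilySize}_h + \beta_A\,\mathrm{Age}_h
        + \beta_S\,\mathbb{1}[\mathrm{Sex}_h = 2] \\
      & + \sum_{k=2}^{5} \beta_{E,k}\,\mathbb{1}[\mathrm{Education}_h = k]
        + \beta_U\,\mathbb{1}[\mathrm{Urbanity}_h = 1] + \varepsilon_h
\end{aligned}
```

Here y_h is `Total_Family_Income` for household h, and the omitted level of each categorical variable is absorbed into the intercept.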

In [21]:

income_data.head()

Out[21]:

	Family_Size	HH_head_Sex	HH_head_Age	Education	Electricity_expenditure	Water_expenditure	Total_Family_Income
0	5.5	1	52	2.0	1500	0	23939.666667
1	7.5	1	70	1.0	1608	0	16078.166667
2	3.0	1	49	2.0	8880	0	20925.000000
3	2.0	2	51	1.0	900	2190	9932.333333
4	6.0	1	36	1.0	3360	0	13589.500000

In [22]:

# No weights are passed, so WLS uses unit weights and is equivalent to OLS.
model_inc = sm.WLS.from_formula(formula, income_data)
model_results_inc = model_inc.fit()

In [23]:

model_results_inc.summary()

Out[23]:

WLS Regression Results
Dep. Variable:	Total_Family_Income	R-squared:	0.315
Model:	WLS	Adj. R-squared:	0.315
Method:	Least Squares	F-statistic:	1908.
Date:	Mon, 23 Oct 2017	Prob (F-statistic):	0.00
Time:	16:37:05	Log-Likelihood:	-3.5601e+05
No. Observations:	33208	AIC:	7.120e+05
Df Residuals:	33199	BIC:	7.121e+05
Df Model:	8
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	1147.6640	313.997	3.655	0.000	532.218	1763.110
C(HH_head_Sex)[T.2]	919.0121	161.503	5.690	0.000	602.460	1235.565
C(Education)[T.2.0]	6023.8625	140.904	42.751	0.000	5747.685	6300.040
C(Education)[T.3.0]	1.196e+04	217.209	55.058	0.000	1.15e+04	1.24e+04
C(Education)[T.4.0]	1.873e+04	282.176	66.368	0.000	1.82e+04	1.93e+04
C(Education)[T.5.0]	1.679e+04	742.048	22.624	0.000	1.53e+04	1.82e+04
C(Urbanity)[T.1]	7105.2245	127.941	55.535	0.000	6854.455	7355.994
Family_Size	1666.8464	29.035	57.409	0.000	1609.937	1723.756
HH_head_Age	116.5759	4.681	24.902	0.000	107.400	125.752

Omnibus:	3597.783	Durbin-Watson:	1.606
Prob(Omnibus):	0.000	Jarque-Bera (JB):	4994.573
Skew:	0.865	Prob(JB):	0.00
Kurtosis:	3.786	Cond. No.	642.
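
To make the coefficients concrete, the following minimal sketch plugs the values printed in the summary above into the linear predictor for one household. The function name `predict_income` and the example household are hypothetical and are not part of the notebook or of `urbanmetabolism`; the coefficients are copied at the rounded precision shown in the summary.

```python
# Coefficients copied from the WLS summary (rounded as printed there).
INTERCEPT = 1147.6640
FAMILY_SIZE = 1666.8464
HEAD_AGE = 116.5759
SEX_2 = 919.0121          # C(HH_head_Sex)[T.2]
URBANITY_1 = 7105.2245    # C(Urbanity)[T.1]
# Education level 1 is the reference category and contributes nothing.
EDUCATION = {2: 6023.8625, 3: 1.196e+04, 4: 1.873e+04, 5: 1.679e+04}


def predict_income(family_size, head_sex, head_age, education, urbanity):
    """Evaluate the fitted linear predictor for a single household."""
    income = INTERCEPT
    income += FAMILY_SIZE * family_size
    income += HEAD_AGE * head_age
    income += SEX_2 if head_sex == 2 else 0.0
    income += EDUCATION.get(education, 0.0)
    income += URBANITY_1 if urbanity == 1 else 0.0
    return income


# Hypothetical household: 4 members, HH_head_Sex = 1, head aged 40,
# Education = 2, Urbanity = 1.
example = predict_income(4, 1, 40, 2, 1)
print(round(example, 2))  # about 25607.17
```

The example sums the intercept, 4 × 1666.8464, 40 × 116.5759, the Education level 2 effect and the Urbanity effect; the sex term drops out because level 1 is the reference category.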

In [24]:

params_inc = change_index(model_results_inc.params)
bse_inc = change_index(model_results_inc.bse)
inc = pd.concat([params_inc, bse_inc], axis=1)
inc.columns = ['co_mu', 'co_sd']
inc = compute_categories(inc)

In [25]:

# Share of households in the non-reference category, stored as the probability p.
inc.loc['Urbanity', 'p'] = (income_data.Urbanity == 1).sum() / income_data.shape[0]
inc.loc['Sex', 'p'] = (income_data.HH_head_Sex == 2).sum() / income_data.shape[0]

In [26]:

inc.loc[:, 'mu'] = np.nan
inc.loc[:, 'sd'] = np.nan
# The intercept is a constant: keep its value in p and clear the coefficient columns.
inc.loc['Intercept', 'p'] = inc.loc['Intercept', 'co_mu']
inc.loc['Intercept', ['co_mu', 'co_sd']] = np.nan

In [27]:

inc.loc['Education','dis'] = 'Categorical'
inc.loc['Urbanity', 'dis'] = 'Bernoulli'
inc.loc['Sex', 'dis'] = 'Bernoulli'
inc.loc['FamilySize', 'dis'] = 'Poisson'
inc.loc['Intercept', 'dis'] = 'Deterministic'
inc.loc['Age', 'dis'] = 'Normal'

In [28]:

inc.loc[:,'ub'] = np.nan
inc.loc[:,'lb'] = np.nan
inc.loc['FamilySize', 'lb'] = 1
inc.loc['FamilySize', 'ub'] = 10
inc.loc['Age', 'ub'] = 100
inc.loc['Age', 'lb'] = 18

In [29]:

inc.index = ['i_'+i for i in inc.index]

In [30]:

inc.to_csv('data/table_inc.csv')

In [31]:

inc

Out[31]:

	co_mu	co_sd	p	mu	sd	dis	ub	lb
i_Intercept	NaN	NaN	1147.663992	NaN	NaN	Deterministic	NaN	NaN
i_Sex	919.012	161.503	0.193718	NaN	NaN	Bernoulli	NaN	NaN
i_Urbanity	7105.22	127.941	0.403005	NaN	NaN	Bernoulli	NaN	NaN
i_FamilySize	1666.85	29.0348	NaN	NaN	NaN	Poisson	10.0	1.0
i_Age	116.576	4.68139	NaN	NaN	NaN	Normal	100.0	18.0
i_Education	1.0,6023.86254599,11959.091528,18727.4606703,1...	1e-10,140.904404522,217.208790314,282.17614554...	NaN	NaN	NaN	Categorical	NaN	NaN

The income model is defined as a table model. This table contains all the information the simulation model requires to construct a proxy sample.

Each row of the table describes one income driver. The column co_mu holds the coefficient used to estimate income and co_sd its corresponding standard deviation (taken here from the standard errors of the regression). The columns p, mu and sd parametrize the distribution of the driver, and the distribution type is given in column dis. The columns ub and lb give the distribution an upper and a lower bound.
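
As a rough illustration of how such a table could drive a simulation, the sketch below draws one household from the distributions named in dis and sums the coefficient-weighted draws. It is not the `urbanmetabolism` implementation. Several things are assumptions made only for this example: the mu and sd values (which are NaN in the table above), enforcing the bounds by clipping, ignoring co_sd, and leaving out the Categorical Education row. The helper names are hypothetical.

```python
import math
import random

# Rows mirror the table above; Education (Categorical) is omitted for brevity.
# mu and sd are NaN in the table, so the values marked "assumed" are placeholders.
TABLE = {
    "i_Intercept": {"dis": "Deterministic", "p": 1147.663992},
    "i_Sex": {"dis": "Bernoulli", "co_mu": 919.012, "p": 0.193718},
    "i_Urbanity": {"dis": "Bernoulli", "co_mu": 7105.22, "p": 0.403005},
    "i_FamilySize": {"dis": "Poisson", "co_mu": 1666.85,
                     "mu": 5.0, "lb": 1, "ub": 10},           # mu assumed
    "i_Age": {"dis": "Normal", "co_mu": 116.576,
              "mu": 45.0, "sd": 15.0, "lb": 18, "ub": 100},   # mu, sd assumed
}


def sample_poisson(rng, lam):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def draw(rng, row):
    """Draw one value for a table row and clip it to [lb, ub] (an assumption)."""
    dis = row["dis"]
    if dis == "Bernoulli":
        value = 1.0 if rng.random() < row["p"] else 0.0
    elif dis == "Poisson":
        value = float(sample_poisson(rng, row["mu"]))
    elif dis == "Normal":
        value = rng.gauss(row["mu"], row["sd"])
    else:
        raise ValueError("unsupported distribution: " + dis)
    lower = row.get("lb", -math.inf)
    upper = row.get("ub", math.inf)
    return min(max(value, lower), upper)


def simulate_household(rng):
    """Return the sampled drivers and the resulting income for one household."""
    income = TABLE["i_Intercept"]["p"]  # the intercept value is stored in p
    values = {}
    for name, row in TABLE.items():
        if row["dis"] == "Deterministic":
            continue
        values[name] = draw(rng, row)
        income += row["co_mu"] * values[name]
    return values, income


rng = random.Random(0)
values, income = simulate_household(rng)
print(values, round(income, 2))
```

Clipping piles probability mass onto the bounds; resampling until a draw falls inside the bounds is an alternative if that matters. The coefficients could also be drawn from a distribution with mean co_mu and spread co_sd instead of being held fixed.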