Sorsogon. Step 1.b Micro-level Income model

In [1]:
import datetime; print(datetime.datetime.now())
2017-10-25 14:37:54.542583

Notebook Abstract:

A simple micro-level income model. This notebook presents the income model defined for the simulation.

The model computes an income level for each household in the proxy sample. The income drivers and their coefficients are estimated from the Philippines Family Income and Expenditure Survey 2009.

Prior income model

In [19]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from urbanmetabolism._scripts.micro import compute_categories, change_index
In [20]:
income_data = pd.read_csv('data/income.csv', index_col=0)
formula = (
    "Total_Family_Income ~ "
    "Family_Size + C(HH_head_Sex) + HH_head_Age + C(Education) + C(Urbanity)"
)
In [21]:
income_data.head()
Out[21]:
   Family_Size  HH_head_Sex  HH_head_Age  Education  Electricity_expenditure  Water_expenditure  Total_Family_Income  Urbanity
0          5.5            1           52        2.0                     1500                  0         23939.666667         0
1          7.5            1           70        1.0                     1608                  0         16078.166667         0
2          3.0            1           49        2.0                     8880                  0         20925.000000         0
3          2.0            2           51        1.0                      900               2190          9932.333333         0
4          6.0            1           36        1.0                     3360                  0         13589.500000         0
In [22]:
# No weights are passed, so WLS uses unit weights and the fit matches OLS.
model_inc = sm.WLS.from_formula(formula, income_data)
model_results_inc = model_inc.fit()
In [23]:
model_results_inc.summary()
Out[23]:
WLS Regression Results

Dep. Variable:     Total_Family_Income    R-squared:            0.315
Model:             WLS                    Adj. R-squared:       0.315
Method:            Least Squares          F-statistic:          1908.
Date:              Mon, 23 Oct 2017       Prob (F-statistic):   0.00
Time:              16:37:05               Log-Likelihood:       -3.5601e+05
No. Observations:  33208                  AIC:                  7.120e+05
Df Residuals:      33199                  BIC:                  7.121e+05
Df Model:          8
Covariance Type:   nonrobust

                            coef    std err         t    P>|t|      [0.025      0.975]
Intercept              1147.6640    313.997     3.655    0.000     532.218    1763.110
C(HH_head_Sex)[T.2]     919.0121    161.503     5.690    0.000     602.460    1235.565
C(Education)[T.2.0]    6023.8625    140.904    42.751    0.000    5747.685    6300.040
C(Education)[T.3.0]    1.196e+04    217.209    55.058    0.000    1.15e+04    1.24e+04
C(Education)[T.4.0]    1.873e+04    282.176    66.368    0.000    1.82e+04    1.93e+04
C(Education)[T.5.0]    1.679e+04    742.048    22.624    0.000    1.53e+04    1.82e+04
C(Urbanity)[T.1]       7105.2245    127.941    55.535    0.000    6854.455    7355.994
Family_Size            1666.8464     29.035    57.409    0.000    1609.937    1723.756
HH_head_Age             116.5759      4.681    24.902    0.000     107.400     125.752

Omnibus:         3597.783    Durbin-Watson:       1.606
Prob(Omnibus):   0.000       Jarque-Bera (JB):    4994.573
Skew:            0.865       Prob(JB):            0.00
Kurtosis:        3.786       Cond. No.            642.
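
To make the summary concrete, the fitted coefficients can be applied by hand to a single household. The following is a minimal sketch, not part of the original notebook: the helper `predict_income` is hypothetical, and the coefficients are copied from the summary as printed (the Education levels 3.0 to 5.0 are rounded to four significant digits there). Sex code 1, Education level 1.0 and Urbanity 0 are the reference categories, so they contribute nothing.

```python
# Coefficients copied from the regression summary (rounded as printed).
INTERCEPT = 1147.6640
COEF_SEX_2 = 919.0121          # C(HH_head_Sex)[T.2]
COEF_URBAN_1 = 7105.2245       # C(Urbanity)[T.1]
COEF_FAMILY_SIZE = 1666.8464   # per household member
COEF_AGE = 116.5759            # per year of age of the household head
COEF_EDUCATION = {             # level 1.0 is the reference category
    1.0: 0.0,
    2.0: 6023.8625,
    3.0: 1.196e4,
    4.0: 1.873e4,
    5.0: 1.679e4,
}


def predict_income(family_size, sex, age, education, urbanity):
    """Hypothetical helper: point prediction from the linear model."""
    income = INTERCEPT
    income += COEF_SEX_2 if sex == 2 else 0.0
    income += COEF_URBAN_1 if urbanity == 1 else 0.0
    income += COEF_EDUCATION[education]
    income += COEF_FAMILY_SIZE * family_size
    income += COEF_AGE * age
    return income


# Urban household of four, head with sex code 1, age 40, education level 2.0.
example_income = predict_income(4, 1, 40, 2.0, 1)
print(round(example_income, 2))
```

With an R-squared of 0.315 the model explains roughly a third of the variance in income, so a point prediction like this one is only a rough central estimate.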
In [24]:
params_inc = change_index(model_results_inc.params)
bse_inc = change_index(model_results_inc.bse)
inc = pd.concat([params_inc, bse_inc], axis=1)
inc.columns = ['co_mu', 'co_sd']
inc = compute_categories(inc)
In [25]:
# Bernoulli parameter p: share of households with Urbanity == 1 and with HH_head_Sex == 2.
inc.loc['Urbanity', 'p'] = (income_data.Urbanity == 1).sum() / income_data.shape[0]
inc.loc['Sex', 'p'] = (income_data.HH_head_Sex == 2).sum() / income_data.shape[0]
In [26]:
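# Distribution parameters mu and sd are left undefined at this step.
inc.loc[:, 'mu'] = np.nan
inc.loc[:, 'sd'] = np.nan
# The intercept is deterministic: store its value in p and clear its coefficient columns.
inc.loc['Intercept', 'p'] = inc.loc['Intercept', 'co_mu']
inc.loc['Intercept', ['co_mu', 'co_sd']] = np.nan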
In [27]:
inc.loc['Education','dis'] = 'Categorical'
inc.loc['Urbanity', 'dis'] = 'Bernoulli'
inc.loc['Sex', 'dis'] = 'Bernoulli'
inc.loc['FamilySize', 'dis'] = 'Poisson'
inc.loc['Intercept', 'dis'] = 'Deterministic'
inc.loc['Age', 'dis'] = 'Normal'
In [28]:
inc.loc[:,'ub'] = np.nan
inc.loc[:,'lb'] = np.nan
inc.loc['FamilySize', 'lb'] = 1
inc.loc['FamilySize', 'ub'] = 10
inc.loc['Age', 'ub'] = 100
inc.loc['Age', 'lb'] = 18
In [29]:
inc.index = ['i_'+i for i in inc.index]
In [30]:
inc.to_csv('data/table_inc.csv')
In [31]:
inc
Out[31]:
              co_mu                                              co_sd                                              p            mu   sd   dis            ub     lb
i_Intercept   NaN                                                NaN                                                1147.663992  NaN  NaN  Deterministic  NaN    NaN
i_Sex         919.012                                            161.503                                            0.193718     NaN  NaN  Bernoulli      NaN    NaN
i_Urbanity    7105.22                                            127.941                                            0.403005     NaN  NaN  Bernoulli      NaN    NaN
i_FamilySize  1666.85                                            29.0348                                            NaN          NaN  NaN  Poisson        10.0   1.0
i_Age         116.576                                            4.68139                                            NaN          NaN  NaN  Normal         100.0  18.0
i_Education   1.0,6023.86254599,11959.091528,18727.4606703,1...  1e-10,140.904404522,217.208790314,282.17614554...  NaN          NaN  NaN  Categorical    NaN    NaN

The income model is defined as a table model. The table holds all the information the simulation model needs to construct a proxy sample.

Each row of the table describes one model variable. The column co_mu holds the coefficient used to estimate income, and co_sd holds its standard deviation (the standard error from the regression). The columns p, mu and sd hold the parameters of the variable's distribution, and the column dis names the distribution type. The values ub and lb give the distribution an upper and a lower bound.
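
As an illustration of how such a table could drive a simulation, the sketch below draws a small proxy sample and computes an income for each household. It is a minimal sketch under stated assumptions, not the urbanmetabolism implementation: the sample size, the Poisson rate and the Normal parameters for age are placeholders (the table leaves mu and sd as NaN at this step), bounds are enforced by simple clipping rather than true truncation, coefficients are held fixed at co_mu (co_sd is ignored), and Education is left at its reference level.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000  # hypothetical number of proxy households

# Values copied from the table above.
intercept = 1147.663992
co_mu = {"Sex": 919.012, "Urbanity": 7105.22,
         "FamilySize": 1666.85, "Age": 116.576}
p = {"Sex": 0.193718, "Urbanity": 0.403005}
bounds = {"FamilySize": (1, 10), "Age": (18, 100)}  # (lb, ub)

# ASSUMPTION: placeholder distribution parameters, for illustration only.
family_size_rate = 4.5
age_mu, age_sd = 50.0, 15.0

# Draw each variable from the distribution named in the 'dis' column.
sex = rng.binomial(1, p["Sex"], n)            # Bernoulli
urb = rng.binomial(1, p["Urbanity"], n)       # Bernoulli
fam = np.clip(rng.poisson(family_size_rate, n), *bounds["FamilySize"])
age = np.clip(rng.normal(age_mu, age_sd, n), *bounds["Age"])

# Linear income model with the coefficients fixed at co_mu.
income = (intercept
          + co_mu["Sex"] * sex
          + co_mu["Urbanity"] * urb
          + co_mu["FamilySize"] * fam
          + co_mu["Age"] * age)

print(income.mean())
```

Clipping piles probability mass onto the bounds; a rejection loop or a truncated distribution would respect lb and ub without that distortion.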