Sorsogon. Step 1.b Micro-level Income model
In [1]:
import datetime; print(datetime.datetime.now())
2017-10-25 14:37:54.542583
Notebook Abstract:
A simple micro-level income model. This notebook presents the income model defined for the simulation.
The model computes the income level of each household in the proxy sample. The income drivers and their coefficients are estimated from the Philippines Family Income and Expenditure Survey 2009.
Prior income model
In [19]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from urbanmetabolism._scripts.micro import compute_categories, change_index
In [20]:
income_data = pd.read_csv('data/income.csv', index_col=0)
formula = "Total_Family_Income ~\
Family_Size + C(HH_head_Sex) + HH_head_Age + C(Education) + C(Urbanity)"
In [21]:
income_data.head()
Out[21]:
| | Family_Size | HH_head_Sex | HH_head_Age | Education | Electricity_expenditure | Water_expenditure | Total_Family_Income | Urbanity |
---|---|---|---|---|---|---|---|---|
0 | 5.5 | 1 | 52 | 2.0 | 1500 | 0 | 23939.666667 | 0 |
1 | 7.5 | 1 | 70 | 1.0 | 1608 | 0 | 16078.166667 | 0 |
2 | 3.0 | 1 | 49 | 2.0 | 8880 | 0 | 20925.000000 | 0 |
3 | 2.0 | 2 | 51 | 1.0 | 900 | 2190 | 9932.333333 | 0 |
4 | 6.0 | 1 | 36 | 1.0 | 3360 | 0 | 13589.500000 | 0 |
In [22]:
# No weights are passed, so WLS uses unit weights and the fit is equivalent to OLS.
model_inc = sm.WLS.from_formula(formula, income_data)
model_results_inc = model_inc.fit()
In [23]:
model_results_inc.summary()
Out[23]:
Dep. Variable: | Total_Family_Income | R-squared: | 0.315 |
---|---|---|---|
Model: | WLS | Adj. R-squared: | 0.315 |
Method: | Least Squares | F-statistic: | 1908. |
Date: | Mon, 23 Oct 2017 | Prob (F-statistic): | 0.00 |
Time: | 16:37:05 | Log-Likelihood: | -3.5601e+05 |
No. Observations: | 33208 | AIC: | 7.120e+05 |
Df Residuals: | 33199 | BIC: | 7.121e+05 |
Df Model: | 8 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 1147.6640 | 313.997 | 3.655 | 0.000 | 532.218 | 1763.110 |
C(HH_head_Sex)[T.2] | 919.0121 | 161.503 | 5.690 | 0.000 | 602.460 | 1235.565 |
C(Education)[T.2.0] | 6023.8625 | 140.904 | 42.751 | 0.000 | 5747.685 | 6300.040 |
C(Education)[T.3.0] | 1.196e+04 | 217.209 | 55.058 | 0.000 | 1.15e+04 | 1.24e+04 |
C(Education)[T.4.0] | 1.873e+04 | 282.176 | 66.368 | 0.000 | 1.82e+04 | 1.93e+04 |
C(Education)[T.5.0] | 1.679e+04 | 742.048 | 22.624 | 0.000 | 1.53e+04 | 1.82e+04 |
C(Urbanity)[T.1] | 7105.2245 | 127.941 | 55.535 | 0.000 | 6854.455 | 7355.994 |
Family_Size | 1666.8464 | 29.035 | 57.409 | 0.000 | 1609.937 | 1723.756 |
HH_head_Age | 116.5759 | 4.681 | 24.902 | 0.000 | 107.400 | 125.752 |
Omnibus: | 3597.783 | Durbin-Watson: | 1.606 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 4994.573 |
Skew: | 0.865 | Prob(JB): | 0.00 |
Kurtosis: | 3.786 | Cond. No. | 642. |
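To make the fitted coefficients concrete, the following sketch evaluates the linear predictor by hand for one household. It is an illustration only, not part of the original notebook: the function `predict_income` and the dictionary layout are hypothetical, and the coefficients are copied from the printed output (the value for education level 5.0 is taken from the rounded summary, `1.679e+04`).

```python
# Coefficients copied from the regression output in this notebook.
# The base categories (HH_head_Sex == 1, Education == 1.0, Urbanity == 0)
# contribute nothing beyond the intercept.
COEF = {
    "Intercept": 1147.6640,
    "Sex_2": 919.0121,        # C(HH_head_Sex)[T.2]
    "Urbanity_1": 7105.2245,  # C(Urbanity)[T.1]
    "Family_Size": 1666.8464,
    "HH_head_Age": 116.5759,
}
EDUCATION = {
    1.0: 0.0,                 # base category
    2.0: 6023.86254599,
    3.0: 11959.091528,
    4.0: 18727.4606703,
    5.0: 16790.0,             # rounded value from the summary (1.679e+04)
}


def predict_income(family_size, sex, age, education, urbanity):
    """Evaluate the fitted linear model for a single household (hypothetical helper)."""
    income = COEF["Intercept"]
    income += COEF["Family_Size"] * family_size
    income += COEF["HH_head_Age"] * age
    income += COEF["Sex_2"] * (1 if sex == 2 else 0)
    income += COEF["Urbanity_1"] * (1 if urbanity == 1 else 0)
    income += EDUCATION[education]
    return income


# First row of income_data.head(): size 5.5, sex 1, age 52, education 2.0, rural.
predicted = predict_income(5.5, 1, 52, 2.0, 0)
print(round(predicted, 2))
```

For this household the model predicts roughly 22,401, against an observed `Total_Family_Income` of 23,939.67. With an R-squared of 0.315, deviations of this size for individual households are to be expected.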
In [24]:
params_inc = change_index(model_results_inc.params)
bse_inc = change_index(model_results_inc.bse)
inc = pd.concat([params_inc, bse_inc], axis=1)
inc.columns = ['co_mu', 'co_sd']
inc = compute_categories(inc)
In [25]:
inc.loc['Urbanity', 'p'] = (income_data.Urbanity == 1).sum() / income_data.shape[0]
inc.loc['Sex', 'p'] = (income_data.HH_head_Sex == 2).sum() / income_data.shape[0]
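The two `p` values above are sample shares, which serve as the success probabilities of the Bernoulli variables. As a small check of the idea, the sketch below uses the five `HH_head_Sex` values shown in `income_data.head()` (not the full survey) and confirms that the count-over-length form matches the mean of the boolean mask.

```python
import pandas as pd

# HH_head_Sex values from the five rows displayed by income_data.head().
sex = pd.Series([1, 1, 1, 2, 1], name="HH_head_Sex")

# Share computed as in the notebook: count of matches over number of rows.
p_count = (sex == 2).sum() / sex.shape[0]

# Equivalent form: the mean of a boolean mask is the share of True values.
p_mean = (sex == 2).mean()

print(p_count, p_mean)
```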
In [26]:
inc.loc[:, 'mu'] = np.nan
inc.loc[:, 'sd'] = np.nan
# The intercept is a fixed term: store its value in 'p' and clear the coefficient columns.
inc.loc['Intercept', 'p'] = inc.loc['Intercept', 'co_mu']
inc.loc['Intercept', ['co_mu', 'co_sd']] = np.nan
In [27]:
inc.loc['Education','dis'] = 'Categorical'
inc.loc['Urbanity', 'dis'] = 'Bernoulli'
inc.loc['Sex', 'dis'] = 'Bernoulli'
inc.loc['FamilySize', 'dis'] = 'Poisson'
inc.loc['Intercept', 'dis'] = 'Deterministic'
inc.loc['Age', 'dis'] = 'Normal'
In [28]:
inc.loc[:,'ub'] = np.nan
inc.loc[:,'lb'] = np.nan
inc.loc['FamilySize', 'lb'] = 1
inc.loc['FamilySize', 'ub'] = 10
inc.loc['Age', 'ub'] = 100
inc.loc['Age', 'lb'] = 18
In [29]:
inc.index = ['i_'+i for i in inc.index]
In [30]:
inc.to_csv('data/table_inc.csv')
In [31]:
inc
Out[31]:
| | co_mu | co_sd | p | mu | sd | dis | ub | lb |
---|---|---|---|---|---|---|---|---|
i_Intercept | NaN | NaN | 1147.663992 | NaN | NaN | Deterministic | NaN | NaN |
i_Sex | 919.012 | 161.503 | 0.193718 | NaN | NaN | Bernoulli | NaN | NaN |
i_Urbanity | 7105.22 | 127.941 | 0.403005 | NaN | NaN | Bernoulli | NaN | NaN |
i_FamilySize | 1666.85 | 29.0348 | NaN | NaN | NaN | Poisson | 10.0 | 1.0 |
i_Age | 116.576 | 4.68139 | NaN | NaN | NaN | Normal | 100.0 | 18.0 |
i_Education | 1.0,6023.86254599,11959.091528,18727.4606703,1... | 1e-10,140.904404522,217.208790314,282.17614554... | NaN | NaN | NaN | Categorical | NaN | NaN |
The income model is defined as a table model. This table contains all the required information for the simulation model to construct a proxy sample.
The table model defines the coefficient used for the estimation of income, `co_mu`, with a corresponding standard deviation, `co_sd`. The columns `p`, `mu` and `sd` hold the parameters of each variable's distribution, and the distribution type is defined in column `dis`. The values `ub` and `lb` give the distribution an upper and a lower bound.
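As a rough illustration of how such a table could drive a simulation, the sketch below draws one synthetic household and computes its income. It is not the urbanmetabolism implementation. Several simplifications are assumed: the `mu` and `sd` values for family size and age are placeholders (the table leaves them as NaN at this stage), coefficients are fixed at `co_mu` and `co_sd` is ignored, education is left at its base category, and bounds are applied by clipping. The function `draw_household` is hypothetical.

```python
import numpy as np

# Values copied from the table above, except where marked as assumptions.
INTERCEPT = 1147.663992
SEX = {"co_mu": 919.012, "p": 0.193718}
URBANITY = {"co_mu": 7105.22, "p": 0.403005}
FAMILY_SIZE = {"co_mu": 1666.85, "mu": 4.0, "lb": 1, "ub": 10}          # mu is assumed
AGE = {"co_mu": 116.576, "mu": 45.0, "sd": 15.0, "lb": 18, "ub": 100}   # mu, sd are assumed


def draw_household(rng):
    """Draw one household and compute its income (hypothetical helper)."""
    # Bernoulli variables: a single binomial trial with probability p.
    sex = int(rng.binomial(1, SEX["p"]))
    urbanity = int(rng.binomial(1, URBANITY["p"]))
    # Poisson and Normal draws, clipped to the [lb, ub] interval.
    size = int(np.clip(rng.poisson(FAMILY_SIZE["mu"]), FAMILY_SIZE["lb"], FAMILY_SIZE["ub"]))
    age = float(np.clip(rng.normal(AGE["mu"], AGE["sd"]), AGE["lb"], AGE["ub"]))
    # Linear predictor with coefficients fixed at co_mu.
    income = (
        INTERCEPT
        + SEX["co_mu"] * sex
        + URBANITY["co_mu"] * urbanity
        + FAMILY_SIZE["co_mu"] * size
        + AGE["co_mu"] * age
    )
    return {"sex": sex, "urbanity": urbanity, "family_size": size, "age": age, "income": income}


rng = np.random.default_rng(42)
household = draw_household(rng)
print(household)
```

Clipping concentrates probability mass at the bounds. Redrawing until a value falls inside `[lb, ub]` is an alternative that avoids this, at the cost of extra draws.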