Multiple models on NESARC data (with Python)

1 minute read


Can we predict antisocial disorder among young women aged 18 to 28?

Background

In the last post I used a decision tree for prediction. In this one I will use a random forest and logistic regression. The predictors are:

  • S1Q6A HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
  • S1Q11A TOTAL FAMILY INCOME IN LAST 12 MONTHS
  • MAJORDEPLIFE MAJOR DEPRESSION IN LAST 12 MONTHS
  • SOCPDLIFE SOCIAL PHOBIA - LIFETIME (NON-HIERARCHICAL)
  • GENAXLIFE GENERALIZED ANXIETY DISORDER - LIFETIME
  • HISTDX2 HISTRIONIC PERSONALITY DISORDER (LIFETIME DIAGNOSIS)
  • S11BQ1 (BLOOD/NATURAL FATHER EVER HAD BEHAVIOR PROBLEMS)
  • REGION
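Loading the data and selecting these columns might look like the following. This is a minimal sketch: the target column name `ANTISOCDX2` and the toy data frame standing in for the real NESARC file are assumptions, not from the original post.

```python
import pandas as pd

# Hypothetical: in practice the NESARC CSV would be read from disk, e.g.
# df = pd.read_csv('nesarc_pds.csv', low_memory=False)
# A tiny synthetic frame stands in for the real data here.
df = pd.DataFrame({
    'S1Q6A': ['10', '12', ' '],         # highest grade completed
    'S1Q11A': ['25000', ' ', '60000'],  # total family income
    'MAJORDEPLIFE': [0, 1, 0],
    'SOCPDLIFE': [0, 0, 1],
    'GENAXLIFE': [1, 0, 0],
    'HISTDX2': [0, 1, 0],
    'S11BQ1': [1, 2, 9],
    'REGION': [1, 3, 4],
    'ANTISOCDX2': [0, 1, 0],            # assumed target column name
})

predictors = ['S1Q6A', 'S1Q11A', 'MAJORDEPLIFE', 'SOCPDLIFE',
              'GENAXLIFE', 'HISTDX2', 'S11BQ1', 'REGION']

# NESARC stores blanks as ' ': coerce the numeric columns and
# drop rows with missing values before modeling
for col in ['S1Q6A', 'S1Q11A']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna(subset=predictors + ['ANTISOCDX2'])

X, y = df[predictors], df['ANTISOCDX2']
print(X.shape)
```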

Here I use a scikit-learn pipeline to chain the preprocessing and the model.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

cat_col = ['MAJORDEPLIFE', 'HISTDX2', 'S11BQ1', 'SOCPDLIFE', 'GENAXLIFE', 'REGION']
num_col = ['S1Q11A', 'S1Q6A']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preproces = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_col),
        ('cat', categorical_transformer, cat_col)])

classifiers = [
    RandomForestClassifier(),
    LogisticRegression()] 

#  Create the pipeline for each model and evaluate
acc = []
model_names = []
for i, clf in enumerate(classifiers):
    
    # Define pipeline with the model
    full_pipeline = Pipeline(steps=[('preprocessing', preproces),
                                    ('model', clf)])

    # Fit on the training data and run 5-fold cross-validation
    full_pipeline.fit(X_train, y_train)
    acc.append(cross_val_score(full_pipeline, X_train, y_train, scoring='accuracy', cv=5))
    
    # Display training and testing accuracy
    model_names.append(clf.__class__.__name__)
    print('{} Training Score: {}'.format(model_names[i], round(full_pipeline.score(X_train, y_train),4)))

    print('{} Testing Score: {}'.format(model_names[i], round(full_pipeline.score(X_test, y_test),4)))
    
#  Boxplot to visualize the scores
plt.boxplot(acc, labels=model_names, showmeans=True)
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.show()

Using cross-validation, we can see that logistic regression achieves higher accuracy than the random forest.
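Besides the boxplot, the comparison can be summarized as the mean and standard deviation of the fold scores. The score values below are synthetic, for illustration only; `acc` and `model_names` are the lists built in the loop above.

```python
import numpy as np

# acc holds one array of 5 fold accuracies per model (synthetic values here)
acc = [np.array([0.91, 0.92, 0.90, 0.93, 0.91]),   # RandomForestClassifier
       np.array([0.94, 0.93, 0.95, 0.94, 0.93])]   # LogisticRegression
model_names = ['RandomForestClassifier', 'LogisticRegression']

for name, scores in zip(model_names, acc):
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```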

Most important features?

The most important features for predicting antisocial disorder among young women are:

  • S1Q11A TOTAL FAMILY INCOME IN LAST 12 MONTHS
  • S1Q6A HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
  • REGION

To conclude

Families of young women with antisocial disorder (coded 1 = "YES" and 0 = "NO") have lower income than families of those without it. Total family income is therefore a good predictor of this disorder.
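The income comparison behind this conclusion can be sketched as a group mean, e.g. with pandas. The data and the target column name `ANTISOCDX2` below are assumptions for illustration, not from the post.

```python
import pandas as pd

# Synthetic illustration; 'ANTISOCDX2' is an assumed name for the
# antisocial-disorder indicator (1 = yes, 0 = no), not from the post.
df = pd.DataFrame({
    'ANTISOCDX2': [1, 1, 0, 0, 0, 1],
    'S1Q11A':     [18000, 22000, 45000, 52000, 38000, 25000],
})

# Mean total family income per diagnosis group
income_by_group = df.groupby('ANTISOCDX2')['S1Q11A'].mean()
print(income_by_group)
```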