Multiple models on NESARC data (with Python)

1 minute read


Can we predict antisocial disorder among young women aged 18 to 28?

Background

In the last post I used a decision tree for prediction. In this one I will use a random forest and logistic regression. The predictors are:

  • S1Q6A HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
  • S1Q11A TOTAL FAMILY INCOME IN LAST 12 MONTHS
  • MAJORDEPLIFE MAJOR DEPRESSION IN LAST 12 MONTHS
  • SOCPDLIFE SOCIAL PHOBIA - LIFETIME (NON-HIERARCHICAL)
  • GENAXLIFE GENERALIZED ANXIETY DISORDER - LIFETIME
  • HISTDX2 HISTRIONIC PERSONALITY DISORDER (LIFETIME DIAGNOSIS)
  • S11BQ1 (BLOOD/NATURAL FATHER EVER HAD BEHAVIOR PROBLEMS)
  • REGION
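Loading the data and selecting these columns might look like the following. This is a minimal sketch: the target column name `ANTISOCDX2` and the toy data frame standing in for the real NESARC file are assumptions, not from the original post.

```python
import pandas as pd

# Hypothetical: in practice the NESARC CSV would be read from disk, e.g.
# df = pd.read_csv('nesarc_pds.csv', low_memory=False)
# A tiny synthetic frame stands in for the real data here.
df = pd.DataFrame({
    'S1Q6A': ['10', '12', ' '],         # highest grade completed
    'S1Q11A': ['25000', ' ', '60000'],  # total family income
    'MAJORDEPLIFE': [0, 1, 0],
    'SOCPDLIFE': [0, 0, 1],
    'GENAXLIFE': [1, 0, 0],
    'HISTDX2': [0, 1, 0],
    'S11BQ1': [1, 2, 9],
    'REGION': [1, 3, 4],
    'ANTISOCDX2': [0, 1, 0],            # assumed target column name
})

predictors = ['S1Q6A', 'S1Q11A', 'MAJORDEPLIFE', 'SOCPDLIFE',
              'GENAXLIFE', 'HISTDX2', 'S11BQ1', 'REGION']

# NESARC stores blanks as ' ': coerce the numeric columns and
# drop rows with missing values before modeling
for col in ['S1Q6A', 'S1Q11A']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna(subset=predictors + ['ANTISOCDX2'])

X, y = df[predictors], df['ANTISOCDX2']
print(X.shape)
```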

Here I use a scikit-learn pipeline to chain the preprocessing and the model.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

cat_col = ['MAJORDEPLIFE', 'HISTDX2', 'S11BQ1', 'SOCPDLIFE', 'GENAXLIFE', 'REGION']
num_col = ['S1Q11A', 'S1Q6A']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preproces = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_col),
        ('cat', categorical_transformer, cat_col)])

classifiers = [
    RandomForestClassifier(),
    LogisticRegression()] 

#  Create the pipeline for each model and evaluate
acc = []
model_names = []
for i, clf in enumerate(classifiers):
    
    # Define pipeline with the model
    full_pipeline = Pipeline(steps=[('preprocessing', preproces),
                                    ('model', clf)])

    # Fit on the training data and run 5-fold cross-validation
    full_pipeline.fit(X_train, y_train)
    acc.append(cross_val_score(full_pipeline, X_train, y_train, scoring='accuracy', cv=5))
    
    # Display training and testing accuracy
    model_names.append(clf.__class__.__name__)
    print('{} Training Score: {}'.format(model_names[i], round(full_pipeline.score(X_train, y_train),4)))

    print('{} Testing Score: {}'.format(model_names[i], round(full_pipeline.score(X_test, y_test),4)))
    
#  Boxplot to visualize the scores
plt.boxplot(acc, labels=model_names, showmeans=True)
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.show()

Using cross-validation, we can see that logistic regression achieves higher accuracy than the random forest.
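Besides the boxplot, the comparison can be summarized as the mean and standard deviation of the fold scores. The score values below are synthetic, for illustration only; `acc` and `model_names` are the lists built in the loop above.

```python
import numpy as np

# acc holds one array of 5 fold accuracies per model (synthetic values here)
acc = [np.array([0.91, 0.92, 0.90, 0.93, 0.91]),   # RandomForestClassifier
       np.array([0.94, 0.93, 0.95, 0.94, 0.93])]   # LogisticRegression
model_names = ['RandomForestClassifier', 'LogisticRegression']

for name, scores in zip(model_names, acc):
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```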

Most important features?

The most important features for predicting antisocial disorder among young women are:

  • S1Q11A TOTAL FAMILY INCOME IN LAST 12 MONTHS
  • S1Q6A HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
  • REGION

To conclude

Families of young women with antisocial disorder (coded 1 = "YES" and 0 = "NO") have lower income than families of those without it. Total family income is therefore a good predictor of this disorder.
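The income comparison behind this conclusion can be sketched as a group mean, e.g. with pandas. The data and the target column name `ANTISOCDX2` below are assumptions for illustration, not from the post.

```python
import pandas as pd

# Synthetic illustration; 'ANTISOCDX2' is an assumed name for the
# antisocial-disorder indicator (1 = yes, 0 = no), not from the post.
df = pd.DataFrame({
    'ANTISOCDX2': [1, 1, 0, 0, 0, 1],
    'S1Q11A':     [18000, 22000, 45000, 52000, 38000, 25000],
})

# Mean total family income per diagnosis group
income_by_group = df.groupby('ANTISOCDX2')['S1Q11A'].mean()
print(income_by_group)
```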