Logistic regression implementation keeps producing a suspiciously accurate model
So I've tried to follow the tutorial below and keep getting a model with 99% accuracy, which I seriously doubt. I'm trying to predict risk of injury, which is captured by the InjTot column below.
https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
Input (the first, unlabeled column is the DataFrame index):
        TotClear    BLD_HEIGHT  INITIAL_CALL_HOUR  TOTAL_NUM_PERSONNEL  InjTot  TotClearS         TOTAL_NUM_PERSONNELS  year  ALARM_TO_FD
141677    8.316667           0                 17                   14     0.0  (7.333, 9.683]    (7.0, 594.0]          2011  05
314976   21.483333           0                  9                    4     0.0  (21.0, 28.45]     (-0.001, 4.0]         2013  03
215834    5.666667           0                 23                    4     0.0  NaN               NaN                   2012  03
318900   13.966667           0                 23                   20     0.0  (11.867, 14.167]  (7.0, 594.0]          2013  01
468452    4.050000           0                  5                    4     0.0  (-58.834, 7.333]  (-0.001, 4.0]         NaN   03
338749    4.916667           0                 18                   18     0.0  (-58.834, 7.333]  (7.0, 594.0]          2013  05
263937    5.833333           0                  9                   17     0.0  NaN               NaN                   2012  05
167954  142.833333           0                 16                   13     0.0  NaN               NaN                   2012  05
712047    8.900000           0                 21                    4     0.0  NaN               NaN                   2016  01
143231   28.883333           0                 17                    4     0.0  (28.45, 4321.35]  (-0.001, 4.0]
First I clean the data: I want each categorical feature value represented by its own binary column, which is what the for loop below does. It takes the columns I have interest in, extracts their values, and creates binary (dummy) columns, going from roughly 10 columns to 144. Before that I convert InjTot to a boolean target. Afterwards I will perform recursive feature elimination to obtain the relevant columns for analysis.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Encode event_type as integer category codes.
train['event_type'] = pd.Categorical(train['event_type']).codes

# One-hot encode each categorical variable into binary indicator columns.
cat_vars = ['TotClearS', 'TOTAL_NUM_PERSONNELS', 'year', 'ALARM_TO_FD', 'event_type']
for var in cat_vars:
    cat_list = pd.get_dummies(train[var], prefix=var)
    train = train.join(cat_list)

data_vars = train.columns.values.tolist()
to_keep = [i for i in data_vars if i not in cat_vars]  # computed but not used below

# Binary target: 1 if any injury occurred, else 0.
train['InjInt'] = train['InjTot'].apply(lambda x: 1 if x > 0 else 0)

# Drop the raw target and the original (pre-dummy) columns.
train = train.drop(['TotClearS', 'InjTot', 'TOTAL_NUM_PERSONNELS', 'year', 'ALARM_TO_FD',
                    'event_type', 'BLD_HEIGHT', 'INITIAL_CALL_HOUR', 'TOTAL_NUM_PERSONNEL'], axis=1)

y = train['InjInt']
X = [i for i in train if i not in y]
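A quick aside on that last line: `in` applied to a pandas Series tests membership in the Series index, not its values, so `i not in y` may not actually filter out the target column. A minimal sanity check, assuming train is the cleaned frame from above, is to build X from column names directly and assert the target is gone:

# Sketch: compare column names directly so the target cannot slip into X.
X = [col for col in train.columns if col != 'InjInt']
assert 'InjInt' not in X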
Here I perform the regression on the training matrix, restricted to the columns in X. I do not include the target column (InjInt) in my definition of X.
from sklearn.model_selection import train_test_split
from sklearn import metrics

logreg = LogisticRegression()

# Recursive feature elimination: rank the dummy features and keep the top 15.
# (Note: the RFE selection is not actually applied to the split below.)
rfe = RFE(logreg, n_features_to_select=15)
rfe = rfe.fit(train[X], y)

X_train, X_test, y_train, y_test = train_test_split(train[X], y, test_size=0.3, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
I followed the tutorial for the most part, so I'm not sure why this returns 99% accuracy, unless I completely missed the boat or have an error in my analysis...
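Before trusting that number, one diagnostic worth running (a sketch using y, y_test, and y_pred from above; metrics is already imported) is to compare the accuracy against the class balance and inspect the confusion matrix, since with a rare positive class a model that always predicts 0 can already score ~99%:

# If the positive class is rare, high accuracy may just reflect imbalance.
print(y.value_counts(normalize=True))

# Per-class behaviour on the held-out test set.
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))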
From the values of InjTot, this seems like a regression task (where you are trying to predict numerical values of InjTot). But then you use LogisticRegression and accuracy, which are used for classification tasks, not regression. In this case, LogisticRegression will treat your y values as hard classes and only predict values from them. So if your y has unique values [4, 13, 14, 17, 18, 20], LR will always predict from these only, never 5, or 6, or 10, or 15, etc. Are you sure you want this? Please read more about classification and regression. – Vivek Kumar, 6 mins ago
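A toy illustration of the point in that comment (made-up data, not from the question): LogisticRegression treats the distinct y values as class labels and will only ever predict one of them:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Six distinct y values become six classes; predictions stay within them.
X_demo = np.arange(12).reshape(-1, 1)
y_demo = np.array([4, 4, 13, 13, 14, 14, 17, 17, 18, 18, 20, 20])
clf = LogisticRegression().fit(X_demo, y_demo)
print(set(clf.predict(X_demo)))  # always a subset of {4, 13, 14, 17, 18, 20}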
You can try to check the metrics using cross validation: scikit-learn.org/stable/modules/cross_validation.html – mad_, 14 hours ago
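A sketch of that suggestion, assuming train[X] and y as defined in the question:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy; a mean close to the majority-class
# frequency would again point at imbalance rather than genuine skill.
scores = cross_val_score(LogisticRegression(), train[X], y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())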