Logistic Regression (tried to implement) but keep getting too accurate model

So I've tried to follow the tutorial below and keep getting a 99% accurate model, but I seriously doubt this result. I'm trying to predict risk of injury, which is indicated by InjTot below.

https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

Input (the first, unlabeled column is the index):

            TotClear  BLD_HEIGHT  INITIAL_CALL_HOUR  TOTAL_NUM_PERSONNEL  InjTot         TotClearS  TOTAL_NUM_PERSONNELS  year  ALARM_TO_FD
    141677    8.316667           0                 17                   14     0.0    (7.333, 9.683]          (7.0, 594.0]  2011           05
    314976   21.483333           0                  9                    4     0.0     (21.0, 28.45]         (-0.001, 4.0]  2013           03
    215834    5.666667           0                 23                    4     0.0               NaN                   NaN  2012           03
    318900   13.966667           0                 23                   20     0.0  (11.867, 14.167]          (7.0, 594.0]  2013           01
    468452    4.050000           0                  5                    4     0.0  (-58.834, 7.333]         (-0.001, 4.0]   NaN           03
    338749    4.916667           0                 18                   18     0.0  (-58.834, 7.333]          (7.0, 594.0]  2013           05
    263937    5.833333           0                  9                   17     0.0               NaN                   NaN  2012           05
    167954  142.833333           0                 16                   13     0.0               NaN                   NaN  2012           05
    712047    8.900000           0                 21                    4     0.0               NaN                   NaN  2016           01
    143231   28.883333           0                 17                    4     0.0  (28.45, 4321.35]         (-0.001, 4.0]

So first I clean the data and try to get it into a form where each feature value is represented by a binary column. That is what the for loop does: it takes the columns I'm interested in, extracts their values, and creates binary (dummy) columns. I go from about 10 columns to 144. I also have to take InjTot and convert it to a boolean. I will then perform recursive feature elimination (RFE) to obtain the relevant columns for analysis.

    import pandas as pd

    train['event_type'] = pd.Categorical(train['event_type']).codes

    # One-hot encode each categorical column and join the dummies onto train
    cat_vars = ['TotClearS', 'TOTAL_NUM_PERSONNELS', 'year', 'ALARM_TO_FD', 'event_type']
    for var in cat_vars:
        cat_list = 'var' + '_' + var
        cat_list = pd.get_dummies(train[var], prefix=var)
        data1 = train.join(cat_list)
        train = data1

    data_vars = train.columns.values.tolist()
    to_keep = [i for i in data_vars if i not in cat_vars]

    # Binary target: 1 if any injury was recorded, otherwise 0
    train['InjInt'] = train['InjTot'].apply(lambda x: 1 if x > 0 else 0)

    # Drop the original (non-dummy) columns
    train = train.drop(['TotClearS', 'InjTot', 'TOTAL_NUM_PERSONNELS', 'year', 'ALARM_TO_FD',
                        'event_type', 'BLD_HEIGHT', 'INITIAL_CALL_HOUR', 'TOTAL_NUM_PERSONNEL'], axis=1)

    y = train['InjInt']
    X = [i for i in train if i not in y]
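
One thing that may be worth checking in the final statement above, `X=[i for i in train if i not in y]`: on a pandas Series, `in` tests the index labels rather than the values or the Series name, so this filter may not drop the target column. A minimal sketch with a made-up three-row frame (the dummy column names and values here are illustrative, not the real data):

```python
import pandas as pd

# Toy frame with a non-default integer index, loosely mimicking the post's data.
train = pd.DataFrame(
    {"year_2011": [1, 0, 0], "year_2012": [0, 1, 0], "InjInt": [0, 0, 1]},
    index=[141677, 314976, 215834],
)
y = train["InjInt"]

# `in` on a Series looks at the index labels, not the values or the name.
print("InjInt" in y)  # "InjInt" is not an index label
print(141677 in y)    # 141677 is an index label

# The filter from the post therefore keeps every column, target included.
X_leaky = [col for col in train if col not in y]

# Naming the target explicitly excludes it.
X_clean = [col for col in train.columns if col != "InjInt"]

print(X_leaky)
print(X_clean)
```

If the target does end up among the features, the model can simply read the answer off that column, which would be one possible explanation for a near-perfect score.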

Here I perform the regression on train, restricted to the columns in X that I'm interested in. I do not include the target column (InjInt) in my definition of X.

    from sklearn import metrics
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Recursive feature elimination down to 15 features
    logreg = LogisticRegression()
    rfe = RFE(logreg, n_features_to_select=15)
    rfe = rfe.fit(train[X], y)

    X_train, X_test, y_train, y_test = train_test_split(train[X], y, test_size=0.3, random_state=0)

    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)

    y_pred = logreg.predict(X_test)
    print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

I followed the tutorial for the most part, so I'm not sure why this is returning 99% accuracy, unless I completely missed the boat or have an error in my analysis...
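
A quick sanity check for a suspiciously high accuracy is to compare it against a model that always predicts the most frequent class. Every InjTot value in the sample rows above is 0.0, so injuries may be rare; if so, an all-zeros prediction would already score very high. A rough sketch on synthetic data, where the 1% injury rate and the pure-noise features are assumptions for illustration only:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: noise features and roughly 1% positives (assumed rate).
n = 5000
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 0.01).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
logreg = LogisticRegression().fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
logreg_acc = accuracy_score(y_test, logreg.predict(X_test))

# Rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = confusion_matrix(y_test, logreg.predict(X_test), labels=[0, 1])

print("baseline accuracy:", round(baseline_acc, 3))
print("logreg accuracy:  ", round(logreg_acc, 3))
print(cm)
```

If the real model's accuracy is no better than this kind of baseline, the confusion matrix (or precision and recall for the injury class) is likely a more informative thing to look at than accuracy.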

You can try checking the metrics using cross-validation: scikit-learn.org/stable/modules/cross_validation.html
– mad_
14 hours ago
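
A minimal sketch of what that cross-validation check could look like, using `cross_val_score` with stratified folds. The data below is synthetic and the choice of `balanced_accuracy` as a second metric is an assumption for illustration, not something taken from the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: a rare positive class weakly related to the first feature.
n = 3000
X = rng.normal(size=(n, 4))
y = ((X[:, 0] + rng.normal(scale=2.0, size=n)) > 4.5).astype(int)

# Stratified folds keep the class ratio roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
bal_acc = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")

print("accuracy per fold:         ", acc.round(3))
print("balanced accuracy per fold:", bal_acc.round(3))
```

Plain accuracy can stay high across all folds when one class dominates, so a metric that weights both classes equally is a useful companion.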

From the values of InjTot, this seems like a regression task (where you are trying to predict numerical values of InjTot). But then you use LogisticRegression and accuracy, which are used for classification tasks, not regression. In this case, LogisticRegression will treat your y values as hard classes and only predict values from among them. Your y has the unique values [4, 13, 14, 17, 18, 20], so LR will always predict from these only, never 5, 6, 10, 15, etc. Are you sure you want this? Please read more about classification and regression.
– Vivek Kumar
6 mins ago
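
The point about hard classes can be illustrated with a small sketch on random data (the features are made up; only the label values are taken from the comment). In the question's code the target is the 0/1 column InjInt, so the same logic would limit predictions to 0 and 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Label values taken from the comment; the features are random noise.
labels = np.array([4, 13, 14, 17, 18, 20])
y = rng.choice(labels, size=300)
y[:6] = labels  # make sure every label appears at least once
X = rng.normal(size=(300, 3))

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The classifier only knows the distinct labels it saw during fit.
print(clf.classes_)

# Predictions on new rows are always drawn from that set, never a value in between.
preds = clf.predict(rng.normal(size=(50, 3)))
print(set(preds.tolist()) <= set(labels.tolist()))
```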
