Bank Customer Churn Analysis Project: DAY 1

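Topic: [Classification] Classifying whether a bank customer churns
Data: https://www.kaggle.com/datasets/shubhammeshram579/bank-customer-churn-prediction/data
Project goal: Analyze customer data to build a classification model that predicts churning users, and use it to draw up retention strategies for each customer type.
Key content: Compare performance metrics across several machine learning models to select the best one, cluster the customers, and then propose churn-prevention strategies for each group predicted to churn.
Must include: 1. A performance table for each model 2. Customer clustering => concrete strategy proposals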

EDA

Summary

	Column	Description	Missing values	Outliers	Variable type	dtype
1	RowNumber	Row number	0		-	int64
2	CustomerId	Customer ID	0		-	int64
3	Surname	Customer surname	0		-	object
4	CreditScore	Credit score	0	Outliers: 15 Lower: 383.0 Upper: 919.0	Continuous	int64
5	Geography	Country	1 (CustomerId 15592531)		Categorical	object
6	Gender	Gender	0		Categorical	object
7	Age	Age	1 (CustomerId 15592389)	Outliers: 359 Lower: 14.0 Upper: 62.0	Continuous	float64
8	Tenure	Tenure (length of membership)	0		Discrete	int64
9	Balance	Account balance	0	Outliers: 0	Continuous	float64
10	NumOfProducts	Number of bank products used	0		Discrete	int64
11	HasCrCard	Has a credit card	1 (CustomerId 15737888)		Categorical	float64
12	IsActiveMember	Currently an active member	1 (CustomerId 15792365)		Categorical	float64
13	EstimatedSalary	Estimated salary	0	Outliers: 0	Continuous	float64
14	Exited	Churned	0		Categorical	int64

1. Checking for missing values

# Colab setup
# ▶ Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# ▶ Mount Google Drive (or click the folder icon and connect Google Drive)
from google.colab import drive
drive.mount('/content/drive')

# ▶ Set the working directory (check the path where the course materials are located)
import os
os.chdir('/content/drive/MyDrive')   # directory path only, no file name
os.getcwd()

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('data/Churn_Modelling.csv')
df

df.info()

# Check for missing values
df.isna().sum()

# Are the missing values all in the same row? -> No, each belongs to a different customer
for col in ['Geography', 'Age', 'HasCrCard', 'IsActiveMember']:
    print(col, df.loc[df[col].isna(), 'CustomerId'].tolist())

2. Checking for duplicates

Of the 10,002 rows, two customers appear twice (2 duplicate pairs), so only one row per customer should be kept.

duplicates = df['CustomerId'].value_counts()[lambda x: x > 1]
duplicates
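
The post says only one row per duplicated customer should remain but does not show that step. A minimal sketch of one way to do it, assuming the first occurrence is the one to keep; the toy frame and its values are made up for illustration and stand in for the real df:

```python
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    'CustomerId': [101, 102, 102, 103, 103],
    'Exited':     [0,   1,   1,   0,   0],
})

# Keep the first row for each CustomerId and drop the later repeats.
deduped = toy.drop_duplicates(subset='CustomerId', keep='first').reset_index(drop=True)

print(len(toy), '->', len(deduped))
```

On the real data the same call would be applied to df. Whether to keep the first or the last copy is a judgment call when the duplicated rows are identical.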

3. Checking for outliers

df.describe()

CreditScore outliers -> 15. The values are simply low; they don't look miscounted or severe enough to remove for the analysis, so they are kept.

# Outlier check for a continuous variable - CreditScore
# IQR method
credit_q1 = np.percentile(df['CreditScore'], 25)
credit_q3 = np.percentile(df['CreditScore'], 75)
credit_iqr = credit_q3 - credit_q1
credit_lower_bound = credit_q1 - 1.5 * credit_iqr
credit_upper_bound = credit_q3 + 1.5 * credit_iqr

# Find the outliers
credit_score_outliers = df[(df['CreditScore'] < credit_lower_bound) | (df['CreditScore'] > credit_upper_bound)]
print('CreditScore 이상치 개수:', len(credit_score_outliers))
print('Lower Bound: ',credit_lower_bound,'\nUpper Bound: ', credit_upper_bound)
print(credit_score_outliers)

# # Visualization: boxplot
# plt.figure(figsize = (10, 5))
# plt.boxplot(df['CreditScore'])
# plt.show()

# Visualization - histogram
sns.histplot(df['CreditScore'], bins=30, edgecolor='black', color='skyblue')
plt.show()
# Visualization - boxplot (pass the column explicitly as x)
sns.boxplot(x=df['CreditScore'])
plt.show()

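Age outliers -> 359. Older customers are flagged as outliers, but the maximum age is 92, which doesn't look like something that has to be removed, so they are kept.

# Outlier check for a continuous variable - Age
# IQR method
age_cleaned = df['Age'].dropna() # drop missing values; with NaN present the percentiles come out as NaN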

age_q1 = np.percentile(age_cleaned, 25)
age_q3 = np.percentile(age_cleaned, 75)
age_iqr = age_q3 - age_q1
age_lower_bound = age_q1 - 1.5 * age_iqr
age_upper_bound = age_q3 + 1.5 * age_iqr

age_outliers = df[(df['Age'] < age_lower_bound) | (df['Age'] > age_upper_bound)]
print('Age 이상치 개수:', len(age_outliers))
print('Lower Bound: ',age_lower_bound,'\nUpper Bound: ', age_upper_bound)
print(age_outliers)

# Visualization - histogram
sns.histplot(age_cleaned, bins=30, edgecolor='black', color='skyblue')
plt.show()
# Visualization - boxplot (pass the column explicitly as x)
sns.boxplot(x=age_cleaned)
plt.show()

Balance outliers -> none.

# Outlier check for a continuous variable - Balance
# IQR method
balance_q1 = np.percentile(df['Balance'], 25)
balance_q3 = np.percentile(df['Balance'], 75)
balance_iqr = balance_q3 - balance_q1
balance_lower_bound = balance_q1 - 1.5 * balance_iqr
balance_upper_bound = balance_q3 + 1.5 * balance_iqr
# Find the outliers
balance_score_outliers = df[(df['Balance'] < balance_lower_bound) | (df['Balance'] > balance_upper_bound)]
print('Balance 이상치 개수:', len(balance_score_outliers))
print('Lower Bound: ',balance_lower_bound,'\nUpper Bound: ', balance_upper_bound)
print(balance_score_outliers)

# Visualization - histogram
sns.histplot(df['Balance'], bins=30, edgecolor='black', color='skyblue')
plt.show()
# Visualization - boxplot (pass the column explicitly as x)
sns.boxplot(x=df['Balance'])
plt.show()

EstimatedSalary outliers -> none.

# Outlier check for a continuous variable - EstimatedSalary
# IQR method
est_salary_q1 = np.percentile(df['EstimatedSalary'], 25)
est_salary_q3 = np.percentile(df['EstimatedSalary'], 75)
est_salary_iqr = est_salary_q3 - est_salary_q1
est_salary_lower_bound = est_salary_q1 - 1.5 * est_salary_iqr
est_salary_upper_bound = est_salary_q3 + 1.5 * est_salary_iqr
# Find the outliers
est_salary_outliers = df[(df['EstimatedSalary'] < est_salary_lower_bound) | (df['EstimatedSalary'] > est_salary_upper_bound)]
print('EstimatedSalary 이상치 개수:', len(est_salary_outliers))
print('Lower Bound: ',est_salary_lower_bound,'\nUpper Bound: ', est_salary_upper_bound)
print(est_salary_outliers)

# Visualization - histogram
sns.histplot(df['EstimatedSalary'], bins=30, edgecolor='black', color='skyblue')
plt.show()
# Visualization - boxplot (pass the column explicitly as x)
sns.boxplot(x=df['EstimatedSalary'])
plt.show()
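
The four blocks above repeat the same IQR calculation, which makes it easy to mix up variable names between columns. A small sketch of a shared helper, assuming the usual 1.5 * IQR rule; the name iqr_outliers and the toy series are hypothetical and not part of the original notebook:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return (lower, upper, mask) for the k * IQR rule; NaN values are ignored."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    mask = (series < lower) | (series > upper)
    return lower, upper, mask

# Toy series with one obvious outlier and one missing value.
toy = pd.Series([10, 12, 11, 13, 12, 11, 100, None])
lower, upper, mask = iqr_outliers(toy)
print('Lower Bound:', lower, 'Upper Bound:', upper, 'count:', int(mask.sum()))
```

On the real data this could be called in a loop over ['CreditScore', 'Age', 'Balance', 'EstimatedSalary'], with df[mask] giving the flagged rows for each column.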

Discrete vs. continuous vs. categorical variables

# Inspect the variables
df.describe(include = 'all')

# Inspect the data types
df.info()
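
When sorting columns into discrete, continuous, and categorical, it can help to see the dtype next to the number of distinct values, since a float64 column with only two values (like HasCrCard) is really categorical. A rough sketch on a made-up toy frame; the column layout of summary is my own choice, not something from the original notebook:

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', None, 'France'],
    'Age':       [42.0, 35.0, 51.0, None],
    'Tenure':    [2, 8, 1, 2],
})

# One row per column: dtype, number of distinct values, number of missing values.
summary = pd.DataFrame({
    'dtype':     toy.dtypes.astype(str),
    'n_unique':  toy.nunique(),      # NaN is not counted as a distinct value
    'n_missing': toy.isna().sum(),
})
print(summary)
```

The final call on each variable type is still a manual decision; the counts only make it easier to spot candidates.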

Correlation between variables

# Correlation among the numeric columns
numeric_df = df.select_dtypes(include=['float64', 'int64'])
numeric_corr = numeric_df.corr()

# Heatmap
sns.heatmap(numeric_corr, annot=True, cmap='Blues', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap (Numeric-data only)")
plt.show()

# Correlation including the categorical columns as well
from sklearn.preprocessing import LabelEncoder

# Encode on the copy (df_corr) so the original df keeps its string labels
df_corr = df.drop(columns=['Surname','RowNumber','CustomerId'])
df_corr['Geography'] = LabelEncoder().fit_transform(df_corr['Geography'])
df_corr['Gender'] = LabelEncoder().fit_transform(df_corr['Gender'])
correlation_matrix = df_corr.corr()

# Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='Blues', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
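
One caveat with LabelEncoder here: it assigns Geography integer codes in sorted label order, so the correlation treats the countries as if they were ranked. A possible alternative, shown only as a sketch on made-up toy data, is to one-hot encode the column with pd.get_dummies and look at each country's indicator separately:

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France', 'Germany', 'Spain'],
    'Exited':    [0, 0, 1, 0, 1, 1],
})

# One 0/1 indicator column per country instead of a single integer code.
encoded = pd.get_dummies(toy, columns=['Geography'], dtype=int)

# Correlation of each country indicator with the churn flag.
corr_with_exit = encoded.corr()['Exited'].drop('Exited')
print(corr_with_exit)
```

For a two-valued column such as Gender the integer code is already a 0/1 indicator, so label encoding and one-hot encoding give the same picture there.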

Thinking about churned customers and active members

735 customers are Exited (churned) and active at the same time. What kind of case is this?

# Active members who are also churned customers
df[(df['IsActiveMember']==1)&(df['Exited']==1)]
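
To put the active-but-churned count in context, a cross-tabulation of the two flags shows all four combinations at once. A minimal sketch on a made-up toy frame (the numbers are not from the real data):

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    'IsActiveMember': [1, 1, 0, 0, 1, 0],
    'Exited':         [1, 0, 1, 0, 0, 1],
})

# Rows: IsActiveMember, columns: Exited, cells: number of customers.
counts = pd.crosstab(toy['IsActiveMember'], toy['Exited'])

# Share of churned customers within each activity group.
churn_rate = toy.groupby('IsActiveMember')['Exited'].mean()

print(counts)
print(churn_rate)
```

On the real df, counts.loc[1, 1] would correspond to the active-and-churned group discussed above.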

Too many different factors could define a churned customer.

# What is a churned (Exited) customer?
# 1. No balance at the bank (Balance = 0)
# 2. Not active (IsActiveMember = 0)
# 3. No credit card (HasCrCard = 0)
# 4. No products (NumOfProducts = 0)
# Customers matching at least one of 1-4: 7,735
est_exit = df[(df.Balance == 0) | (df.IsActiveMember == 0) | (df.HasCrCard == 0) | (df.NumOfProducts == 0)]['CustomerId'].unique()
len(est_exit)
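
The combined count alone doesn't show which of the four conditions drives it. One way to break it down, sketched on a made-up toy frame, is to build a boolean column per condition and sum each one; the condition names are my own labels:

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    'Balance':        [0.0, 5000.0, 0.0, 1200.0],
    'IsActiveMember': [1, 1, 0, 1],
    'HasCrCard':      [1, 1, 1, 0],
    'NumOfProducts':  [1, 2, 1, 1],
})

# One boolean column per candidate definition of "churned".
flags = pd.DataFrame({
    'zero_balance': toy['Balance'] == 0,
    'inactive':     toy['IsActiveMember'] == 0,
    'no_card':      toy['HasCrCard'] == 0,
    'no_products':  toy['NumOfProducts'] == 0,
})

per_condition = flags.sum()              # customers matching each condition
any_condition = int(flags.any(axis=1).sum())  # customers matching at least one

print(per_condition)
print(any_condition)
```

A condition whose count is zero, or one that covers most of the customers, is probably not a useful definition on its own.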

So I decided to think through 'activity', the most problematic of these, and come up with my own definition.
