Python Practice | Converting Categorical Variables to Numeric Variables

Problem 1.

Preliminary commands

import seaborn as sns
import pandas as pd

Using the Diamonds dataset provided by the Seaborn library, do the following:
1. Delete every row in which any of x, y, z is 0.
2. Create a 'volume' column by multiplying x, y, and z.
3. Convert the categorical variables cut, color, and clarity to numeric values so they can be used in a machine learning model.
  - cut: encode as Fair = 1, Good = 2, Very Good = 3, Premium = 4, Ideal = 5.
  - color: encode E, I, J, H, F, G, D as 1 through 7, in that order.
  - clarity:
    - SI2 and SI1 → 1 (S group),
    - VS1 and VS2 → 2 (VS group),
    - VVS2 and VVS1 → 3 (VVS group),
    - I1 and IF → 4 (I group).
skeleton code

diamonds = sns.load_dataset("diamonds")

# Convert Categorical -> String
diamonds["cut"] = diamonds["cut"].astype(str)
diamonds["color"] = diamonds["color"].astype(str)
diamonds["clarity"] = diamonds["clarity"].astype(str)

"""
Enter your code here
"""

# Print the result
print(diamonds.shape)
print(diamonds.head())

Example output

'''
(53920, 11)
   carat cut color clarity  depth  table  price     x     y     z     volume
0   0.23   5     1       1   61.5   55.0    326  3.95  3.98  2.43  38.202030
1   0.21   4     1       1   59.8   61.0    326  3.89  3.84  2.31  34.505856
2   0.23   2     1       2   56.9   65.0    327  4.05  4.07  2.31  38.076885
3   0.29   4     2       2   62.4   58.0    334  4.20  4.23  2.63  46.724580
4   0.31   2     3       1   63.3   58.0    335  4.34  4.35  2.75  51.917250
'''

💡 Solutions

Note) Ways to convert a categorical variable to a numeric variable

Dictionary + lambda
replace
loc
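
The three approaches listed above can be compared side by side on a tiny example. This is only a sketch: the `cut` Series below is made-up toy data rather than the real diamonds column, and the names `cut_codes`, `by_lambda`, `by_replace`, and `by_loc` are my own.

```python
import pandas as pd

# Toy stand-in for the diamonds "cut" column (hypothetical values).
cut = pd.Series(["Fair", "Good", "Ideal", "Good"], dtype=object)
cut_codes = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# 1) Dictionary + lambda: look each value up in the dictionary.
by_lambda = cut.apply(lambda c: cut_codes[c])

# 2) replace: pass the whole dictionary at once.
#    astype(int) makes the integer dtype explicit instead of relying on inference.
by_replace = cut.replace(cut_codes).astype(int)

# 3) loc: overwrite one category at a time using a boolean condition.
by_loc = cut.copy()
for label, code in cut_codes.items():
    by_loc.loc[by_loc == label] = code
by_loc = by_loc.astype(int)

print(by_lambda.tolist(), by_replace.tolist(), by_loc.tolist())
```

All three should give the same codes; which one to use is mostly a matter of what you remember under time pressure.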

`Solution 1`

Deleting data: use drop
Converting variables: use dictionary + lambda

diamonds = sns.load_dataset("diamonds")

# Convert Categorical -> String
diamonds["cut"] = diamonds["cut"].astype(str)
diamonds["color"] = diamonds["color"].astype(str)
diamonds["clarity"] = diamonds["clarity"].astype(str)

# 2) Create the 'volume' column by multiplying x, y, and z (done first here)
diamonds['volume'] = diamonds['x']*diamonds['y']*diamonds['z']
# 1) Delete rows where any of x, y, z is 0 (their volume is then 0)
diamonds = diamonds.drop(diamonds[diamonds['volume'] == 0].index)

# 3) Convert the categorical variables cut, color, clarity to numeric values for use in a machine learning model
cut_dictionary = {'Fair' : 1 , 'Good' : 2, 'Very Good' : 3, 'Premium' : 4, 'Ideal' : 5}
color_dictionary = {'E':1,'I':2,'J':3,'H':4,'F':5,'G':6,'D':7}
clarity_dictionary = {'SI2':1,'SI1':1,'VS1':2,'VS2':2,'VVS2':3,'VVS1':3,'I1':4,'IF':4}

diamonds['cut'] = diamonds['cut'].apply(lambda x: cut_dictionary[x])
diamonds['color'] = diamonds['color'].apply(lambda x: color_dictionary[x])
diamonds['clarity'] = diamonds['clarity'].apply(lambda x: clarity_dictionary[x])

# Print the result
print(diamonds.shape)
print(diamonds.head())

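`Review notes`

Create the volume column (x * y * z) first: if any of the three is 0, the product is 0, so after that find the index of the rows whose volume is 0 and drop them!
`drop` : df.drop(index)
A dictionary can be used to convert categorical data to numeric data!
`dictionary` : dictionary[key] returns value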
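
To see the two notes above in action, here is a small sketch on made-up rows (the `toy` frame and its values are hypothetical, not taken from the dataset). It also uses `Series.map`, which is not in the original solution: as far as I know, it accepts a dictionary directly, and it yields NaN for a missing key where `cut_dictionary[x]` inside a lambda would raise a KeyError.

```python
import pandas as pd

# Hypothetical rows: index 1 has x == 0 and index 2 has z == 0.
toy = pd.DataFrame({
    "x": [3.95, 0.0, 4.05],
    "y": [3.98, 3.84, 4.07],
    "z": [2.43, 2.31, 0.0],
    "cut": ["Ideal", "Premium", "Good"],
})
cut_codes = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# The product is 0 exactly when at least one dimension is 0.
toy["volume"] = toy["x"] * toy["y"] * toy["z"]

# Find the index labels of the zero-volume rows, then drop them.
zero_index = toy[toy["volume"] == 0].index
toy = toy.drop(zero_index)

# Dictionary lookup for every value; map takes the dictionary as-is.
toy["cut"] = toy["cut"].map(cut_codes)

print(toy)
```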

`Solution 2`

Deleting data: use & conditions
Converting variables: use replace

diamonds = sns.load_dataset("diamonds")

# Convert Categorical -> String
diamonds["cut"] = diamonds["cut"].astype(str)
diamonds["color"] = diamonds["color"].astype(str)
diamonds["clarity"] = diamonds["clarity"].astype(str)

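# 1) Delete rows where any of x, y, z is 0 (keep rows where all three are nonzero)
# .copy() keeps the later column assignment from triggering SettingWithCopyWarning in some pandas versions
diamonds = diamonds[(diamonds.x != 0) & (diamonds.y != 0) & (diamonds.z != 0)].copy()

# 2) Create the 'volume' column by multiplying x, y, and z
diamonds['volume'] = diamonds['x']*diamonds['y']*diamonds['z']

# 3) Convert the categorical variables cut, color, clarity to numeric values for use in a machine learning model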
diamonds['cut'] = diamonds['cut'].replace({'Fair' : 1 , 'Good' : 2, 'Very Good' : 3, 'Premium' : 4, 'Ideal' : 5})
diamonds['color'] = diamonds['color'].replace({'E':1,'I':2,'J':3,'H':4,'F':5,'G':6,'D':7})
diamonds['clarity'] = diamonds['clarity'].replace({'SI2':1,'SI1':1,'VS1':2,'VS2':2,'VVS2':3,'VVS1':3,'I1':4,'IF':4})

# Print the result
print(diamonds.shape)
print(diamonds.head())
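
"Delete rows where any of x, y, z is 0" is the same as "keep rows where all of x, y, z are nonzero", which is why the conditions are joined with &. A sketch of the same filter written with `all(axis=1)` instead, on made-up rows (the `toy` frame and the names `nonzero` and `clean` are my own):

```python
import pandas as pd

# Hypothetical rows: index 1 has x == 0 and index 2 has y == 0.
toy = pd.DataFrame({
    "x": [3.95, 0.0, 4.05, 4.20],
    "y": [3.98, 3.84, 0.0, 4.23],
    "z": [2.43, 2.31, 2.31, 2.63],
})

# True only for rows where x, y, and z are all nonzero.
nonzero = (toy[["x", "y", "z"]] != 0).all(axis=1)

# Take an explicit copy so the new column is added to an independent frame.
clean = toy[nonzero].copy()
clean["volume"] = clean["x"] * clean["y"] * clean["z"]

print(clean)
```

The row-wise check gets more convenient as the number of columns grows, since you only extend the column list.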

`Review notes`

At first I chained replace calls one after another, as shown below, but a dictionary expresses the same thing much more simply.

diamonds_not0['cut_ml'] = diamonds_not0['cut'].replace('Fair', 1).replace('Good', 2).replace('Very Good', 3).replace('Premium', 4).replace('Ideal',5)
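
A quick check that the chained version and the dictionary version agree, on a toy Series (hypothetical values; the names `chained` and `compact` are my own). Note that `replace` with a plain string matches whole values, so replacing 'Good' does not touch 'Very Good'. The final `astype(int)` is there because, as far as I know, newer pandas versions may leave the result as object dtype rather than inferring integers.

```python
import pandas as pd

cut = pd.Series(["Fair", "Very Good", "Good", "Ideal", "Premium"], dtype=object)

# Chained form: one replace call per category.
chained = (
    cut.replace("Fair", 1)
       .replace("Good", 2)
       .replace("Very Good", 3)
       .replace("Premium", 4)
       .replace("Ideal", 5)
       .astype(int)
)

# Dictionary form: a single call with the whole mapping.
mapping = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
compact = cut.replace(mapping).astype(int)

print(chained.equals(compact))
```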

`Solution 3`

Converting variables: use loc

# loc can also convert the categorical data to numeric data one value at a time
diamonds.loc[diamonds['cut']=='Fair','cut']=1
diamonds.loc[diamonds['cut']=='Good','cut']=2
diamonds.loc[diamonds['cut']=='Very Good','cut']=3
...

If the other methods suddenly slip your mind during a query test, ask yourself whether loc could do the job!
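
One thing to watch with the loc approach: because the column started out as strings, it typically keeps the object dtype even after every value has been overwritten with a number, so a final `astype(int)` is worth adding. A sketch on toy data (the `toy` frame is hypothetical; the mapping is the one from the problem statement, applied in a loop instead of one line per category):

```python
import pandas as pd

toy = pd.DataFrame(
    {"cut": pd.Series(["Fair", "Good", "Very Good", "Premium", "Ideal"], dtype=object)}
)
cut_codes = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# Same idea as writing one loc line per category, just driven by the dictionary.
for label, code in cut_codes.items():
    toy.loc[toy["cut"] == label, "cut"] = code

# The values are numbers now, but the column dtype has not been converted yet.
print(toy["cut"].dtype)

# Convert explicitly so models and arithmetic see an integer column.
toy["cut"] = toy["cut"].astype(int)
print(toy["cut"].dtype)
```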
