Kaggle Competition - Trying Out Getting Started!
Titanic - Machine Learning from Disaster
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
```python
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

# folder that holds the Kaggle Titanic CSV files
data_folder = "datasets/titanic"
print(os.listdir(data_folder))

train_data = pd.read_csv(os.path.join(data_folder, 'train.csv'))
test_data = pd.read_csv(os.path.join(data_folder, 'test.csv'))

train_data.head()
test_data.head()
```
# Data description
Column | Definition | Key |
---|---|---|
survival | Survival | No = 0, Yes = 1 |
pclass | Ticket class | 1st = 1, 2nd = 2, 3rd = 3 |
sex | Sex | male or female |
age | Age in years | 10, 20, 30 ... |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
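The Key column can be turned into lookup dictionaries to decode the numeric and single-letter codes. The following is a minimal sketch on a toy frame; the rows are hypothetical, and only the column names (`Survived`, `Pclass`, `Embarked`, as used in the code in this post) and the key values come from the table above.

```python
import pandas as pd

# Toy rows standing in for train.csv (hypothetical values, real column names).
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", "Q"],
})

# Lookup tables copied from the Key column of the data description.
survived_labels = {0: "No", 1: "Yes"}
pclass_labels = {1: "1st", 2: "2nd", 3: "3rd"}
embarked_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Series.map replaces each code with its label; unknown codes become NaN.
decoded = pd.DataFrame({
    "Survived": sample["Survived"].map(survived_labels),
    "Pclass": sample["Pclass"].map(pclass_labels),
    "Embarked": sample["Embarked"].map(embarked_labels),
})
print(decoded)
```

Decoding is only for reading the data more easily; a model would normally still be trained on the numeric codes.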
# Checking the data for null values
```python
# list the column names
train_data.keys()
```
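Before going column by column, the missing values of every column can be counted in one pass. This is a minimal sketch on a hypothetical mini-frame (the values are made up); with the real data, `toy` would be replaced by `train_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; swap in train_data for the real check.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_ratio = toy.isnull().mean()   # fraction of missing values per column

print(null_counts)
print(null_ratio)
```

Columns with a count of zero need no further null handling, which narrows down where the per-column inspection matters.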
### passengerId - passenger ID
```python
passengerId = train_data['PassengerId']
print('Unique values: ', set(passengerId))
print('shape: ', passengerId.shape)
```
### survived - survival status
```python
survived = train_data['Survived']
print('Unique values: ', set(survived))
print('shape: ', survived.shape)
```
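The same three lines repeat for every column, so they could be wrapped in a small helper. `summarize_column` below is a hypothetical name introduced for this sketch, and the toy frame holds made-up values.

```python
import pandas as pd

def summarize_column(df, column):
    """Return the distinct non-null values, shape, and null count of one column."""
    series = df[column]
    return {
        "values": series.dropna().unique().tolist(),  # order of first appearance
        "shape": series.shape,
        "nulls": int(series.isnull().sum()),
    }

# Hypothetical data; with the real file, pass train_data instead.
toy = pd.DataFrame({"Survived": [0, 1, 1, 0], "Pclass": [3, 1, 3, 2]})
for col in toy.columns:
    print(col, summarize_column(toy, col))
```

Reporting the null count separately keeps the list of distinct values free of `nan` entries.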
### pclass - ticket class
```python
pclass = train_data['Pclass']
print('Unique values: ', set(pclass))
print('shape: ', pclass.shape)
```
### name - passenger name
```python
name = train_data['Name']
print('Unique values: ', set(name))
print('shape: ', name.shape)
```
### sex - sex
```python
sex = train_data['Sex']
print('Unique values: ', set(sex))
print('shape: ', sex.shape)
```
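`set()` shows which values exist but not how often each one occurs. For a categorical column, `value_counts` gives the frequencies, and `dropna=False` also counts missing entries. A minimal sketch on a hypothetical series (the missing entry is added only to show the behavior):

```python
import numpy as np
import pandas as pd

# Hypothetical values; the NaN is included only to illustrate dropna=False.
sex = pd.Series(["male", "female", "female", "male", "male", np.nan], name="Sex")

counts = sex.value_counts(dropna=False)  # frequency of each value, NaN included
print(counts)
```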
### age - age
```python
age = train_data['Age']
print('Unique values: ', set(age))
print('shape: ', age.shape)
```
### The age column contains NaN values, so let's fill them in
- Replace them with the mean age
```python
# mean() skips NaN; int() truncates the mean to a whole number
mean_age = int(age.mean())
print('Mean age: ', mean_age)
age = age.fillna(mean_age)
print(age)
```
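The mean-fill step can be checked on a small series where the result is easy to verify by hand. This sketch uses hypothetical ages; it also records which rows were originally missing, an optional extra that is not part of the steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0], name="Age")

was_missing = age.isnull()        # remember which rows were filled (optional)
mean_age = int(age.mean())        # mean() ignores NaN: 149 / 4 = 37.25, truncated to 37
filled = age.fillna(mean_age)     # fillna returns a new Series; age itself is unchanged

print("missing before:", int(age.isnull().sum()))
print("missing after: ", int(filled.isnull().sum()))
```

Note that `int()` truncates rather than rounds, so the filled value is slightly below the true mean.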
### Visualizing age
```python
min_age = age.min()
max_age = age.max()

# scatter plot of every passenger's age against the row index
fig = plt.figure(figsize=(15, 15))
plt.title('Age per passenger (NaN filled with the mean age, 29)')
plt.ylim(min_age, max_age)
plt.scatter(range(len(age)), age)
plt.xlabel('passenger index')
plt.ylabel('age')
plt.show()
```
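A scatter plot against the row index mostly shows a horizontal band at the filled-in mean. A histogram is one alternative that shows how the ages are distributed. This is a sketch on randomly generated, hypothetical ages; the 10-year bin width is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical ages between 1 and 79; replace with the real age column.
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(1, 80, size=200).astype(float), name="Age")

fig, ax = plt.subplots(figsize=(8, 5))
# 10-year bins from 0 to 80; hist returns the per-bin counts and the bin edges.
counts, bin_edges, _ = ax.hist(age, bins=range(0, 90, 10), edgecolor="black")
ax.set_title("Age distribution")
ax.set_xlabel("age")
ax.set_ylabel("number of passengers")
plt.close(fig)  # in a notebook, call plt.show() instead
```

With the mean-filled column, the bin containing the mean will stand out, which makes the effect of the imputation visible.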
### sibsp - siblings / spouses aboard
```python
sibsp = train_data['SibSp']
print('Unique values: ', set(sibsp))
print('shape: ', sibsp.shape)
```
### parch - parents / children aboard
```python
parch = train_data['Parch']
print('Unique values: ', set(parch))
print('shape: ', parch.shape)
```
### ticket - ticket number
```python
ticket = train_data['Ticket']
print('Unique values: ', set(ticket))
print('shape: ', ticket.shape)
```
### fare - passenger fare
```python
fare = train_data['Fare']
print('Unique values: ', set(fare))
print('shape: ', fare.shape)
```
### cabin - cabin number
```python
cabin = train_data['Cabin']
print('Unique values: ', set(cabin))
print('shape: ', cabin.shape)
```
### embarked - port of embarkation
```python
embarked = train_data['Embarked']
print('Unique values: ', set(embarked))
print('shape: ', embarked.shape)
```
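If `set(embarked)` shows a `nan` entry next to the port codes, the mean-fill used for age does not apply, because the column is categorical. One common alternative is to fill with the most frequent value. This is a sketch on a hypothetical series, not a step taken in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with one missing entry.
embarked = pd.Series(["S", "C", np.nan, "S", "Q", "S"], name="Embarked")

most_common = embarked.mode()[0]              # mode() ignores NaN by default
embarked_filled = embarked.fillna(most_common)

print("most common port:", most_common)
print(embarked_filled.tolist())
```

Whether the mode is a sensible fill depends on how many values are missing; for a column that is mostly empty, it would hide more than it reveals.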