Kaggle Competition - Getting Started! Dataset: Titanic

Trying out a Kaggle "Getting Started" competition

Titanic - Machine Learning from Disaster

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

# Import libraries

```python
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Folder that holds the Kaggle Titanic CSV files
data_folder = "datasets/titanic"
print(os.listdir(data_folder))

train_data = pd.read_csv(os.path.join(data_folder, 'train.csv'))
test_data = pd.read_csv(os.path.join(data_folder, 'test.csv'))
train_data.head()
test_data.head()
```

# Data description

| Column | Definition | Key |
|---|---|---|
| survival | Survival | No = 0, Yes = 1 |
| pclass | Ticket class | 1st = 1, 2nd = 2, 3rd = 3 |
| sex | Sex | male or female |
| age | Age in years | 10, 20, 30 ... |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
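
The keys in the data description can be turned into readable labels with `Series.map`. The following is a minimal sketch under stated assumptions: the `sample` frame and the new column names `PclassLabel` and `EmbarkedPort` are hypothetical and only illustrate the lookup. They are not part of the competition data.

```python
import pandas as pd

# Lookup tables taken from the data description above.
pclass_names = {1: "1st", 2: "2nd", 3: "3rd"}
embarked_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Hypothetical rows standing in for train_data.
sample = pd.DataFrame({"Pclass": [3, 1, 2], "Embarked": ["S", "C", "Q"]})

# map() replaces each code with its label; unknown codes would become NaN.
sample["PclassLabel"] = sample["Pclass"].map(pclass_names)
sample["EmbarkedPort"] = sample["Embarked"].map(embarked_names)
print(sample)
```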

# Checking the data for null values

```python
train_data.keys()
```
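
`keys()` only lists the column names. One way to see where values are missing is `isnull().sum()`, sketched here on a small hypothetical frame (`sample` and its values are invented for illustration). The same call should work on `train_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data; values are invented for illustration.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks missing cells as True; sum() counts them per column.
null_counts = sample.isnull().sum()
print(null_counts)

# Keep only the columns that actually contain missing values.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```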

### passengerId - passenger ID

```python
passengerId = train_data['PassengerId']

print('Unique values: ', set(passengerId))
print('shape: ', passengerId.shape)
```

### survived - survival status

```python
survived = train_data['Survived']

print('Unique values: ', set(survived))
print('shape: ', survived.shape)
```

### Pclass - ticket class

```python
pclass = train_data['Pclass']

print('Unique values: ', set(pclass))
print('shape: ', pclass.shape)
```

### name - passenger name

```python
name = train_data['Name']

print('Unique values: ', set(name))
print('shape: ', name.shape)
```

### sex - sex

```python
sex = train_data['Sex']

print('Unique values: ', set(sex))
print('shape: ', sex.shape)
```

### age - age

```python
age = train_data['Age']

print('Unique values: ', set(age))
print('shape: ', age.shape)
```

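### age contains NaN values, so let's fill them in
- Replace them with the mean age

<hr>

```python
# int() truncates the mean to a whole number of years
mean_age = int(age.mean())

print('Mean age: ', mean_age)

# Replace every NaN with the truncated mean
age = age.fillna(mean_age)

print(age)
```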
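
To check that the fill did what was intended, it helps to record which rows were missing beforehand and confirm none remain afterwards. This is a minimal sketch on hypothetical data: `age_sample`, `was_missing`, `mean_age_sample`, and `filled` are illustrative names, not from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for train_data['Age'].
age_sample = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0])

# Remember which rows were missing before filling.
was_missing = age_sample.isnull()

# mean() skips NaN by default; int() truncates, as in the step above.
mean_age_sample = int(age_sample.mean())

filled = age_sample.fillna(mean_age_sample)

print('mean used for filling:', mean_age_sample)
print('missing before:', int(was_missing.sum()))
print('missing after:', int(filled.isnull().sum()))
```

Keeping `was_missing` around also makes it possible to treat imputed rows differently later, for example by adding it as an indicator column.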

### Visualizing age

```python
min_age = age.min()
max_age = age.max()

fig = plt.figure(figsize=(15, 15))
plt.title('Age per passenger (NaN filled with the mean age, 29)')
plt.ylim(min_age, max_age)
# x axis: row index of each passenger, y axis: age
plt.scatter(range(len(age)), age)
plt.xlabel('passenger index')
plt.ylabel('age')
plt.show()
```
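
A scatter plot against the row index mainly shows a horizontal band at the filled-in mean. A histogram is one alternative view of the same column. The sketch below uses hypothetical ages (`age_sample` is invented, with 29 repeated to mimic filled rows) and assumes ten-year bins, which is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical ages; 29 is repeated to mimic rows filled with the mean.
age_sample = pd.Series([2, 18, 22, 29, 29, 29, 29, 35, 47, 63], dtype=float)

# Ten-year-wide bins with edges 0, 10, ..., 70.
bins = np.arange(0, 80, 10)

fig, ax = plt.subplots(figsize=(8, 4))
# hist() returns the per-bin counts and the bin edges it used.
counts, edges, _ = ax.hist(age_sample, bins=bins, edgecolor='black')
ax.set_title('Age distribution (sample data)')
ax.set_xlabel('age')
ax.set_ylabel('number of passengers')

print(counts)
```

In a notebook the figure is displayed automatically. In a script, call `plt.show()` afterwards.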


### sibsp - siblings / spouses aboard
<hr>

```python
sibsp = train_data['SibSp']

print('Unique values: ', set(sibsp))
print('shape: ', sibsp.shape)
```

### parch - parents / children aboard

```python
parch = train_data['Parch']

print('Unique values: ', set(parch))
print('shape: ', parch.shape)
```

### ticket - ticket number

```python
ticket = train_data['Ticket']

print('Unique values: ', set(ticket))
print('shape: ', ticket.shape)
```

### fare - passenger fare

```python
fare = train_data['Fare']

print('Unique values: ', set(fare))
print('shape: ', fare.shape)
```


### cabin - cabin number
<hr>

```python
cabin = train_data['Cabin']

print('Unique values: ', set(cabin))
print('shape: ', cabin.shape)
```

### embarked - port of embarkation

```python
embarked = train_data['Embarked']

print('Unique values: ', set(embarked))
print('shape: ', embarked.shape)
```
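
The per-column checks above repeat the same two `print` calls. They can also be collected into one table. Note that `set()` on a column containing NaN may list `nan` more than once, because NaN does not compare equal to itself, whereas `nunique()` ignores missing values by default. The sketch below is a hedged illustration on a hypothetical frame: `sample` and `summary` are invented names and values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data with invented values.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Embarked": ["S", "C", "S", None],
})

# One row per column: dtype, distinct non-null values, missing count.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "n_missing": sample.isnull().sum(),
})
print(summary)
```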
