01. Analysis Seoul CCTV¶

1. Reading the data¶

In [1]:

import pandas as pd

In [2]:

CCTV_Seoul = pd.read_csv("../data/01. Seoul_CCTV.csv")

In [3]:

CCTV_Seoul.head()

Out[3]:

	기관명	소계	2013년도 이전	2014년	2015년	2016년
0	강남구	3238	1292	430	584	932
1	강동구	1010	379	99	155	377
2	강북구	831	369	120	138	204
3	강서구	911	388	258	184	81
4	관악구	2109	846	260	390	613

In [4]:

CCTV_Seoul.columns[0]

Out[4]:

'기관명'

In [5]:

CCTV_Seoul.rename(columns={CCTV_Seoul.columns[0]:"구별"}, inplace=True)

In [6]:

CCTV_Seoul.head()

Out[6]:

	구별	소계	2013년도 이전	2014년	2015년	2016년
0	강남구	3238	1292	430	584	932
1	강동구	1010	379	99	155	377
2	강북구	831	369	120	138	204
3	강서구	911	388	258	184	81
4	관악구	2109	846	260	390	613

In [7]:

pop_Seoul = pd.read_excel("../data/01. Seoul_Population.xls", header = 2, usecols = "B, D, G, J, N")

In [8]:

pop_Seoul.head()

Out[8]:

	자치구	계	계.1	계.2	65세이상고령자
0	합계	10124579	9857426	267153	1365126
1	종로구	164257	154770	9487	26182
2	중구	134593	125709	8884	21384
3	용산구	244444	229161	15283	36882
4	성동구	312711	304808	7903	41273

In [9]:

pop_Seoul.rename(
    columns={
        pop_Seoul.columns[0]:"구별",
        pop_Seoul.columns[1]:"인구수",
        pop_Seoul.columns[2]:"한국인",
        pop_Seoul.columns[3]:"외국인",
        pop_Seoul.columns[4]:"고령자",
    }, 
    inplace=True)

pop_Seoul.head()

Out[9]:

	구별	인구수	한국인	외국인	고령자
0	합계	10124579	9857426	267153	1365126
1	종로구	164257	154770	9487	26182
2	중구	134593	125709	8884	21384
3	용산구	244444	229161	15283	36882
4	성동구	312711	304808	7903	41273

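Pandas basics¶

A Python module that offers data-handling capabilities on par with R
Highly efficient within a single process
Can be thought of as a programmable, extensible Excel
Some describe it as Excel on steroids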

Series¶

Consists of an index and values
Holds a single data type

In [10]:

import pandas as pd
import numpy as np

In [11]:

pd.Series([1,2,3,"4"], dtype = np.float32)

Out[11]:

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float32

In [12]:

pd.Series(np.array([1,2,3,4]))

Out[12]:

0    1
1    2
2    3
3    4
dtype: int32

In [13]:

data = pd.Series(np.array([1,2,3,4]))

In [14]:

data % 2

Out[14]:

0    1
1    0
2    1
3    0
dtype: int32

Date data¶

In [15]:

dates = pd.date_range("20210101", periods=6)

DataFrame¶

pd.Series()
- index, value
pd.DataFrame()
- index, value, column

In [16]:

# generate random numbers sampled from the standard normal distribution
data = np.random.randn(6,4)
data

Out[16]:

array([[-0.9983232 , -0.5177093 , -0.0634106 ,  1.72301642],
       [-0.0445168 ,  0.34042433, -0.32321834,  0.60230435],
       [-0.28364106,  0.69408974, -0.86368064, -1.07993789],
       [ 0.01567244,  1.33494352, -0.04903035, -0.00729677],
       [ 3.14661551, -1.69042274,  0.35157791,  1.19959817],
       [ 0.40880637,  1.46020908,  1.83696627, -0.39166071]])

In [17]:

df = pd.DataFrame(data, index=dates, columns=["A","B","C","D"])

Exploring the data¶

In [18]:

df.tail()

Out[18]:

	A	B	C	D
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-05	3.146616	-1.690423	0.351578	1.199598
2021-01-06	0.408806	1.460209	1.836966	-0.391661

In [19]:

df.index

Out[19]:

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [20]:

df.columns

Out[20]:

Index(['A', 'B', 'C', 'D'], dtype='object')

In [21]:

df.values

Out[21]:

array([[-0.9983232 , -0.5177093 , -0.0634106 ,  1.72301642],
       [-0.0445168 ,  0.34042433, -0.32321834,  0.60230435],
       [-0.28364106,  0.69408974, -0.86368064, -1.07993789],
       [ 0.01567244,  1.33494352, -0.04903035, -0.00729677],
       [ 3.14661551, -1.69042274,  0.35157791,  1.19959817],
       [ 0.40880637,  1.46020908,  1.83696627, -0.39166071]])

df.info(): check basic information about the DataFrame

In [22]:

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2021-01-01 to 2021-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes

df.describe(): check descriptive statistics for the DataFrame
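As a minimal sketch of what `describe()` returns, the following uses a hypothetical stand-in frame (`demo`, built with a seeded generator, not the notebook's `df`), so the numbers will differ from the outputs shown elsewhere in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df: 6 dates x 4 columns of random normals
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.standard_normal((6, 4)),
    index=pd.date_range("20210101", periods=6),
    columns=["A", "B", "C", "D"],
)

# One row per statistic (count, mean, std, min, quartiles, max), one column per variable
summary = demo.describe()
print(summary)
```

The result is itself a DataFrame, so a single statistic can be picked out with `summary.loc["mean", "A"]`.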

Sorting data¶

sort_values()
Sorts the data by a given column

In [23]:

df

Out[23]:

	A	B	C	D
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-05	3.146616	-1.690423	0.351578	1.199598
2021-01-06	0.408806	1.460209	1.836966	-0.391661

In [24]:

df.sort_values(by="B", ascending = False, inplace = True)
df

Out[24]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

Selecting data¶

In [25]:

df

Out[25]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

In [26]:

# select a single column
df["A"]

Out[26]:

2021-01-06    0.408806
2021-01-04    0.015672
2021-01-03   -0.283641
2021-01-02   -0.044517
2021-01-01   -0.998323
2021-01-05    3.146616
Name: A, dtype: float64

In [27]:

type(df["A"])

Out[27]:

pandas.core.series.Series

In [28]:

df.A

Out[28]:

2021-01-06    0.408806
2021-01-04    0.015672
2021-01-03   -0.283641
2021-01-02   -0.044517
2021-01-01   -0.998323
2021-01-05    3.146616
Name: A, dtype: float64

In [29]:

df[["A","B"]]

Out[29]:

	A	B
2021-01-06	0.408806	1.460209
2021-01-04	0.015672	1.334944
2021-01-03	-0.283641	0.694090
2021-01-02	-0.044517	0.340424
2021-01-01	-0.998323	-0.517709
2021-01-05	3.146616	-1.690423

offset index¶

[n:m]: selects positions n through m-1
When slicing by index labels or column names, the end is included
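The difference between the two slicing rules is easy to miss, so here is a small sketch on a hypothetical five-row frame (`demo` is an illustrative name, and its index is kept in date order). It uses `iloc` and `loc` to make the positional and label-based cases explicit.

```python
import pandas as pd

# Hypothetical frame with a sorted daily DatetimeIndex
demo = pd.DataFrame(
    {"A": [10, 20, 30, 40, 50]},
    index=pd.date_range("20210101", periods=5),
)

# Positional slice: positions 0, 1, 2 (position 3 is excluded)
by_position = demo.iloc[0:3]

# Label slice: 2021-01-01 through 2021-01-04, end label included
by_label = demo.loc["20210101":"20210104"]

print(len(by_position), len(by_label))
```

With the same start and end, the label slice returns one more row than the positional slice because its end point is inclusive.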

In [30]:

df

Out[30]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

In [31]:

df[0:3]

Out[31]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938

In [32]:

df["20210101":"20210104"]

Out[32]:

	A	B	C	D
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016

loc: location
- selects specific rows and columns by index label

In [33]:

df.loc[:,["A","B"]]

Out[33]:

	A	B
2021-01-06	0.408806	1.460209
2021-01-04	0.015672	1.334944
2021-01-03	-0.283641	0.694090
2021-01-02	-0.044517	0.340424
2021-01-01	-0.998323	-0.517709
2021-01-05	3.146616	-1.690423

In [34]:

df.loc["20210102":"20210104",["A","D"]]

Out[34]:

	A	D
2021-01-04	0.015672	-0.007297
2021-01-03	-0.283641	-1.079938
2021-01-02	-0.044517	0.602304

In [35]:

df.loc["20210102":"20210104","A":"D"]

Out[35]:

	A	B	C	D
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304

In [36]:

df.loc["20210102", ["A","B"]]

Out[36]:

A   -0.044517
B    0.340424
Name: 2021-01-02 00:00:00, dtype: float64

iloc: integer location
- selects by integer position (0-based), regardless of the index labels

In [37]:

df

Out[37]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

In [38]:

df.iloc[3]

Out[38]:

A   -0.044517
B    0.340424
C   -0.323218
D    0.602304
Name: 2021-01-02 00:00:00, dtype: float64

In [39]:

df.iloc[:,2]

Out[39]:

2021-01-06    1.836966
2021-01-04   -0.049030
2021-01-03   -0.863681
2021-01-02   -0.323218
2021-01-01   -0.063411
2021-01-05    0.351578
Name: C, dtype: float64

In [40]:

df.iloc[2:5, 0:2]

Out[40]:

	A	B
2021-01-03	-0.283641	0.694090
2021-01-02	-0.044517	0.340424
2021-01-01	-0.998323	-0.517709

In [41]:

df.iloc[[1,2,4],[0,2]]

Out[41]:

	A	C
2021-01-04	0.015672	-0.049030
2021-01-03	-0.283641	-0.863681
2021-01-01	-0.998323	-0.063411

condition¶

In [42]:

df

Out[42]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

In [43]:

# select only the positive values (greater than 0) in column A

df["A"]>0

Out[43]:

2021-01-06     True
2021-01-04     True
2021-01-03    False
2021-01-02    False
2021-01-01    False
2021-01-05     True
Name: A, dtype: bool

In [44]:

df[df["A"]>0]

Out[44]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-05	3.146616	-1.690423	0.351578	1.199598

Adding a column¶

In [45]:

df

Out[45]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

In [46]:

df["E"] = ["one","two","three","four","five","six"]
df

Out[46]:

	A	B	C	D	E
2021-01-06	0.408806	1.460209	1.836966	-0.391661	one
2021-01-04	0.015672	1.334944	-0.049030	-0.007297	two
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938	three
2021-01-02	-0.044517	0.340424	-0.323218	0.602304	four
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016	five
2021-01-05	3.146616	-1.690423	0.351578	1.199598	six

isin()
- checks whether each value is one of the given elements

In [47]:

df["E"].isin(["two","five","three"])

Out[47]:

2021-01-06    False
2021-01-04     True
2021-01-03     True
2021-01-02    False
2021-01-01     True
2021-01-05    False
Name: E, dtype: bool

In [48]:

df[df["E"].isin(["two","five","three"])]

Out[48]:

	A	B	C	D	E
2021-01-04	0.015672	1.334944	-0.049030	-0.007297	two
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938	three
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016	five

Removing a column¶

del
drop

In [49]:

del df["E"]
df

Out[49]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

In [50]:

df.drop(["D"], axis = 1) # axis=0: rows, axis=1: columns

Out[50]:

	A	B	C
2021-01-06	0.408806	1.460209	1.836966
2021-01-04	0.015672	1.334944	-0.049030
2021-01-03	-0.283641	0.694090	-0.863681
2021-01-02	-0.044517	0.340424	-0.323218
2021-01-01	-0.998323	-0.517709	-0.063411
2021-01-05	3.146616	-1.690423	0.351578

In [51]:

df.drop(["20210104"])

Out[51]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

apply()¶

In [52]:

df

Out[52]:

	A	B	C	D
2021-01-06	0.408806	1.460209	1.836966	-0.391661
2021-01-04	0.015672	1.334944	-0.049030	-0.007297
2021-01-03	-0.283641	0.694090	-0.863681	-1.079938
2021-01-02	-0.044517	0.340424	-0.323218	0.602304
2021-01-01	-0.998323	-0.517709	-0.063411	1.723016
2021-01-05	3.146616	-1.690423	0.351578	1.199598

In [53]:

df["A"].apply("sum")

Out[53]:

2.2446132631353537

In [54]:

df["A"].apply("mean")

Out[54]:

0.37410221052255893

In [55]:

df[["A","D"]].apply("sum")

Out[55]:

A    2.244613
D    2.046024
dtype: float64

In [56]:

df[["A","D"]].apply(np.sum)

Out[56]:

A    2.244613
D    2.046024
dtype: float64

In [57]:

df[["A","D"]].apply(np.std)

Out[57]:

A    1.310720
D    0.948034
dtype: float64

In [58]:

def plusminus(num):
    return "plus" if num > 0 else "minus"

In [59]:

df["A"].apply(plusminus)

Out[59]:

2021-01-06     plus
2021-01-04     plus
2021-01-03    minus
2021-01-02    minus
2021-01-01    minus
2021-01-05     plus
Name: A, dtype: object

In [60]:

df["A"].apply(lambda num:"plus" if num>0 else "minus")

Out[60]:

2021-01-06     plus
2021-01-04     plus
2021-01-03    minus
2021-01-02    minus
2021-01-01    minus
2021-01-05     plus
Name: A, dtype: object

2. Exploring the CCTV data¶

In [61]:

CCTV_Seoul.head()

Out[61]:

	구별	소계	2013년도 이전	2014년	2015년	2016년
0	강남구	3238	1292	430	584	932
1	강동구	1010	379	99	155	377
2	강북구	831	369	120	138	204
3	강서구	911	388	258	184	81
4	관악구	2109	846	260	390	613

In [62]:

CCTV_Seoul.sort_values(by="소계",ascending=True).head()

Out[62]:

	구별	소계	2013년도 이전	2014년	2015년	2016년
9	도봉구	825	238	159	42	386
2	강북구	831	369	120	138	204
5	광진구	878	573	78	53	174
3	강서구	911	388	258	184	81
24	중랑구	916	509	121	177	109

In [63]:

CCTV_Seoul.sort_values(by="소계",ascending=False).head()

Out[63]:

	구별	소계	2013년도 이전	2014년	2015년	2016년
0	강남구	3238	1292	430	584	932
18	양천구	2482	1843	142	30	467
14	서초구	2297	1406	157	336	398
4	관악구	2109	846	260	390	613
21	은평구	2108	1138	224	278	468

In [64]:

# adds the column if it does not exist; overwrites it if it does
CCTV_Seoul["최근증가율"] = (
    (CCTV_Seoul["2016년"] + CCTV_Seoul["2015년"] + CCTV_Seoul["2014년"]) / CCTV_Seoul["2013년도 이전"] * 100
)

CCTV_Seoul.sort_values(by="최근증가율",ascending=False).head()

Out[64]:

	구별	소계	2013년도 이전	2014년	2015년	2016년	최근증가율
22	종로구	1619	464	314	211	630	248.922414
9	도봉구	825	238	159	42	386	246.638655
12	마포구	980	314	118	169	379	212.101911
8	노원구	1566	542	57	451	516	188.929889
1	강동구	1010	379	99	155	377	166.490765

3. Exploring the population data¶

In [65]:

pop_Seoul.head()

Out[65]:

	구별	인구수	한국인	외국인	고령자
0	합계	10124579	9857426	267153	1365126
1	종로구	164257	154770	9487	26182
2	중구	134593	125709	8884	21384
3	용산구	244444	229161	15283	36882
4	성동구	312711	304808	7903	41273

In [66]:

pop_Seoul.drop([0],axis=0, inplace = True)
pop_Seoul.head()

Out[66]:

	구별	인구수	한국인	외국인	고령자
1	종로구	164257	154770	9487	26182
2	중구	134593	125709	8884	21384
3	용산구	244444	229161	15283	36882
4	성동구	312711	304808	7903	41273
5	광진구	372298	357703	14595	43953

In [67]:

pop_Seoul["구별"].unique()

Out[67]:

array(['종로구', '중구', '용산구', '성동구', '광진구', '동대문구', '중랑구', '성북구', '강북구',
       '도봉구', '노원구', '은평구', '서대문구', '마포구', '양천구', '강서구', '구로구', '금천구',
       '영등포구', '동작구', '관악구', '서초구', '강남구', '송파구', '강동구'], dtype=object)

In [68]:

len(pop_Seoul["구별"].unique())

Out[68]:

25

In [69]:

# foreign-resident ratio and elderly ratio
pop_Seoul["외국인비율"] = pop_Seoul["외국인"] / pop_Seoul["인구수"] * 100
pop_Seoul["고령자비율"] = pop_Seoul["고령자"] / pop_Seoul["인구수"] * 100
pop_Seoul.head()

Out[69]:

	구별	인구수	한국인	외국인	고령자	외국인비율	고령자비율
1	종로구	164257	154770	9487	26182	5.775705	15.939656
2	중구	134593	125709	8884	21384	6.600640	15.887899
3	용산구	244444	229161	15283	36882	6.252148	15.088118
4	성동구	312711	304808	7903	41273	2.527254	13.198448
5	광진구	372298	357703	14595	43953	3.920247	11.805865

In [70]:

pop_Seoul.sort_values(by="인구수", ascending=False).head(5)

Out[70]:

	구별	인구수	한국인	외국인	고령자	외국인비율	고령자비율
24	송파구	671173	664496	6677	76582	0.994825	11.410173
16	강서구	608255	601691	6564	76032	1.079153	12.500021
23	강남구	561052	556164	4888	65060	0.871220	11.596073
11	노원구	558075	554403	3672	74243	0.657976	13.303409
21	관악구	520929	503297	17632	70046	3.384722	13.446362

In [71]:

pop_Seoul.sort_values(["외국인비율"], ascending=False).head(5)

Out[71]:

	구별	인구수	한국인	외국인	고령자	외국인비율	고령자비율
19	영등포구	402024	368550	33474	53981	8.326369	13.427308
18	금천구	253491	235154	18337	34170	7.233787	13.479769
17	구로구	441559	410742	30817	58794	6.979135	13.315095
2	중구	134593	125709	8884	21384	6.600640	15.887899
3	용산구	244444	229161	15283	36882	6.252148	15.088118

4. Combining the two datasets¶

Ways to combine DataFrames in pandas¶

pd.concat()
pd.merge()
DataFrame.join()
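Only `pd.merge()` is used in the rest of this post, so as a rough sketch of the other two, here is a toy example. The frames `top`, `bottom`, and `extra` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy frames
top = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
bottom = pd.DataFrame({"key": ["K2", "K3"], "A": ["A2", "A3"]})

# concat: stack frames along an axis (rows by default)
stacked = pd.concat([top, bottom], ignore_index=True)

# join: a DataFrame method that aligns on the index (left join by default)
extra = pd.DataFrame({"C": ["C0", "C2"]}, index=["K0", "K2"])
joined = stacked.set_index("key").join(extra)

print(stacked)
print(joined)
```

Rows of `joined` whose key has no match in `extra` get NaN in column `C`.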

In [73]:

# a dict of lists: data given column by column

left = pd.DataFrame({
    "key":["K0","K4","K2","K3"],
    "A":["A0","A1","A2","A3"],
    "B":["B0","B1","B2","B3"]
})
left

Out[73]:

	key	A	B
0	K0	A0	B0
1	K4	A1	B1
2	K2	A2	B2
3	K3	A3	B3

In [74]:

# a list of dicts: data given row by row

right = pd.DataFrame([
    {"key":"K0","C":"C0","D":"D0"},
    {"key":"K1","C":"C1","D":"D1"},
    {"key":"K2","C":"C2","D":"D2"},
    {"key":"K3","C":"C3","D":"D3"},
])
right

Out[74]:

	key	C	D
0	K0	C0	D0
1	K1	C1	D1
2	K2	C2	D2
3	K3	C3	D3

pd.merge()¶

Merges two DataFrames on a column or an index
The column or index used as the basis is called the key
The key must be present in both DataFrames
how="inner" is the default
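To make the effect of `how` concrete, the sketch below compares three settings on reduced versions of the `left` and `right` frames (only one value column each, an assumption made to keep the output short).

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K4", "K2", "K3"], "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"], "C": ["C0", "C1", "C2", "C3"]})

inner = pd.merge(left, right, on="key")                    # default: keys present in both
outer = pd.merge(left, right, on="key", how="outer")       # union of keys
right_join = pd.merge(left, right, on="key", how="right")  # every key from right

print(len(inner), len(outer), len(right_join))
```

K4 exists only in `left` and K1 only in `right`, so the inner merge drops both, the outer merge keeps both with NaN fill, and the right merge keeps K1 but not K4.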

In [75]:

pd.merge(left, right, how="left", on="key")

Out[75]:

	key	A	B	C	D
0	K0	A0	B0	C0	D0
1	K4	A1	B1	NaN	NaN
2	K2	A2	B2	C2	D2
3	K3	A3	B3	C3	D3

In [76]:

CCTV_Seoul.head(1)

Out[76]:

	구별	소계	2013년도 이전	2014년	2015년	2016년	최근증가율
0	강남구	3238	1292	430	584	932	150.619195

In [77]:

pop_Seoul.head(1)

Out[77]:

	구별	인구수	한국인	외국인	고령자	외국인비율	고령자비율
1	종로구	164257	154770	9487	26182	5.775705	15.939656

In [78]:

data_result = pd.merge(CCTV_Seoul, pop_Seoul, on = "구별")
data_result.head()

Out[78]:

	구별	소계	2013년도 이전	2014년	2015년	2016년	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
0	강남구	3238	1292	430	584	932	150.619195	561052	556164	4888	65060	0.871220	11.596073
1	강동구	1010	379	99	155	377	166.490765	440359	436223	4136	56161	0.939234	12.753458
2	강북구	831	369	120	138	204	125.203252	328002	324479	3523	56530	1.074079	17.234651
3	강서구	911	388	258	184	81	134.793814	608255	601691	6564	76032	1.079153	12.500021
4	관악구	2109	846	260	390	613	149.290780	520929	503297	17632	70046	3.384722	13.446362

Dropping the per-year columns¶

del
drop()

In [79]:

del data_result["2013년도 이전"]
del data_result["2014년"]
data_result.drop(["2015년", "2016년"], axis = 1, inplace = True)

In [80]:

data_result.head()

Out[80]:

	구별	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
0	강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073
1	강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458
2	강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651
3	강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021
4	관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362

Changing the index¶

set_index()
Sets the selected column as the DataFrame's index

In [81]:

data_result.set_index("구별", inplace = True)
data_result.head()

Out[81]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458
강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021
관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362

Correlation coefficients¶

corr
Short for correlation
Compare the columns whose correlation coefficient is 0.2 or higher
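One possible way to apply the 0.2 cutoff programmatically is sketched below on a hypothetical toy frame (`demo`, with made-up columns `x`, `y`, `z`, `w`). Taking the absolute value is an assumption added here so that strong negative correlations are kept as well.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "z": [5, 3, 4, 1, 2],    # negatively correlated with x
    "w": [1, -1, 0, -1, 1],  # uncorrelated with x
})

# Correlation of every other column with x
corr_x = demo.corr()["x"].drop("x")

# Keep only the columns at or above the 0.2 threshold (in absolute value)
strong = corr_x[corr_x.abs() >= 0.2]
print(strong)
```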

In [82]:

data_result.corr()

Out[82]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
소계	1.000000	-0.264378	0.232555	0.227852	0.030421	0.163905	-0.045956	-0.267841
최근증가율	-0.264378	1.000000	-0.097165	-0.086341	-0.156421	-0.072251	-0.047102	0.190396
인구수	0.232555	-0.097165	1.000000	0.998151	-0.167243	0.936737	-0.601076	-0.637414
한국인	0.227852	-0.086341	0.998151	1.000000	-0.226853	0.936155	-0.645463	-0.628360
외국인	0.030421	-0.156421	-0.167243	-0.226853	1.000000	-0.175318	0.838612	-0.021147
고령자	0.163905	-0.072251	0.936737	0.936155	-0.175318	1.000000	-0.620300	-0.348840
외국인비율	-0.045956	-0.047102	-0.601076	-0.645463	0.838612	-0.620300	1.000000	0.242816
고령자비율	-0.267841	0.190396	-0.637414	-0.628360	-0.021147	-0.348840	0.242816	1.000000

In [83]:

data_result["CCTV비율"] = data_result["소계"] / data_result["인구수"] * 100
data_result.head()

Out[83]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073	0.577130
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458	0.229358
강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651	0.253352
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773
관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362	0.404854

In [84]:

data_result.sort_values(by="CCTV비율",ascending=True).head()

Out[84]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율
구별
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773
송파구	1081	104.347826	671173	664496	6677	76582	0.994825	11.410173	0.161061
중랑구	916	79.960707	412780	408226	4554	59262	1.103251	14.356800	0.221910
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458	0.229358
광진구	878	53.228621	372298	357703	14595	43953	3.920247	11.805865	0.235833

matplotlib basics¶

In [85]:

import matplotlib.pyplot as plt
from matplotlib import rc

rc("font", family = "Malgun Gothic")
rc('axes',unicode_minus=False)
%matplotlib inline

Basic structure of a matplotlib plot¶

plt.figure(figsize = (10,6)) # prepare the canvas
plt.plot(x, y)
plt.show()

In [86]:

plt.figure(figsize = (10, 6))
plt.plot([0,1,2,3,4,5,6,7,8,9],[1,1,2,3,4,2,3,5,-1,3])
plt.show()

Example 1: plot basics¶

Plotting trigonometric functions¶

np.arange(a, b, s): values from a up to (but not including) b, in steps of s
np.sin(value)
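A quick check of the half-open interval, using a coarser step than the notebook so the whole array fits on one line (the values here are illustrative only):

```python
import numpy as np

t = np.arange(0, 1, 0.25)  # start included, stop excluded
y = np.sin(t)              # element-wise sine, same shape as t

print(t)
print(y.shape)
```

The stop value 1 does not appear in `t`; the last element is 0.75.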

In [87]:

import numpy as np

t = np.arange(0, 12, 0.01)
y = np.sin(t)

In [88]:

plt.figure(figsize = (10,6))
plt.plot(t, np.sin(t))
plt.plot(t, np.cos(t))
plt.show()

1. Add a grid
2. Add a plot title
3. Add x-axis and y-axis labels
4. Add a legend to tell the orange and blue lines apart

In [89]:

def drawGraph():

    plt.figure(figsize = (10,6))
    plt.plot(t, np.sin(t), label="sin")
    plt.plot(t, np.cos(t), label ="cos")
    plt.grid()
    plt.legend(loc=1)
    plt.title("Example of sinwave")
    plt.xlabel("time")
    plt.ylabel("Amplitude") # amplitude
    plt.show()

In [90]:

drawGraph()

Example 2: customizing a plot¶

In [93]:

t = np.arange(0, 5, 0.5)
t

Out[93]:

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

In [100]:

plt.figure(figsize=(10,6))
plt.plot(t, t, "r--") # red dashed line
plt.plot(t, t**2, "bs") # blue squares
plt.plot(t, t**3, "g>") # green right-pointing triangles
plt.show()

In [101]:

#t = [0, 1, 2, 3, 4, 5, 6]
t = list(range(0, 7))
y = [1, 4, 5, 8, 9, 5, 3]

In [105]:

def drawGraph():

    plt.figure(figsize=(10, 6))
    plt.plot(
        t, y,
        color = "green",
        linestyle = "dashed",
        marker = "o",
        markerfacecolor="blue",
        markersize=15,
    )
    plt.xlim([-0.5, 6.5]) # set the x-axis range
    plt.ylim([0.5, 9.5]) # set the y-axis range
    plt.show()

drawGraph()

예제3: scatter plot¶

In [107]:

t = np.array(range(0,10))
y = np.array([9, 8, 7, 9, 8, 3, 2, 4, 3, 4])

In [111]:

def drawGraph():

    plt.figure(figsize = (10, 6))
    plt.scatter(t, y)
    plt.show()
drawGraph()

In [115]:

colormap = t

def drawGraph():

    plt.figure(figsize = (10, 6))
    plt.scatter(t, y, s=50, c=colormap, marker=">")
    plt.colorbar()
    plt.show()

drawGraph()

Example 4: plotting from pandas¶

pandas uses matplotlib under the hood

In [116]:

data_result.head()

Out[116]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073	0.577130
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458	0.229358
강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651	0.253352
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773
관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362	0.404854

In [117]:

data_result.plot()

Out[117]:

<AxesSubplot: xlabel='구별'>

In [127]:

data_result["인구수"].plot(kind="barh", grid=True, figsize=(10, 10));

5. Visualizing the data¶

In [129]:

import matplotlib.pyplot as plt
from matplotlib import rc

plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign from rendering as a broken glyph with the Korean font
rc("font",family="Malgun Gothic")
%matplotlib inline

In [130]:

data_result.head()

Out[130]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073	0.577130
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458	0.229358
강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651	0.253352
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773
관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362	0.404854

Visualizing the 소계 (subtotal) column¶

In [134]:

data_result["소계"].plot(kind="barh", grid=True, figsize=(10, 10));

In [137]:

def drawGraph():
    data_result["소계"].sort_values().plot(
        kind="barh", grid=True, figsize=(10, 10));
drawGraph()

In [138]:

data_result.head()

Out[138]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073	0.577130
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458	0.229358
강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651	0.253352
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773
관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362	0.404854

In [139]:

def drawGraph():
    data_result["CCTV비율"].sort_values().plot(
        kind="barh", grid=True, figsize=(10, 10));
drawGraph()

6. Showing the trend in the data¶

In [140]:

data_result.head()

Out[140]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073	0.577130
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458	0.229358
강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651	0.253352
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773
관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362	0.404854

Scatter plot of the 인구수 (population) and 소계 (subtotal) columns¶

In [142]:

def drawGraph():
    plt.figure(figsize=(14,10))
    plt.scatter(data_result["인구수"], data_result["소계"], s=50)
    plt.xlabel("인구수")
    plt.ylabel("CCTV")
    plt.grid(True)
    plt.show()
drawGraph()

Fitting a first-degree line with NumPy¶

np.polyfit(): computes the coefficients of the fitted line
np.poly1d(): turns the coefficients found by polyfit into a callable Python function
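To see the two functions working together without the Seoul data, here is a minimal sketch on hypothetical points that lie exactly on y = 2x + 1, so the fitted coefficients are easy to verify by eye.

```python
import numpy as np

# Hypothetical points on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

coeffs = np.polyfit(x, y, 1)  # degree 1 -> [slope, intercept], highest power first
line = np.poly1d(coeffs)      # callable polynomial built from the coefficients

print(coeffs)
print(line(4.0))              # evaluate the fitted line at x = 4
```

The coefficient order matters: the first element is the slope and the second is the intercept.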

In [143]:

import numpy as np

In [144]:

fpl = np.polyfit(data_result["인구수"], data_result["소계"], 1)
fpl

Out[144]:

array([1.11155868e-03, 1.06515745e+03])

In [145]:

f1 = np.poly1d(fpl)
f1

Out[145]:

poly1d([1.11155868e-03, 1.06515745e+03])

For a district with a population of 400,000, how many CCTVs would fit Seoul's overall trend?

In [149]:

f1(400000)

Out[149]:

1509.7809252413338

Generate x data for drawing the trend line
np.linspace(a, b, n): n evenly spaced values from a to b (both included)
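Unlike `np.arange`, `np.linspace` takes the number of points rather than a step, and it includes the end value. A small sketch with only four points (an illustrative choice; the notebook uses 100):

```python
import numpy as np

fx_demo = np.linspace(100000, 700000, 4)  # 4 points, both endpoints included
print(fx_demo)
```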

In [151]:

fx = np.linspace(100000, 700000, 100)

In [155]:

def drawGraph():
    plt.figure(figsize=(14,10))
    plt.scatter(data_result["인구수"], data_result["소계"], s=50)
    plt.plot(fx, f1(fx), ls="dashed", lw=3, color = "g")
    plt.xlabel("인구수")
    plt.ylabel("CCTV")
    plt.grid(True)
    plt.show()
drawGraph()

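7. Visualizing the data we want to highlight¶

Polishing the plot¶

Computing the error from the trend¶

Compute each district's error relative to the trend
The trend value comes from passing the district's population to f1
f1(data_result["인구수"])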
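The same idea on a hypothetical four-row frame is sketched below. The column names `population` and `cctv` and the values are made up for illustration. The error (residual) is the observed count minus the value the fitted line predicts for that population.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data_result
demo = pd.DataFrame(
    {"population": [100, 200, 300, 400], "cctv": [12, 18, 35, 39]},
    index=["a", "b", "c", "d"],
)

# Fit a straight line, then subtract its prediction from the observed values
trend = np.poly1d(np.polyfit(demo["population"], demo["cctv"], 1))
demo["residual"] = demo["cctv"] - trend(demo["population"])

# Row that sits furthest above the trend line
most_above = demo["residual"].idxmax()
print(demo)
print(most_above)
```

A positive residual means the row has more than the trend predicts, a negative one means fewer. For a least-squares line with an intercept, the residuals sum to zero up to floating-point error.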

In [156]:

fpl = np.polyfit(data_result["인구수"], data_result["소계"], 1)
f1 = np.poly1d(fpl)
fx = np.linspace(100000, 700000, 100)

In [157]:

data_result["오차"] = data_result["소계"] - f1(data_result["인구수"])

In [159]:

data_result.head()

Out[159]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율	오차
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073	0.577130	1549.200326
강동구	1010	166.490765	440359	436223	4136	56161	0.939234	12.753458	0.229358	-544.642322
강북구	831	125.203252	328002	324479	3523	56530	1.074079	17.234651	0.253352	-598.750923
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773	-830.268578
관악구	2109	149.290780	520929	503297	17632	70046	3.384722	13.446362	0.404854	464.799395

In [160]:

# sort by error to find the districts that deviate most from the trend

df_sort_f = data_result.sort_values(by="오차", ascending = False)  # descending
df_sort_t = data_result.sort_values(by="오차", ascending = True)  # ascending

In [161]:

# districts with more CCTVs than the trend predicts
df_sort_f.head()

Out[161]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율	오차
구별
강남구	3238	150.619195	561052	556164	4888	65060	0.871220	11.596073	0.577130	1549.200326
양천구	2482	34.671731	475018	471154	3864	55234	0.813443	11.627770	0.522507	888.832166
용산구	2096	53.216374	244444	229161	15283	36882	6.252148	15.088118	0.857456	759.128697
서초구	2297	63.371266	445401	441102	4299	53205	0.965198	11.945415	0.515715	736.753199
은평구	2108	85.237258	491202	486794	4408	74559	0.897390	15.178888	0.429151	496.842700

In [162]:

# districts with fewer CCTVs than the trend predicts
df_sort_t.head()

Out[162]:

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	CCTV비율	오차
구별
강서구	911	134.793814	608255	601691	6564	76032	1.079153	12.500021	0.149773	-830.268578
송파구	1081	104.347826	671173	664496	6677	76582	0.994825	11.410173	0.161061	-730.205628
도봉구	825	246.638655	346234	344166	2068	53488	0.597284	15.448512	0.238278	-625.016861
중랑구	916	79.960707	412780	408226	4554	59262	1.103251	14.356800	0.221910	-607.986645
광진구	878	53.228621	372298	357703	14595	43953	3.920247	11.805865	0.235833	-600.988527

In [164]:

from matplotlib.colors import ListedColormap

# define a custom (user-defined) colormap
color_step = ["#e74c3c","#2ecc71","#95a9a6","#2ecc71","#3498db","#3489db"]
my_cmap = ListedColormap(color_step)

plt.text(df_sort_f["인구수"][0] * 1.02, df_sort_f["소계"][0] * 0.98, df_sort_f.index[0], fontsize=15)

plt.text(x, y, "label", fontsize=size)

In [185]:

def drawGraph():
    plt.figure(figsize=(14,10))
    plt.scatter(data_result["인구수"], data_result["소계"], s=50, c=data_result["오차"], cmap=my_cmap)
    # c: the values that determine each point's color; cmap: the colormap to use
    plt.colorbar()
    
    for n in range(5):
        # top 5 (furthest above the trend); .iloc selects by position
        plt.text(
            df_sort_f["인구수"].iloc[n]*1.02,
            df_sort_f["소계"].iloc[n]*0.98,
            df_sort_f.index[n],
            fontsize = 15
        )

        # bottom 5 (furthest below the trend)
        plt.text(
            df_sort_t["인구수"].iloc[n]*1.02,
            df_sort_t["소계"].iloc[n]*0.98,
            df_sort_t.index[n],
            fontsize = 15
        )

    plt.plot(fx, f1(fx), ls="dashed", lw=3, color = "g")
    plt.xlabel("인구수")
    plt.ylabel("CCTV")
    plt.grid(True)
    plt.show()
drawGraph()

In [186]:

data_result.to_csv("../data/01. CCTV_result.csv", sep=",",encoding="utf-8")
