03-2. Naver Movie Ranking

1. Analyzing the Naver movie ratings site

http://movie.naver.com
Go to the Movie Ranking tab
Under Movie Ranking, select By Rating (Now Showing)

https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=cur&date=20230307

The URL shows the rule we need to get the data we want: changing the value of the date parameter takes us to the ranking page for that day.
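The URL rule can be sketched as a small helper. The names `BASE_URL` and `build_rank_url` are assumptions made for this illustration and do not appear in the original notebook. The only facts taken from the URL above are the `sel=cur` parameter and the `YYYYMMDD` date format.

```python
from datetime import date

# Base address copied from the ranking URL above.
BASE_URL = "https://movie.naver.com/movie/sdb/rank/rmovie.naver"


def build_rank_url(day, sel="cur"):
    """Return the ranking URL for one day; the date is written as YYYYMMDD."""
    return f"{BASE_URL}?sel={sel}&date={day.strftime('%Y%m%d')}"


print(build_rank_url(date(2023, 3, 7)))
```

Any object with a `strftime` method works here, so a pandas `Timestamp` can be passed in the same way as a `datetime.date`.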

In [1]:

#requirements

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import warnings
warnings.filterwarnings(action="ignore")

In [ ]:

url = "https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=cur&date=20230307"
response = urlopen(url)
# response.status

soup = BeautifulSoup(response,"html.parser")
print(soup.prettify())

In [ ]:

# movie title tags
soup.find_all("div", class_="tit5")  # same as soup.select("div.tit5"); in a CSS selector a class is written with "."

In [4]:

soup.find_all("div", class_="tit5")[0].a.string

Out[4]:

'아임 히어로 더 파이널'

In [5]:

soup.find_all("div", class_="tit5")[0].find("a").text

Out[5]:

'아임 히어로 더 파이널'

In [6]:

soup.select("div.tit5")[0].select_one("a").get_text()

Out[6]:

'아임 히어로 더 파이널'

In [ ]:

# movie rating tags
soup.find_all("td","point")  # same as soup.select("td.point")

In [8]:

len(soup.find_all("td","point"))

Out[8]:

50

In [9]:

soup.find_all("td","point")[0].text  # same as soup.select(".point")[0].string

Out[9]:

'9.87'

In [ ]:

# list of movie titles

end = len(soup.find_all("div","tit5"))
movie_name = []

for n in range(0, end):
    movie_name.append(soup.find_all("div","tit5")[n].a.text)

# equivalent list comprehension:
# movie_name = [soup.find_all("div","tit5")[n].a.text for n in range(0, end)]

movie_name

In [ ]:

# list of movie ratings
end = len(soup.find_all("td","point"))

movie_point = [soup.find_all("td","point")[n].text for n in range(0, end)]
movie_point

In [13]:

# check the total number of items
len(movie_name), len(movie_point)

Out[13]:

(50, 50)
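The cells above call `find_all` again on every loop iteration. A possible alternative is to query the document once per selector and pair the results with `zip`. The HTML fragment below is made up for this sketch and only mimics the tags used above (`div.tit5 > a` and `td.point`); the real page contains more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure implied by the selectors above.
sample_html = """
<table>
  <tr><td><div class="tit5"><a href="#">Movie A</a></div></td><td class="point">9.87</td></tr>
  <tr><td><div class="tit5"><a href="#">Movie B</a></div></td><td class="point">9.12</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Query the document once per selector instead of once per loop iteration.
titles = [a.get_text(strip=True) for a in soup.select("div.tit5 > a")]
points = [float(td.get_text(strip=True)) for td in soup.select("td.point")]

# zip pairs each title with the rating at the same position.
ranking = list(zip(titles, points))
print(ranking)
```

Pairing by position assumes every title has exactly one rating cell, which is what the equal lengths checked above suggest for this page.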

2. Code for automation

In [14]:

date = pd.date_range("2023.01.01", periods=100, freq="D") 
date

Out[14]:

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10', '2023-01-11', '2023-01-12',
               '2023-01-13', '2023-01-14', '2023-01-15', '2023-01-16',
               '2023-01-17', '2023-01-18', '2023-01-19', '2023-01-20',
               '2023-01-21', '2023-01-22', '2023-01-23', '2023-01-24',
               '2023-01-25', '2023-01-26', '2023-01-27', '2023-01-28',
               '2023-01-29', '2023-01-30', '2023-01-31', '2023-02-01',
               '2023-02-02', '2023-02-03', '2023-02-04', '2023-02-05',
               '2023-02-06', '2023-02-07', '2023-02-08', '2023-02-09',
               '2023-02-10', '2023-02-11', '2023-02-12', '2023-02-13',
               '2023-02-14', '2023-02-15', '2023-02-16', '2023-02-17',
               '2023-02-18', '2023-02-19', '2023-02-20', '2023-02-21',
               '2023-02-22', '2023-02-23', '2023-02-24', '2023-02-25',
               '2023-02-26', '2023-02-27', '2023-02-28', '2023-03-01',
               '2023-03-02', '2023-03-03', '2023-03-04', '2023-03-05',
               '2023-03-06', '2023-03-07', '2023-03-08', '2023-03-09',
               '2023-03-10', '2023-03-11', '2023-03-12', '2023-03-13',
               '2023-03-14', '2023-03-15', '2023-03-16', '2023-03-17',
               '2023-03-18', '2023-03-19', '2023-03-20', '2023-03-21',
               '2023-03-22', '2023-03-23', '2023-03-24', '2023-03-25',
               '2023-03-26', '2023-03-27', '2023-03-28', '2023-03-29',
               '2023-03-30', '2023-03-31', '2023-04-01', '2023-04-02',
               '2023-04-03', '2023-04-04', '2023-04-05', '2023-04-06',
               '2023-04-07', '2023-04-08', '2023-04-09', '2023-04-10'],
              dtype='datetime64[ns]', freq='D')

In [15]:

date[0]

Out[15]:

Timestamp('2023-01-01 00:00:00', freq='D')

In [16]:

# change the date format
date[0].strftime("%Y-%m-%d")

Out[16]:

'2023-01-01'

In [17]:

# string formatting
test_string = "Hi, I'm {name}"
test_string.format(name = "홍길동")
# dir(test_string): lists the attributes and methods available on the string

Out[17]:

"Hi, I'm 홍길동"

In [18]:

import time
from tqdm import tqdm

movie_date = []
movie_name = []
movie_point = []

for today in tqdm(date):
    url = "https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=cur&date={date}"
    response = urlopen(url.format(date = today.strftime("%Y%m%d")))
    soup = BeautifulSoup(response, "html.parser")
    
    end = len(soup.find_all("td","point"))
    movie_date.extend([today for _ in range(0, end)])
    movie_name.extend([soup.select("div.tit5")[n].find("a").text for n in range(0, end)])
    movie_point.extend([soup.find_all("td","point")[n].text for n in range(0, end)])
    
    time.sleep(0.5)  # pause between requests so the crawler is less likely to be blocked

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [02:23<00:00,  1.43s/it]
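The loop above calls `urlopen` directly, so a single failed request would stop the whole 100-day crawl. Below is a minimal sketch of a retrying fetch helper. The name `fetch_html`, the injectable `opener` parameter, the retry count, and the generic User-Agent string are all assumptions made for this illustration, not part of the original notebook.

```python
import time
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_html(url, opener=urlopen, retries=3, delay=0.5):
    """Return the raw response body for url, retrying on network errors."""
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    last_error = None
    for attempt in range(retries):
        try:
            with opener(request) as response:
                return response.read()
        except URLError as error:  # HTTPError is a subclass of URLError
            last_error = error
            time.sleep(delay * (attempt + 1))  # wait a little longer after each failure
    raise last_error
```

Inside the loop, the result could be passed straight to the parser, for example `BeautifulSoup(fetch_html(url.format(date=today.strftime("%Y%m%d"))), "html.parser")`. The `opener` parameter exists only so the helper can be exercised without network access.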

In [19]:

len(movie_date), len(movie_name), len(movie_point)

Out[19]:

(4618, 4618, 4618)

In [20]:

movie_name[:5], movie_date[:5], movie_point[:5]

Out[20]:

(['씽2게더', '올빼미', '시네마 천국', '극장판 주술회전 0', '어바웃 타임'],
 [Timestamp('2023-01-01 00:00:00', freq='D'),
  Timestamp('2023-01-01 00:00:00', freq='D'),
  Timestamp('2023-01-01 00:00:00', freq='D'),
  Timestamp('2023-01-01 00:00:00', freq='D'),
  Timestamp('2023-01-01 00:00:00', freq='D')],
 ['9.37', '9.36', '9.33', '9.22', '9.18'])

In [21]:

movie = pd.DataFrame({
    "date": movie_date, 
    "name": movie_name,
    "point": movie_point
})
movie.tail()

Out[21]:

	date	name	point
4613	2023-04-10	대외비	6.18
4614	2023-04-10	블랙 팬서: 와칸다 포에버	6.04
4615	2023-04-10	교섭	5.81
4616	2023-04-10	마루이 비디오	5.62
4617	2023-04-10	귀멸의 칼날: 상현집결, 그리고 도공 마을로	5.28

In [22]:

movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4618 entries, 0 to 4617
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    4618 non-null   datetime64[ns]
 1   name    4618 non-null   object        
 2   point   4618 non-null   object        
dtypes: datetime64[ns](1), object(2)
memory usage: 108.4+ KB

In [23]:

movie["point"] = movie["point"].astype(float)
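`astype(float)` raises a `ValueError` if any cell cannot be parsed as a number. If the scraped column might contain such cells, `pd.to_numeric` with `errors="coerce"` is one alternative. The sketch below uses made-up data, and the `"-"` entry is a hypothetical bad value, not something observed in the crawl above.

```python
import pandas as pd

# Made-up sample; "-" stands in for a cell that is not a number.
ratings = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C"],
    "point": ["9.37", "9.36", "-"],
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
ratings["point"] = pd.to_numeric(ratings["point"], errors="coerce")
print(ratings.dtypes)
print(ratings["point"].isna().sum(), "value(s) could not be parsed")
```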

In [24]:

# save the data
movie.to_csv("../data/03. naver_moive_data.csv",sep=",",encoding = "utf-8")

3. Organizing the movie rating data

In [25]:

import numpy as np

In [26]:

movie = pd.read_csv("../data/03. naver_moive_data.csv", index_col=0)
movie

Out[26]:

	date	name	point
0	2023-01-01	씽2게더	9.37
1	2023-01-01	올빼미	9.36
2	2023-01-01	시네마 천국	9.33
3	2023-01-01	극장판 주술회전 0	9.22
4	2023-01-01	어바웃 타임	9.18
...	...	...	...
4613	2023-04-10	대외비	6.18
4614	2023-04-10	블랙 팬서: 와칸다 포에버	6.04
4615	2023-04-10	교섭	5.81
4616	2023-04-10	마루이 비디오	5.62
4617	2023-04-10	귀멸의 칼날: 상현집결, 그리고 도공 마을로	5.28

4618 rows × 3 columns
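A CSV file stores no type information, so the `date` column comes back as plain text after `read_csv` unless it is parsed again. The round trip below is a small sketch on made-up rows, using an in-memory buffer in place of the file, to show the effect of `parse_dates`.

```python
import io

import pandas as pd

frame = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "name": ["Movie A", "Movie B"],
    "point": [9.37, 9.36],
})

buffer = io.StringIO()
frame.to_csv(buffer)  # same defaults as writing to a file on disk

buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)  # date column is read back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, index_col=0, parse_dates=["date"])  # date column is parsed

print(plain["date"].dtype)
print(parsed["date"].dtype)
```

With parsed dates, date-based operations such as resampling or date-aware plot axes become available on the reloaded frame.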

Use the movie name as the index
Compute the sum of the ratings
Pick the best and worst 10 movies by total Naver movie rating over the 100 days
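The three steps above can be sketched on a tiny made-up table. Note that a summed rating rewards movies that stay listed for more days, so the sketch also computes the mean and the number of listed days for comparison. The movie names and numbers are invented for this illustration.

```python
import pandas as pd

# Made-up ratings: "Movie C" is listed on one day only.
sample = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02", "2023-01-01"],
    "name": ["Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
    "point": [9.0, 9.0, 7.0, 7.5, 9.9],
})

# One row per movie: total rating, average rating, and number of days listed.
summary = sample.groupby("name")["point"].agg(["sum", "mean", "count"])

best_by_sum = summary["sum"].nlargest(2)    # favors movies listed on many days
best_by_mean = summary["mean"].nlargest(2)  # ignores how long a movie was listed

print(summary)
print(best_by_sum)
print(best_by_mean)
```

In this sample the highest-rated movie ranks last by sum because it appears only once, which is the same effect that places a single-day entry at the bottom of a sum-based ranking.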

In [27]:

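# pivot table: total rating per movie
# values="point" keeps the date column (plain text after read_csv) out of the sum

movie_unique = pd.pivot_table(data=movie, index="name", values="point", aggfunc="sum")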

In [28]:

movie_best = movie_unique.sort_values(by="point", ascending=False)
movie_best.head()

Out[28]:

	point
name
올빼미	931.48
극장판 주술회전 0	921.20
더 퍼스트 슬램덩크	919.02
러브레터	913.76
영웅	899.86

In [29]:

tmp = movie.query("name == ['여름날 우리']")
tmp

Out[29]:

	date	name	point
1153	2023-01-26	여름날 우리	8.76
1204	2023-01-27	여름날 우리	8.76
1253	2023-01-28	여름날 우리	8.76
1301	2023-01-29	여름날 우리	8.76
1347	2023-01-30	여름날 우리	8.76
...	...	...	...
4394	2023-04-06	여름날 우리	8.76
4442	2023-04-07	여름날 우리	8.76
4490	2023-04-08	여름날 우리	8.76
4538	2023-04-09	여름날 우리	8.76
4586	2023-04-10	여름날 우리	8.76

74 rows × 3 columns

In [30]:

# visualization

import matplotlib.pyplot as plt
from matplotlib import rc

rc("font",family="Malgun Gothic")
%matplotlib inline

In [31]:

plt.figure(figsize=(20,8))
plt.plot(tmp["date"], tmp["point"])  # x-axis: date, y-axis: rating -> line plot of the rating over time
plt.title("Rating by date")
plt.xlabel("Date")
plt.ylabel("Rating")
plt.xticks(rotation="vertical")
plt.legend(labels=["Rating trend"])
plt.grid(True)

In [32]:

# top 10 movies
movie_best.head(10)

Out[32]:

	point
name
올빼미	931.48
극장판 주술회전 0	921.20
더 퍼스트 슬램덩크	919.02
러브레터	913.76
영웅	899.86
너의 이름은.	881.00
아바타: 물의 길	864.79
헤어질 결심	864.32
라라랜드	863.00
오늘 밤, 세계에서 이 사랑이 사라진다 해도	860.81

In [33]:

# bottom 10 movies
movie_best.tail(10)

Out[33]:

	point
name
캐롤	68.88
뮬란	63.70
업	55.98
찬실이는 복도 많지	52.62
범죄도시2	46.69
강남좀비	25.43
너의 췌장을 먹고 싶어	16.50
프레이 포 더 데블	13.40
씽2게더	9.37
초속5센티미터	8.36

In [34]:

movie_pivot = pd.pivot_table(data=movie, index="date", columns="name", values="point")
movie_pivot.head()

Out[34]:

name	3000년의 기다림	TAR 타르	강남좀비	거북이는 의외로 빨리 헤엄친다	겨울왕국	겨울왕국 2	교섭	귀멸의 칼날: 상현집결, 그리고 도공 마을로	그린 나이트	극장판 5등분의 신부	...	탑건: 매버릭	티파니에서 아침을	파수꾼	패터슨	프레이 포 더 데블	프렌치 디스패치	피아니스트의 전설	하녀	항거:유관순 이야기	헤어질 결심
date
2023-01-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	8.3	6.7	NaN	NaN	NaN	NaN	8.67
2023-01-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	8.3	6.7	NaN	NaN	NaN	NaN	8.67
2023-01-03	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	8.3	NaN	NaN	NaN	NaN	NaN	8.66
2023-01-04	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	8.3	NaN	NaN	NaN	NaN	NaN	8.66
2023-01-05	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	9.19	8.3	NaN	NaN	NaN	NaN	NaN	8.66

5 rows × 112 columns
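The wide table above has one row per date and one column per movie, and a cell is `NaN` on days when that movie was not in the ranking. A minimal sketch on made-up rows shows how `pivot_table` produces that shape:

```python
import pandas as pd

# Made-up long-format data: "Movie B" is missing on the second day.
sample = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "name": ["Movie A", "Movie B", "Movie A"],
    "point": [9.0, 7.0, 9.1],
})

# Rows become dates, columns become movie names, cells hold the rating.
wide = sample.pivot_table(index="date", columns="name", values="point")
print(wide)
```

When such a table is plotted, each column becomes one line, and the line has a gap wherever the cell is `NaN`.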

In [35]:

movie_pivot.to_excel("../data/03. movie_pivot.xlsx")
# if the columns come out as a MultiIndex, run movie_pivot.columns = movie_pivot.columns.droplevel()

4. Drawing the graph

In [36]:

import platform
import seaborn as sns
from matplotlib import font_manager, rc

path = "C:/Windows/Fonts/malgun.ttf"

if platform.system() == "Darwin":
    rc("font",family = "Arial Unicode MS")
elif platform.system() == "Windows":
    font_name = font_manager.FontProperties(fname=path).get_name()
    rc("font", family=font_name)
else:
    print("Unknown system")
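The cell above only prints a message on other systems, and the Korean movie titles then render as empty boxes. One possible approach is to pick the first installed font from a list of candidates. The helper names `choose_font` and `apply_korean_font` and the candidate list are assumptions made for this sketch; which fonts are actually installed depends on the machine.

```python
def choose_font(candidates, installed):
    """Return the first candidate font name that is installed, else None."""
    installed = set(installed)
    for name in candidates:
        if name in installed:
            return name
    return None


def apply_korean_font(candidates=("Malgun Gothic", "AppleGothic", "NanumGothic")):
    """Set the first available candidate as the matplotlib font and return its name."""
    # Imported here so choose_font stays usable without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib import font_manager, rc

    installed = [font.name for font in font_manager.fontManager.ttflist]
    name = choose_font(candidates, installed)
    if name is not None:
        rc("font", family=name)
        plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign readable
    return name
```

Calling `apply_korean_font()` once before plotting returns the chosen font name, or `None` if none of the candidates is installed.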

In [37]:

target_col = ["씽2게더", "올빼미", "시네마 천국", "극장판 주술회전 0", "어바웃 타임"]
plt.figure(figsize=(20,8))
plt.title("Rating by date")
plt.xlabel("Date")
plt.xticks(rotation="vertical")
plt.tick_params(bottom=False, labelbottom=False)  # hide the x-axis ticks and tick labels
plt.ylabel("Rating")
plt.plot(movie_pivot[target_col])
plt.legend(target_col, loc="best")
plt.grid(True)
