
[11.16 Tue] Big Data Analysis Engineer (빅데이터 분석기사) Practical Exam - Pandas (Lectures 1~5)

티로즈 2021. 11. 16. 20:42

1. Importing the library

import pandas as pd

2. Reading files

Excel file : pd.read_excel('filename', engine='openpyxl') # add engine when running in the Colab environment
CSV file : pd.read_csv('filename') # no engine argument needed; 'openpyxl' is an Excel engine and is not valid for read_csv

# [1-1] Read youtube_rank.xlsx into a DataFrame and name it df

import pandas as pd

df = pd.read_excel('data_01/youtube_rank.xlsx', engine='openpyxl')

print(type(df))

# Keep a copy of the original

original = df.copy()

temp = pd.read_csv('data_01/youtube_rank.csv')

print(type(temp))
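Why keep `original = df.copy()`? A minimal sketch, assuming a tiny in-memory CSV (the `csv_text` contents and column names are made up for illustration and are not the real youtube_rank file), showing that changes to the working frame do not reach the copy:

```python
import io

import pandas as pd

# Hypothetical stand-in for youtube_rank.csv, kept in memory for the example
csv_text = "title,subscriber\nchannel_a,100만\nchannel_b,50만\n"

df = pd.read_csv(io.StringIO(csv_text))  # read_csv takes no Excel engine
original = df.copy()                     # independent copy of the data

df.loc[0, "subscriber"] = "changed"      # modify only the working frame

print(df.loc[0, "subscriber"])        # changed
print(original.loc[0, "subscriber"])  # 100만
```

Restoring the data later is then just `df = original.copy()`, which is what step [1-12] does.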

3. Checking the data structure

DataFrame.head(n=5) : returns the first n rows
DataFrame.tail(n=5) : returns the last n rows
DataFrame.info(memory_usage='deep') : shows the number of rows, each column's non-null count and dtype, and memory usage
DataFrame.shape : returns the number of rows and columns as a tuple

# [1-2] Print the first 5 rows of df to check its contents

df.head()

df.tail(3)

# [1-3] Check the number of rows in df, the information for each column, and memory usage.

df.info()

# [1-4] Check the number of rows and columns in df. (use shape)

df.shape
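The same calls can be tried without the lecture file. This is a small sketch on a made-up 10-row frame (the `rank` and `score` columns are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical 10-row frame used only to show what each call returns
df = pd.DataFrame({"rank": range(1, 11), "score": range(10, 110, 10)})

first = df.head()   # n defaults to 5
last3 = df.tail(3)  # last 3 rows

print(df.shape)                # (10, 2)
print(first["rank"].tolist())  # [1, 2, 3, 4, 5]
print(last3["rank"].tolist())  # [8, 9, 10]
```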

4. DataFrame components

DataFrame.index : row index
DataFrame.columns : column index
DataFrame.values : the data as a 2-D array

# [1-5] Check the index of df

df.index

# [1-6] Check the columns of df

df.columns

# [1-7] Check the values of df

# ndarray : NumPy array; df.values is 2-D

df.index.values[:10] # array (NumPy ndarray)

df.columns.values
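As a quick check of the three components, here is a sketch on a made-up frame (columns `a` and `b` are illustrative, not from the lecture data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(type(df.index).__name__)  # RangeIndex
print(df.columns.tolist())      # ['a', 'b']
print(type(df.values).__name__, df.values.ndim, df.values.shape)  # ndarray 2 (3, 2)
```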

5. A single DataFrame column is a Series

DataFrame[column] : Series
DataFrame[[column1, column2, ...]] : DataFrame

# [1-8] Print the first 5 rows of the 'video' column of the DataFrame.

# dtype : 'object' -> strings

s = df['video']

s.head()

s = df[['video']] # the outer brackets do the indexing; the inner brackets are a list of column names

s.head(3) # outside a notebook such as Colab, use print(s.head(3))
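The difference between single and double brackets is easiest to see from the types and shapes. A minimal sketch with made-up data (the values are assumptions, only the `video` column name follows the notes):

```python
import pandas as pd

# Hypothetical two-row frame for illustration
df = pd.DataFrame({"title": ["a", "b"], "video": ["10개", "20개"]})

one = df["video"]    # single brackets: Series
two = df[["video"]]  # list of column names: DataFrame with one column

print(type(one).__name__, one.shape)  # Series (2,)
print(type(two).__name__, two.shape)  # DataFrame (2, 1)
```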

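6. Series components

Series.index : row index of the Series
Series.values : the Series data as a 1-D array
The index and columns of a DataFrame and the index of a Series can be changed by assignment, but the new labels must have the same length as the existing ones

# [1-9] Check the index of the Series

s = df['video'] # s was last assigned a one-column DataFrame above, so select the Series again

s.index

# [1-10] Check the values of the Series

# 1-D NumPy array (ndarray)

s.values[:10]

# [1-11] Rename the columns to ['채널', '카테고리', '구독자', '조회', '영상'] (channel, category, subscribers, views, videos).

# Then print the first 2 rows.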

df.columns = ['채널', '카테고리', '구독자', '조회', '영상']

df.head(2)
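The "same length" rule for assigning labels can be checked directly. A sketch with a made-up two-column frame (names are illustrative):

```python
import pandas as pd

# Hypothetical frame with two columns
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.columns = ["x", "y"]  # same number of labels: accepted

failed = False
try:
    df.columns = ["only_one"]  # one label for two columns
except ValueError as err:
    failed = True
    print("ValueError:", err)

print(df.columns.tolist())  # ['x', 'y']
```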

[Changing data types]

7. How to check dtypes

DataFrame.info() : shows non-null counts and memory usage in addition to dtypes
DataFrame.dtypes : dtype of each column
Series.dtype : dtype of the Series

# [1-12] Copy the original and name the copy df.

df = original.copy()

# [1-13] Print the first 5 rows of df.

df.head(5)

# [1-14] Check the dtype of each column of df.

df.dtypes

# [1-15] Check the dtype of the 'subscriber' column of df.

print(df['subscriber'].dtype) # prints object (the repr is dtype('O'))

# [1-16] Print the first 3 rows of 'subscriber' in df to check the contents.

df['subscriber'].head(3)

8. How to change the data type

Series.astype(type)
Specifying the type : as a string such as 'int', 'int32', 'int64', 'float', 'str', 'category', ..., or as a NumPy type such as np.int16, np.float32, np.datetime64, ...
To use NumPy types, run import numpy as np first

# [1-17] Try changing the dtype of the 'subscriber' column of df to 'int64'

# This raises an error: values that contain Korean characters, special characters, etc. cannot be converted to integers

df['subscriber'].astype('int64')
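The failure can be reproduced on a couple of made-up values. This sketch assumes strings shaped like the subscriber column (the numbers are invented) and shows that the conversion only succeeds once the strings contain digits alone:

```python
import pandas as pd

# Hypothetical values shaped like the 'subscriber' column
s = pd.Series(["2510만", "1200만"])

try:
    s.astype("int64")
    converted = True
except ValueError as err:
    converted = False
    print("ValueError:", err)

# Strings made only of digits convert without trouble
digits = pd.Series(["25100000", "12000000"]).astype("int64")
print(digits.sum())  # 37100000
```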

9. How to change data values

Series.replace(target, replacement)
Series.replace([target1, target2, ...], [replacement1, replacement2, ...])
Series.replace({target1: replacement1, target2: replacement2, ...})
DataFrame also has a replace method
By default, replace matches a target against the whole value
With regex=True, the target can match just part of a value
regex => regular expression

# [1-18] Write code that changes '만' (ten thousand) to '0000' in the 'subscriber' column of df.

# By default, replace works on whole values.

# To match part of a value, use regex=True.

# regex (= regular expression) : a way to match part of a string or a string pattern

df['subscriber'].replace('만','0000',regex=True).head(5)

# With regex=True, '만' is found and replaced even in the middle of a string

# replace returns a new Series and leaves the original unchanged, so assign the result to keep it

# [1-19] Change '만' to '0000' in the 'subscriber' column of df, then use astype('int64') to change the dtype.

df['subscriber'].replace('만','0000',regex=True).astype('int64')

# When there are several targets to replace, use two lists or a dict.

df['view'].replace(['억', '만'],['00000000','0000'], regex=True).head(5)

df['view'].replace({'억':'00000000', '만':'0000'}, regex=True).head(5)

# [1-22] For the 'view' column of df, delete '억' (hundred million) and change '만' to '0000',

# then use astype('int64') to convert the data type to integer and print the first 5 values.

df['view'].replace(['억', '만'],['','0000'], regex=True).astype('int64').head(5)
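To make the whole-value versus partial-match behavior concrete, here is a sketch on two made-up strings (not taken from the lecture data):

```python
import pandas as pd

# Hypothetical values: one contains '만', the other is exactly '만'
s = pd.Series(["2510만", "만"])

whole = s.replace("만", "0000")                 # matches whole values only
partial = s.replace("만", "0000", regex=True)   # matches anywhere in the string

print(whole.tolist())    # ['2510만', '0000']
print(partial.tolist())  # ['25100000', '0000']
print(s.tolist())        # ['2510만', '만']  (the original is unchanged)
```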

# [1-23] Check the first 5 values of the 'video' column of df.

df['video'].head(5)

# [1-24] Remove '개' (the counter suffix) and ',' from the 'video' column of df, then print the first 5 values.

# A comma is not a regex metacharacter, so it can be written as ',' without a backslash.

# Metacharacters : . + ? * ^ $ [ ] ( ) { } | \ ; to match one literally, escape it in a raw string, e.g. r'\.' or r'\['

df['video'].replace(['개', ','], ['',''], regex=True).head(5)

# [1-25] Remove '개' and ',' from the 'video' column of df, convert it with astype('int32'), then print the first 5 values.

df['video'].replace(['개', ','], ['',''],regex=True).astype('int32').head(5)
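Escaping matters for characters such as '.', which does have a special meaning in a regular expression. The sketch below uses made-up strings to compare an unescaped '.', an escaped one, and a plain comma:

```python
import pandas as pd

# Hypothetical values containing a comma and a period
s = pd.Series(["1,234개", "3.5천개"])

no_comma = s.replace(",", "", regex=True)   # ',' needs no escaping
wild = s.replace(".", "", regex=True)       # '.' matches ANY character
literal = s.replace(r"\.", "", regex=True)  # escaped: a literal period only

print(no_comma.tolist())  # ['1234개', '3.5천개']
print(wild.tolist())      # ['', '']
print(literal.tolist())   # ['1,234개', '35천개']
```

Writing patterns as raw strings (`r'...'`) keeps Python from interpreting the backslash itself.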

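10. Counting occurrences of each value in the original data

Series.value_counts() : returns the count of each value as a Series
The result is sorted in descending order of count
The values become the index and the counts become the values

Sorting by index

DataFrame / Series.sort_index(ascending=True)

# [1-26] Check the count of each value in the 'category' column of df.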

df['category'].value_counts()
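A small sketch of both orderings, using made-up category labels (chosen so that no two counts tie):

```python
import pandas as pd

# Hypothetical category labels
s = pd.Series(["music", "game", "music", "news", "music", "game"])

counts = s.value_counts()        # sorted by count, descending
by_label = counts.sort_index()   # re-sorted by the index (the labels)

print(counts.index.tolist())    # ['music', 'game', 'news']
print(counts.tolist())          # [3, 2, 1]
print(by_label.index.tolist())  # ['game', 'music', 'news']
```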

# [1-27] Remove the first and last characters, '[' and ']', from 'category' in df.

# replace can remove them, but the str accessor can also be used (see str[1:-1] below).

#df['category'].replace({r'\[':'', r'\]':''}, regex=True).head() # also removes brackets that are not the first or last character

temp = df['category'].replace({r'\[':'', r'\]':''}, regex=True)

temp.value_counts()

# [1-28] Remove '[' and ']' from 'category' in df, then use astype('category') to change it to the category type.

df['category'].replace([r'\[',r'\]'], '', regex=True).astype('category') # category keeps each distinct value once plus integer codes, which usually reduces memory when values repeat

df['category'].str[1:-1].astype('category')
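The two approaches differ when a bracket appears somewhere other than the ends. The sketch below uses made-up labels, including one artificial value with inner brackets, and also peeks at the integer codes behind a category column through the `.cat` accessor (which the notes above do not cover):

```python
import pandas as pd

# Hypothetical labels; the last one has brackets in the middle as well
s = pd.Series(["[music]", "[game]", "[music]", "[k[pop]]"])

by_replace = s.replace([r"\[", r"\]"], "", regex=True)  # removes every bracket
by_slice = s.str[1:-1]                                   # drops first and last character only

print(by_replace.tolist())  # ['music', 'game', 'music', 'kpop']
print(by_slice.tolist())    # ['music', 'game', 'music', 'k[pop]']

cat = by_slice.astype("category")
print(cat.cat.categories.tolist())  # ['game', 'k[pop]', 'music']
print(cat.cat.codes.tolist())       # [2, 0, 2, 1]
```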

# [1-29] Summary of the work above using replace and the str accessor.

df['subscriber'] = df['subscriber'].replace('만', '0000', regex=True).astype('int64')

df['view'] = df['view'].replace({'억':'', '만':'0000'}, regex=True).astype('int64')

df['video'] = df['video'].replace(['개', ','], '', regex=True).astype('int32')

df['category'] = df['category'].str[1:-1].astype('category')

df.info()

# [1-30] Save as the Excel file 'youtube_v1.xlsx'

df.to_excel('youtube_v1.xlsx', index=False)

# The saved file is deleted automatically from the Colab session storage, so download it and upload it again to reuse it later.

temp = pd.read_excel('youtube_v1.xlsx', engine='openpyxl')

temp.info()

11. Converting to datetime and category

pd.to_datetime(Series, format='format')
- %Y : 4-digit year, %y : 2-digit year, %m : 2-digit month, %d : 2-digit day
- More format codes : https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
- Specifying format is optional
pd.Categorical(Series, categories=['category1', 'category2', ...], ordered=None)
- With ordered=True the result is an ordered categorical, and sorting follows the given category order
- Specifying categories and ordered is optional

temp = pd.DataFrame({'날짜_일반': ['2021/01/01', '2021/01/02', '2021/01/03', '2021/01/04', '2021/01/05'],
                     '날짜_시간': ['2021-01-01 1:12:10', '2021-01-02 1:13:45', '2021-01-03 2:50:10', '2021-01-04 3:12:30', '2021-01-05 5:40:20'],
                     '날짜_특수': ['21-01-01', '21-01-02', '21-01-03', '21-01-04', '21-01-05'],
                     '범주': ['금', '토', '일', '월', '화']})
print(temp)

import numpy as np
# Change the dtype of '날짜_일반' with Series.astype
# pandas 2.0 and later reject the unit-less np.datetime64, so the unit is given explicitly
s1 = temp['날짜_일반'].astype('datetime64[ns]')
print(s1)

# Change the dtype of '날짜_특수' with pd.to_datetime(Series, format='format')
s4 = pd.to_datetime(temp['날짜_특수'], format='%y-%m-%d')
print(s4)
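To see what the format string buys, here is a sketch on two made-up two-digit-year strings. Without `format`, a value like '21-01-01' is ambiguous, so spelling out `%y-%m-%d` states which part is the year:

```python
import pandas as pd

# Hypothetical date strings with a 2-digit year
s = pd.Series(["21-01-01", "21-01-02"])

parsed = pd.to_datetime(s, format="%y-%m-%d")

print(parsed.iloc[0])  # 2021-01-01 00:00:00
print(parsed.iloc[1])  # 2021-01-02 00:00:00
```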

# Change the dtype of '범주' with Series.astype('category')
s5 = temp['범주'].astype('category')

# Sort the '범주' created above (use Series.sort_values())
s5.sort_values()

# Use pd.Categorical(Series, categories=category list, ordered=True) to build a category ordered by day of the week
# Day list => ['월', '화', '수', '목', '금', '토', '일'] (Monday through Sunday)
s6= pd.Categorical(temp['범주'], categories=['월', '화', '수', '목', '금', '토', '일'], ordered=True)
print(s6)

# Sort the ordered category created above (s6 is a Categorical, which also has sort_values())
s6.sort_values()
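The effect of `ordered=True` shows up when sorting. This sketch uses English weekday abbreviations as stand-ins for the Korean labels above (an assumption made only to keep the expected order easy to read) and wraps the Categorical in a Series:

```python
import pandas as pd

# Stand-in labels for the weekday column
days = pd.Series(["Fri", "Sat", "Sun", "Mon", "Tue"])
week = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Plain category: categories are inferred and sorted alphabetically
plain = days.astype("category").sort_values()

# Ordered category: sorting follows the order given in `week`
ordered = pd.Series(pd.Categorical(days, categories=week, ordered=True)).sort_values()

print(plain.tolist())    # ['Fri', 'Mon', 'Sat', 'Sun', 'Tue']
print(ordered.tolist())  # ['Mon', 'Tue', 'Fri', 'Sat', 'Sun']
```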

Source : YouTube lecture https://youtu.be/36RBSIP0t-8
