4.10 Computing summary statistics with `summarise()`

4.10.1 Basic form of the `summarise()` function

summarise(dataframe, …, .groups = NULL)

summarize(dataframe, …, .groups = NULL)

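dataframe : the data set (a data frame)

… : name-value pairs of summary expressions. Each value can be a function that returns a single value (sum, mean, sd, var, …), a function that returns n values (quantile, …), or a data frame that adds several columns from a single expression

.groups : how the result is grouped; one of "drop_last", "drop", "keep", or "rowwise"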
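As a rough illustration of how the .groups options behave, the sketch below groups Cars93 by two variables and inspects the grouping of each result with group_vars(). The choice of Type and Origin as grouping variables is only for illustration, and the code assumes dplyr 1.0.0 or later and an installed MASS package (which ships Cars93).

```r
library(dplyr)
data(Cars93, package = "MASS")   # assumption: MASS is installed

by_type_origin <- group_by(Cars93, Type, Origin)

# Default (.groups = NULL): with one row per group, the last grouping
# variable is dropped, so the result is still grouped by Type.
res_default <- summarise(by_type_origin, n = n())
group_vars(res_default)

# "drop": the result carries no grouping at all.
res_drop <- summarise(by_type_origin, n = n(), .groups = "drop")
group_vars(res_drop)

# "keep": the result keeps both grouping variables.
res_keep <- summarise(by_type_origin, n = n(), .groups = "keep")
group_vars(res_keep)
```

Choosing "drop" explicitly is a common way to avoid surprises when further verbs are chained after summarise(), since later operations then act on the whole table rather than per group.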

The following base R functions are typical examples of numeric summary statistics that can be used inside summarise(). In each case, adding na.rm = TRUE excludes missing values (NA) from the calculation.

mean(x, na.rm = TRUE) : mean

median(x, na.rm = TRUE) : median

sd(x, na.rm = TRUE) : standard deviation

min(x, na.rm = TRUE) : minimum

max(x, na.rm = TRUE) : maximum

IQR(x, na.rm = TRUE) : interquartile range, the third quartile minus the first quartile (IQR = Q3 - Q1)

sum(x, na.rm = TRUE) : sum
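To make the role of na.rm concrete, here is a minimal sketch on a made-up data frame (the name toy and its values are hypothetical, not part of Cars93). It compares the same statistic with and without na.rm = TRUE.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(x = c(10, 20, NA, 40))

res <- summarise(toy,
                 mean_raw = mean(x),                # NA propagates
                 mean_ok  = mean(x, na.rm = TRUE),  # NA dropped first
                 sum_ok   = sum(x, na.rm = TRUE),
                 max_ok   = max(x, na.rm = TRUE))
res
```

Without na.rm = TRUE, a single NA makes mean() return NA; with it, the statistic is computed from the three remaining values.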

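4.10.2 Calculating summary statistics with `summarise()`

For the Price variable of the Cars93 data frame, let us compute a) the mean, b) the median, c) the standard deviation, d) the minimum, e) the maximum, f) the interquartile range (IQR), and g) the sum. Missing values are excluded from each calculation (na.rm = TRUE).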

# summarise() : Summarise numeric values
# mean(), median(), sd(), min(), max(), IQR(), sum()
# IQR : IQR(Inter quartile Range) = Upper Quartile(Q3) - Lower Quartile(Q1)
library(dplyr)                     # provides summarise()
data(Cars93, package = "MASS")     # Cars93 ships with the MASS package
summarise(Cars93,
          Price_mean = mean(Price, na.rm = TRUE),       # mean of Price
          Price_median = median(Price, na.rm = TRUE),   # median of Price
          Price_sd = sd(Price, na.rm = TRUE),           # standard deviation of Price
          Price_min = min(Price, na.rm = TRUE),         # min of Price
          Price_max = max(Price, na.rm = TRUE),         # max of Price
          Price_IQR = IQR(Price, na.rm = TRUE),         # IQR of Price
          Price_sum = sum(Price, na.rm = TRUE))         # sum of Price

##   Price_mean Price_median Price_sd Price_min Price_max Price_IQR Price_sum
## 1   19.50968         17.7  9.65943       7.4      61.9      11.1    1814.4

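4.10.3 Counting observations and picking values by position with `summarise()`

The following functions can be used inside summarise() to count observations or to pick out a value by its position.

n() : number of observations; takes no variable as an argument

n_distinct(x) : number of unique (non-duplicated) values of the variable x

first(x) : first value of the variable x

last(x) : last value of the variable x

nth(x, n) : n-th value of the variable x

Using the Cars93_1 data frame, let us check a) the total number of observations, b) the number of distinct manufacturers (Manufacturer), c) the manufacturer of the first observation, d) the manufacturer of the last observation, and e) the manufacturer of the fifth observation.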

# summarise() : n(), n_distinct(), first(), last(), nth() 
Cars93_1 <- Cars93[c(1:10), c("Manufacturer", "Model", "Type")]       # subset for better print 
Cars93_1

##    Manufacturer      Model    Type
## 1         Acura    Integra   Small
## 2         Acura     Legend Midsize
## 3          Audi         90 Compact
## 4          Audi        100 Midsize
## 5           BMW       535i Midsize
## 6         Buick    Century Midsize
## 7         Buick    LeSabre   Large
## 8         Buick Roadmaster   Large
## 9         Buick    Riviera Midsize
## 10     Cadillac    DeVille   Large

summarise(Cars93_1, 
          tot_cnt = n(),                                    # counting the number of all observations 
          Manufacturer_dist_cnt = n_distinct(Manufacturer), # distinct number of var 
          First_obs = first(Manufacturer),                  # first observation 
          Last_obs = last(Manufacturer),                    # last observation 
          Nth_5th_obs = nth(Manufacturer, 5))               # n'th observation

##   tot_cnt Manufacturer_dist_cnt First_obs Last_obs Nth_5th_obs
## 1      10                     5     Acura Cadillac         BMW

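4.10.4 Computing summary statistics by group with `summarise()`

To compute statistics by group (grouped operations) with summarise(), first group the data with group_by().

For each car type (Type) in the Cars93 data frame, let us compute a) the number of observations, b) the number of distinct manufacturers, c) the mean price (Price), and d) the standard deviation of the price. Missing values are excluded from the calculation.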

# summarise by group
grouped <- group_by(Cars93, Type)
summarise(grouped,
          n_count = n(),                                  # counting the number of cars
          Manufacturer_cnt = n_distinct(Manufacturer),    # distinct number of Manufacturer
          Price_mean = mean(Price, na.rm = TRUE),         # mean of Price
          Price_sd = sd(Price, na.rm = TRUE)              # standard deviation of Price
          )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 6 x 5
##   Type    n_count Manufacturer_cnt Price_mean Price_sd
##   <fct>     <int>            <int>      <dbl>    <dbl>
## 1 Compact      16               15       18.2     6.69
## 2 Large        11               10       24.3     6.34
## 3 Midsize      22               20       27.2    12.3 
## # ... with 3 more rows

# or, equivalently, with the pipe operator
Cars93 %>%
  group_by(Type) %>%
  summarise(n_count = n(),                                # counting the number of cars
            Manufacturer_cnt = n_distinct(Manufacturer),  # distinct number of Manufacturer
            Price_mean = mean(Price, na.rm = TRUE),       # mean of Price
            Price_sd = sd(Price, na.rm = TRUE)            # standard deviation of Price
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 6 x 5
##   Type    n_count Manufacturer_cnt Price_mean Price_sd
##   <fct>     <int>            <int>      <dbl>    <dbl>
## 1 Compact      16               15       18.2     6.69
## 2 Large        11               10       24.3     6.34
## 3 Midsize      22               20       27.2    12.3 
## # ... with 3 more rows

4.10.5 Applying the same summary functions to multiple variables

The summarise_each() function applies the same summary functions to several variables at once.

Let us apply three functions, a) mean, b) median, and c) standard deviation (sd), simultaneously to two variables of the Cars93 data frame: price (Price) and highway fuel economy (MPG.highway).

# summarize_each() : applies the same summary function(s) to multiple variables 
summarise_each(Cars93, funs(mean, median, sd), Price, MPG.highway)

## Warning: `summarise_each_()` is deprecated as of dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

##   Price_mean MPG.highway_mean Price_median MPG.highway_median Price_sd
## 1   19.50968         29.08602         17.7                 28  9.65943
##   MPG.highway_sd
## 1       5.331726

As the warnings indicate, summarise_each() has been deprecated since dplyr 0.7.0, and across() is now used in its place.

Replacing summarise_each() with across() inside summarise() gives the following.

# Using across()
summarise(Cars93, across(c(Price, MPG.highway), list(mean=mean, median=median, sd=sd), na.rm= TRUE))

##   Price_mean Price_median Price_sd MPG.highway_mean MPG.highway_median
## 1   19.50968         17.7  9.65943         29.08602                 28
##   MPG.highway_sd
## 1       5.331726

Cars93 %>% 
  summarise(across(c(Price, MPG.highway), list(mean=mean, median=median, sd=sd), na.rm= TRUE))

##   Price_mean Price_median Price_sd MPG.highway_mean MPG.highway_median
## 1   19.50968         17.7  9.65943         29.08602                 28
##   MPG.highway_sd
## 1       5.331726
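Recent dplyr releases (1.1.0 onward) deprecate passing extra arguments such as na.rm through across() itself. A possible alternative, sketched below, wraps each function in a lambda that supplies na.rm = TRUE directly; the names stats, overall, and by_type are introduced here only for illustration, and the code assumes the MASS package is installed.

```r
library(dplyr)
data(Cars93, package = "MASS")   # assumption: MASS is installed

# Each lambda sets na.rm = TRUE itself instead of relying on across(...)
stats <- list(
  mean   = ~ mean(.x, na.rm = TRUE),
  median = ~ median(.x, na.rm = TRUE),
  sd     = ~ sd(.x, na.rm = TRUE)
)

# One row of summaries for the whole data frame
overall <- summarise(Cars93, across(c(Price, MPG.highway), stats))
overall

# The same call after group_by() gives one row per car type
by_type <- Cars93 %>%
  group_by(Type) %>%
  summarise(across(c(Price, MPG.highway), stats))
by_type
```

The output columns are named by combining the variable name and the list name (for example Price_mean), matching the across() results shown above.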
