4.2 Functions in the dplyr package
The example data is the Cars93 data frame from the MASS package.
The original data has 93 car observations and 27 variables. To keep the examples simple, only the first 8 variables are used here (the Cars93_1 data frame).
4.2.1 Loading the packages
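library(MASS)   # provides the Cars93 data
library(dplyr)  # attached after MASS so that MASS::select() does not mask dplyr::select()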
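Both MASS and dplyr export a function named select(), and the package attached last masks the other. The following is a minimal sketch, assuming both packages are installed, of how to check which select() is currently active and how to call the dplyr version explicitly. The variable names select_owner and first_cols are illustrative only.

```r
# Assumption: MASS and dplyr are both installed.
library(MASS)
library(dplyr)

# Name of the package whose select() is found first on the search path.
# The result depends on the order in which the packages were attached
# in the current session.
select_owner <- environmentName(environment(select))
print(select_owner)

# Calling through the namespace works regardless of the attach order.
first_cols <- dplyr::select(MASS::Cars93, Manufacturer, Model)
print(dim(first_cols))
```

Using the dplyr:: prefix is a little more verbose, but it removes any dependence on the order of the library() calls.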
4.2.2 Inspecting the example data
head(MASS::Cars93)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
## 1 Acura Integra Small 12.9 15.9 18.8 25 31
## 2 Acura Legend Midsize 29.2 33.9 38.7 18 25
## 3 Audi 90 Compact 25.9 29.1 32.3 20 26
## 4 Audi 100 Midsize 30.8 37.7 44.6 19 26
## 5 BMW 535i Midsize 23.7 30.0 36.2 22 30
## 6 Buick Century Midsize 14.2 15.7 17.3 22 31
## AirBags DriveTrain Cylinders EngineSize Horsepower RPM
## 1 None Front 4 1.8 140 6300
## 2 Driver & Passenger Front 6 3.2 200 5500
## 3 Driver only Front 6 2.8 172 5500
## 4 Driver & Passenger Front 6 2.8 172 5500
## 5 Driver only Rear 4 3.5 208 5700
## 6 Driver only Front 4 2.2 110 5200
## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1 2890 Yes 13.2 5 177 102
## 2 2335 Yes 18.0 5 195 115
## 3 2280 Yes 16.9 5 180 102
## 4 2535 Yes 21.1 6 193 106
## 5 2545 Yes 21.1 4 186 109
## 6 2565 No 16.4 6 189 105
## Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
## 1 68 37 26.5 11 2705 non-USA Acura Integra
## 2 71 38 30.0 15 3560 non-USA Acura Legend
## 3 67 37 28.0 14 3375 non-USA Audi 90
## 4 70 37 31.0 17 3405 non-USA Audi 100
## 5 69 39 27.0 13 3640 non-USA BMW 535i
## 6 69 41 28.0 16 2880 USA Buick Century
- The data consist of 93 observations and 27 variables.
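The number of observations and variables can also be read directly rather than inferred from the printed rows. A minimal sketch using base R's dim(), nrow(), and ncol() (the variable names are illustrative):

```r
library(MASS)  # provides Cars93

cars_dim <- dim(Cars93)   # c(number of rows, number of columns)
n_obs    <- nrow(Cars93)  # number of observations
n_vars   <- ncol(Cars93)  # number of variables

print(cars_dim)  # 93 27
```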
Several other functions are also useful for inspecting the data, as shown below.
# Summary statistics for Cars93
summary(Cars93)
## Manufacturer Model Type Min.Price Price
## Chevrolet: 8 100 : 1 Compact:16 Min. : 6.70 Min. : 7.40
## Ford : 8 190E : 1 Large :11 1st Qu.:10.80 1st Qu.:12.20
## Dodge : 6 240 : 1 Midsize:22 Median :14.70 Median :17.70
## Mazda : 5 300E : 1 Small :21 Mean :17.13 Mean :19.51
## Pontiac : 5 323 : 1 Sporty :14 3rd Qu.:20.30 3rd Qu.:23.30
## Buick : 4 535i : 1 Van : 9 Max. :45.40 Max. :61.90
## (Other) :57 (Other):87
## Max.Price MPG.city MPG.highway AirBags
## Min. : 7.9 Min. :15.00 Min. :20.00 Driver & Passenger:16
## 1st Qu.:14.7 1st Qu.:18.00 1st Qu.:26.00 Driver only :43
## Median :19.6 Median :21.00 Median :28.00 None :34
## Mean :21.9 Mean :22.37 Mean :29.09
## 3rd Qu.:25.3 3rd Qu.:25.00 3rd Qu.:31.00
## Max. :80.0 Max. :46.00 Max. :50.00
##
## DriveTrain Cylinders EngineSize Horsepower RPM
## 4WD :10 3 : 3 Min. :1.000 Min. : 55.0 Min. :3800
## Front:67 4 :49 1st Qu.:1.800 1st Qu.:103.0 1st Qu.:4800
## Rear :16 5 : 2 Median :2.400 Median :140.0 Median :5200
## 6 :31 Mean :2.668 Mean :143.8 Mean :5281
## 8 : 7 3rd Qu.:3.300 3rd Qu.:170.0 3rd Qu.:5750
## rotary: 1 Max. :5.700 Max. :300.0 Max. :6500
##
## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
## Min. :1320 No :32 Min. : 9.20 Min. :2.000
## 1st Qu.:1985 Yes:61 1st Qu.:14.50 1st Qu.:4.000
## Median :2340 Median :16.40 Median :5.000
## Mean :2332 Mean :16.66 Mean :5.086
## 3rd Qu.:2565 3rd Qu.:18.80 3rd Qu.:6.000
## Max. :3755 Max. :27.00 Max. :8.000
##
## Length Wheelbase Width Turn.circle
## Min. :141.0 Min. : 90.0 Min. :60.00 Min. :32.00
## 1st Qu.:174.0 1st Qu.: 98.0 1st Qu.:67.00 1st Qu.:37.00
## Median :183.0 Median :103.0 Median :69.00 Median :39.00
## Mean :183.2 Mean :103.9 Mean :69.38 Mean :38.96
## 3rd Qu.:192.0 3rd Qu.:110.0 3rd Qu.:72.00 3rd Qu.:41.00
## Max. :219.0 Max. :119.0 Max. :78.00 Max. :45.00
##
## Rear.seat.room Luggage.room Weight Origin Make
## Min. :19.00 Min. : 6.00 Min. :1695 USA :48 Acura Integra: 1
## 1st Qu.:26.00 1st Qu.:12.00 1st Qu.:2620 non-USA:45 Acura Legend : 1
## Median :27.50 Median :14.00 Median :3040 Audi 100 : 1
## Mean :27.83 Mean :13.89 Mean :3073 Audi 90 : 1
## 3rd Qu.:30.00 3rd Qu.:15.00 3rd Qu.:3525 BMW 535i : 1
## Max. :36.00 Max. :22.00 Max. :4105 Buick Century: 1
## NA's :2 NA's :11 (Other) :87
DT::datatable(Cars93)
str(Cars93)
## 'data.frame': 93 obs. of 27 variables:
## $ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
## $ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
## $ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
## $ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
## $ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
## $ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
## $ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
## $ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
## $ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
## $ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
## $ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
## $ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
## $ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
## $ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
## $ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
## $ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
## $ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
## $ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
## $ Length : int 177 195 180 193 186 189 200 216 198 206 ...
## $ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
## $ Width : int 68 71 67 70 69 69 74 78 73 73 ...
## $ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
## $ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
## $ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
## $ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
## $ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
## $ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
dplyr::glimpse(Cars93)
## Rows: 93
## Columns: 27
## $ Manufacturer <fct> Acura, Acura, Audi, Audi, BMW, Buick, Buick, Bui...
## $ Model <fct> Integra, Legend, 90, 100, 535i, Century, LeSabre...
## $ Type <fct> Small, Midsize, Compact, Midsize, Midsize, Midsi...
## $ Min.Price <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 22.6, ...
## $ Price <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.7, ...
## $ Max.Price <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 24.9, ...
## $ MPG.city <int> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16, 25, ...
## $ MPG.highway <int> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25, 36, ...
## $ AirBags <fct> None, Driver & Passenger, Driver only, Driver & ...
## $ DriveTrain <fct> Front, Front, Front, Front, Rear, Front, Front, ...
## $ Cylinders <fct> 4, 6, 6, 6, 4, 4, 6, 6, 6, 8, 8, 4, 4, 6, 4, 6, ...
## $ EngineSize <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8, 4.9...
## $ Horsepower <int> 140, 200, 172, 172, 208, 110, 170, 180, 170, 200...
## $ RPM <int> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 4000, ...
## $ Rev.per.mile <int> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 1320, ...
## $ Man.trans.avail <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, No,...
## $ Fuel.tank.capacity <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 23.0, ...
## $ Passengers <int> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6, 7, ...
## $ Length <int> 177, 195, 180, 193, 186, 189, 200, 216, 198, 206...
## $ Wheelbase <int> 102, 115, 102, 106, 109, 105, 111, 116, 108, 114...
## $ Width <int> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74, 66, ...
## $ Turn.circle <int> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44, 38, ...
## $ Rear.seat.room <dbl> 26.5, 30.0, 28.0, 31.0, 27.0, 28.0, 30.5, 30.5, ...
## $ Luggage.room <int> 11, 15, 14, 17, 13, 16, 17, 21, 14, 18, 14, 13, ...
## $ Weight <int> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 4105, ...
## $ Origin <fct> non-USA, non-USA, non-USA, non-USA, non-USA, USA...
## $ Make <fct> Acura Integra, Acura Legend, Audi 90, Audi 100, ...
# Subset Cars93: keep the first 8 columns
Cars93_1 <- Cars93[, 1:8]
str(Cars93_1)
## 'data.frame': 93 obs. of 8 variables:
## $ Manufacturer: Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
## $ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
## $ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
## $ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
## $ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
## $ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
## $ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
## $ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
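The same subset can be produced with dplyr instead of base indexing. The sketch below is one possible equivalent, not the method used in this chapter. It selects the columns once by position and once by a name range. The names Cars93_1_dplyr and Cars93_1_named are illustrative.

```r
library(MASS)  # provides Cars93

# dplyr::select() is called with its namespace prefix because MASS also
# exports a function named select().
Cars93_1_dplyr <- dplyr::select(Cars93, 1:8)

# The same columns, selected as a contiguous range of names.
Cars93_1_named <- dplyr::select(Cars93, Manufacturer:MPG.highway)

print(names(Cars93_1_dplyr))
```

Selecting by name is usually more robust than selecting by position, since it does not break if the column order changes.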
Use str(Cars93_1) to inspect the structure of the Cars93_1 data frame.
The output shows the number of columns (variables), the column names, the number of observations, and a preview of the observations:
- Data structure: 'data.frame'
- Number of columns (variables): 8 variables
- Column (variable) names: $ Manufacturer, $ Model, $ Type, $ Min.Price, $ Price, $ Max.Price, $ MPG.city, $ MPG.highway (8 columns in total)
- Number of observations: 93 obs.
- Observation preview: for each column, the data type and the first few values are shown.
$ Manufacturer: Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
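Each item that str() prints can also be retrieved individually, which helps when the values are needed in code rather than on screen. A minimal sketch with base R functions (col_types is an illustrative name):

```r
library(MASS)  # provides Cars93
Cars93_1 <- Cars93[, 1:8]

class(Cars93_1)   # data structure: "data.frame"
ncol(Cars93_1)    # number of columns (variables): 8
names(Cars93_1)   # column (variable) names
nrow(Cars93_1)    # number of observations: 93

# Data type of each column, as a named character vector.
col_types <- sapply(Cars93_1, class)
print(col_types)
```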
The actual data (observations) can be browsed with View(Cars93_1).
View(Cars93_1)
- 8 columns (variables)
- 93 rows (observations)
4.2.3 Main functions of the dplyr package
The dplyr functions that operate on a single table (single-table verbs) are summarized in the table below.
dplyr verbs | description | similar {package} function |
---|---|---|
filter() | Filter rows with condition | {base} subset |
slice() | Filter rows with position | {base} subset |
arrange() | Re-order or arrange rows | {base} order |
select() | Select columns | {base} subset |
select(df, starts_with()) | Select columns that start with a prefix | |
select(df, ends_with()) | Select columns that end with a suffix | |
select(df, contains()) | Select columns that contain a character string | |
select(df, matches()) | Select columns that match a regular expression | |
select(df, one_of()) | Select columns that are from a group of names | |
select(df, num_range()) | Select columns whose names combine a prefix with a number in a given range (e.g. x1 to x5) | |
rename() | Rename column name | {reshape} rename |
distinct() | Extract distinct(unique) rows | {base} unique |
sample_n() | Random sample rows for a fixed number | {base} sample |
sample_frac() | Random sample rows for a fixed fraction | {base} sample |
mutate() | Create(add) new columns. mutate() allows you to refer to columns that you’ve just created. | {base} transform |
transmute() | Create(add) new columns. transmute() only keeps the new columns. | {base} transform |
summarise() | Summarise values | {base} summary |
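To make the table concrete, the sketch below applies several of the verbs to Cars93_1. It is only an illustration: the filter conditions, the new column names (Maker, Price.range, Range.ratio), and the result names are assumptions chosen for the example, not taken from the text.

```r
library(MASS)   # provides Cars93
library(dplyr)
Cars93_1 <- Cars93[, 1:8]

# filter(): keep rows that satisfy a condition
small_cheap <- filter(Cars93_1, Type == "Small", Price < 10)

# slice(): keep rows by position
first_three <- slice(Cars93_1, 1:3)

# arrange(): reorder rows, here by descending Price
by_price <- arrange(Cars93_1, desc(Price))

# select() with helpers; the dplyr:: prefix avoids MASS::select()
price_cols <- dplyr::select(Cars93_1, ends_with("Price"))
mpg_cols   <- dplyr::select(Cars93_1, starts_with("MPG"))

# rename(): new_name = old_name
renamed <- rename(Cars93_1, Maker = Manufacturer)

# distinct(): unique values of Type
types <- distinct(Cars93_1, Type)

# mutate(): add columns; Range.ratio refers to the just-created Price.range
with_range <- mutate(Cars93_1,
                     Price.range = Max.Price - Min.Price,
                     Range.ratio = Price.range / Price)

# transmute(): keep only the columns listed in the call
only_new <- transmute(Cars93_1, Model, Price.range = Max.Price - Min.Price)

# summarise(): collapse the table to a single row of summary values
price_summary <- summarise(Cars93_1,
                           mean_price   = mean(Price),
                           max_mpg_city = max(MPG.city))
print(price_summary)
```

Each verb takes a data frame as its first argument and returns a new data frame, so Cars93_1 itself is left unchanged by every call above.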