2.3 Factor

2.3.1 Categorical data

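When working with categorical data in R, you need to be able to tell it apart from character data.

Check up front whether a variable is categorical or character, and convert it to whichever type suits the analysis. This is all part of data preprocessing.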

Put simply: vector (integer codes) + levels (character) = factor.
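
The "codes + levels" picture can be checked directly. The following is a minimal sketch using only base R; the variable names (f, codes, lv, rebuilt) are made up for illustration.

```r
# A factor is stored as integer codes plus a character vector of levels.
f <- factor(c("yes", "no", "yes"))

codes <- as.integer(f)   # position of each element within levels(f)
lv    <- levels(f)       # the distinct labels, sorted by default

print(codes)             # 2 1 2
print(lv)                # "no" "yes"

# Indexing the levels by the codes recovers the original strings.
rebuilt <- lv[codes]
print(rebuilt)           # "yes" "no" "yes"
```

Because the codes are only positions in the levels vector, changing the order of the levels changes the codes but not the displayed values.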

2.3.2 Components and characteristics of a factor

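levels

The integer codes of a factor index into its levels. (The levels are themselves a vector.)
Levels are stored as character strings by default.
The first level is treated as the baseline (reference) level in linear modeling, e.g. levels = c("yes", "no").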
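
To see which level is the baseline, and to change it without retyping all the levels, base R offers relevel() and contrasts(). A small sketch follows; the names answer and answer2 are made up, and it assumes R's default treatment contrasts are in effect.

```r
answer <- factor(c("yes", "no", "yes", "no"))
print(levels(answer))        # "no" "yes": "no" comes first, so it is the baseline

# Move "yes" to the front so that it becomes the baseline instead.
answer2 <- relevel(answer, ref = "yes")
print(levels(answer2))       # "yes" "no"

# The baseline row is all zeros in the dummy (treatment) coding.
print(contrasts(answer2))
```

Passing levels = c("yes", "no") to factor() at creation time has the same effect as relevel() here.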

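Characteristics:

Storing a nominal variable as a factor can save memory.
- e.g. rather than storing "MALE", "MALE", "FEMALE", ..., it is more memory-efficient to store 1, 1, 2, ... and tie the codes to labels through the levels (1 = MALE, 2 = FEMALE).
Unlike a plain vector, a factor lets you define its levels.
Assigning to levels() changes every matching value in one step.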
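
Both points can be tried out with a rough sketch like the one below. The data are randomly generated for illustration, the variable names are made up, and the exact byte counts depend on the platform and R version, so only the comparison is meaningful.

```r
set.seed(1)
chr <- sample(c("MALE", "FEMALE"), 1e5, replace = TRUE)   # character vector
fct <- factor(chr, levels = c("MALE", "FEMALE"))          # same data as a factor

size_chr <- as.numeric(object.size(chr))   # bytes used by the character vector
size_fct <- as.numeric(object.size(fct))   # bytes used by the factor
print(c(character = size_chr, factor = size_fct))

# Renaming one level relabels every element that carries it.
levels(fct)[levels(fct) == "MALE"] <- "M"
print(head(fct))
```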

2.3.3 Signature of the `factor()` function

The basic form of factor():

factor(x = character(), levels, labels = levels,
       exclude = NA, ordered = is.ordered(x), nmax = NA)

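where,
x       : a data vector, usually taking a small number of distinct values
levels  : the unique values (as character strings) that x might take. The default is the unique set of values of as.character(x), sorted in increasing order of x.
Note that this set can be specified as smaller than sort(unique(x)); values of x that are not in levels become NA.

labels  : the labels actually displayed for the levels
exclude : a vector of values to exclude when forming the set of levels
ordered : logical flag; whether the levels should be regarded as ordered (in the order given)
nmax    : an upper bound on the number of levels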
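
As an illustration of how levels and labels work together, here is a minimal sketch with made-up numeric codes. The value 3 is deliberately left out of levels to show what happens to values outside the declared set.

```r
codes <- c(1, 1, 2, 3)

# levels says which raw values are valid; labels says how to display them.
sex <- factor(codes, levels = c(1, 2), labels = c("MALE", "FEMALE"))

print(sex)           # the 3 is not among the levels, so it becomes NA
print(levels(sex))   # "MALE" "FEMALE"
```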

2.3.4 Creating factors

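# schematic form; the upper-case names are placeholders, not real objects
x <- factor(CHARACTER_VECTOR, levels = LEVELS_AS_CHARACTER_VECTOR, ordered = TRUE_OR_FALSE)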

2.3.4.1 Creating a factor

factor(c("yes", "no", "yes"))    # by default, levels are sorted in increasing order

## [1] yes no  yes
## Levels: no yes

2.3.4.2 Assigning levels to a factor

factor(c("yes", "no", "yes"), levels = c("yes", "no"))

## [1] yes no  yes
## Levels: yes no

2.3.4.3 Ordering the levels

x <- factor(c("yes", "no", "yes"), levels = c("yes", "no"), ordered = TRUE)    # the first level listed is the lowest (base) level
x

## [1] yes no  yes
## Levels: yes < no

levels(x)[1:2]

## [1] "yes" "no"

levels(x)[1:2] <- "yes"
levels(x)

## [1] "yes"

x                             # assigning to levels() relabels every element; here both levels were merged into "yes"

## [1] yes yes yes
## Levels: yes
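
What an ordered factor adds over a plain one is that its values can be compared. A short sketch with a made-up three-level variable (the names size, below_high and largest are hypothetical):

```r
size <- factor(c("low", "high", "mid"),
               levels = c("low", "mid", "high"), ordered = TRUE)

# Comparisons follow the order of the levels, not alphabetical order.
below_high <- size < "high"
print(below_high)    # TRUE FALSE TRUE

# max() and min() are also defined for ordered factors.
largest <- max(size)
print(largest)
```

The same comparison on an unordered factor returns NA with a warning, which is a quick way to notice that ordered = TRUE was forgotten.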

2.3.4.4 Using the exclude argument of factor()

x <- factor(c("yes", "no", "yes", "yeah"), 
            levels = c("yes", "no", "yeah"), 
            ordered = TRUE, 
            exclude = "yeah")
x           # values listed in exclude are dropped from the levels and become NA

## [1] yes  no   yes  <NA>
## Levels: yes < no

2.3.4.5 Using the `addNA()` function

Use addNA() to add NA to the levels:

addNA(x, ifany = FALSE)     # use this when NA should appear among the levels

## [1] yes  no   yes  <NA>
## Levels: yes < no < <NA>
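
One practical reason to make NA a level is to have it counted in tabulations. A minimal sketch, using a small made-up factor rather than the x above:

```r
answers <- factor(c("yes", "no", "yes", NA))

print(table(answers))          # NA is silently left out of the counts

with_na <- addNA(answers)      # NA becomes an explicit level
counts  <- table(with_na)
print(counts)                  # now includes a count for NA
```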

2.3.4.6 Understanding factors through `tapply()`

age <- c(43,35,34,37,28,30,29,25,27,36,24,36,26,28,20)
gender <- factor(c('M','F','F','M','M','F','F','M','F','F','M','M','M','F','M'))
sal <- seq(100, 200, length.out = 15)
emp <- data.frame(age, gender, sal)  

# age band code: 3 = 35 and over, 2 = 30-34, 1 = 25-29, 0 = under 25
emp$over30 <- ifelse(emp$age >= 35, 3, ifelse(emp$age >= 30, 2, ifelse(emp$age >= 25, 1, 0)))
emp$over30 <- as.factor(emp$over30)
str(emp)

## 'data.frame':    15 obs. of  4 variables:
##  $ age   : num  43 35 34 37 28 30 29 25 27 36 ...
##  $ gender: Factor w/ 2 levels "F","M": 2 1 1 2 2 1 1 2 1 1 ...
##  $ sal   : num  100 107 114 121 129 ...
##  $ over30: Factor w/ 4 levels "0","1","2","3": 4 4 3 4 2 3 2 2 2 4 ...

round(tapply(emp$sal, list(emp$gender, emp$over30), mean))   # mean salary by gender and age band; NA marks empty combinations

##     0   1   2   3
## F  NA 164 125 136
## M 186 155  NA 133
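
The nested ifelse() used to build the age bands can also be written with cut(), which returns a factor directly. This is an alternative sketch, not the approach used above; the break points are chosen to reproduce the same bands, and the names band_ifelse and band_cut are made up.

```r
age <- c(43, 35, 34, 37, 28, 30, 29, 25, 27, 36, 24, 36, 26, 28, 20)

# Bands as built above, with nested ifelse().
band_ifelse <- ifelse(age >= 35, 3, ifelse(age >= 30, 2, ifelse(age >= 25, 1, 0)))

# The same bands with cut(): right = FALSE makes each interval [lower, upper).
band_cut <- cut(age,
                breaks = c(-Inf, 25, 30, 35, Inf),
                right  = FALSE,
                labels = c("0", "1", "2", "3"))

print(table(band_cut))
print(all(as.character(band_cut) == as.character(band_ifelse)))
```

With cut() the breaks sit in one place, which tends to be easier to adjust than a chain of nested conditions.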

2.3.4.7 Understanding factors through `lm()`

library(readr)    # provides read_csv()
hsb2 <- read_csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_double(),
##   female = col_double(),
##   race = col_double(),
##   ses = col_double(),
##   schtyp = col_double(),
##   prog = col_double(),
##   read = col_double(),
##   write = col_double(),
##   math = col_double(),
##   science = col_double(),
##   socst = col_double()
## )

str(hsb2)

## tibble [200 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id     : num [1:200] 70 121 86 141 172 113 50 11 84 48 ...
##  $ female : num [1:200] 0 1 0 0 0 0 0 0 0 0 ...
##  $ race   : num [1:200] 4 4 4 4 4 4 3 1 4 3 ...
##  $ ses    : num [1:200] 1 2 3 3 2 2 2 2 2 2 ...
##  $ schtyp : num [1:200] 1 1 1 1 1 1 1 1 1 1 ...
##  $ prog   : num [1:200] 1 3 1 3 2 2 1 2 1 2 ...
##  $ read   : num [1:200] 57 68 44 63 47 44 50 34 63 57 ...
##  $ write  : num [1:200] 52 59 33 44 52 52 59 46 57 55 ...
##  $ math   : num [1:200] 41 53 54 47 57 51 42 45 54 52 ...
##  $ science: num [1:200] 47 63 58 53 53 63 53 39 58 50 ...
##  $ socst  : num [1:200] 57 61 31 56 61 61 61 36 51 51 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   female = col_double(),
##   ..   race = col_double(),
##   ..   ses = col_double(),
##   ..   schtyp = col_double(),
##   ..   prog = col_double(),
##   ..   read = col_double(),
##   ..   write = col_double(),
##   ..   math = col_double(),
##   ..   science = col_double(),
##   ..   socst = col_double()
##   .. )

# without converting the race column to a factor: race is treated as a single numeric predictor
summary(lm(write ~ race, data = hsb2))

## 
## Call:
## lm(formula = write ~ race, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.919  -5.912   1.091   8.082  17.100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.8941     2.2652  20.260  < 2e-16 ***
## race          2.0061     0.6322   3.173  0.00175 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.27 on 198 degrees of freedom
## Multiple R-squared:  0.0484, Adjusted R-squared:  0.04359 
## F-statistic: 10.07 on 1 and 198 DF,  p-value: 0.001747

# results after creating a factor variable from the race column
hsb2$race.f <- factor(hsb2$race)    # race.f: factor version of the race column
is.factor(hsb2$race.f)              # check that race.f is a factor

## [1] TRUE

hsb2$race.f[1:15]                   # first 15 elements of race.f

##  [1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3
## Levels: 1 2 3 4

summary(lm(write ~ race.f, data = hsb2))  # race.f enters as dummy variables for levels 2-4, with level 1 as the baseline

## 
## Call:
## lm(formula = write ~ race.f, data = hsb2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0552  -5.4583   0.9724   7.0000  18.8000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   46.458      1.842  25.218  < 2e-16 ***
## race.f2       11.542      3.286   3.512 0.000552 ***
## race.f3        1.742      2.732   0.637 0.524613    
## race.f4        7.597      1.989   3.820 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.025 on 196 degrees of freedom
## Multiple R-squared:  0.1071, Adjusted R-squared:  0.0934 
## F-statistic: 7.833 on 3 and 196 DF,  p-value: 5.785e-05

# ggplot(hsb2, aes(race.f, write)) + geom_point() + stat_smooth(method=lm, level = 0.95)
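
The coefficients race.f2 to race.f4 in the output above are offsets from the baseline level 1, not slopes. A minimal sketch on made-up toy data (not hsb2; the names toy, group, y and fit are hypothetical) shows how lm() expands a factor into dummy columns, assuming R's default treatment contrasts:

```r
# Toy data: a three-group numeric code and a numeric response.
toy <- data.frame(
  group = c(1, 1, 2, 2, 3, 3),
  y     = c(10, 12, 20, 22, 15, 17)
)
toy$group.f <- factor(toy$group)

# The design matrix has one 0/1 column per non-baseline level.
mm <- model.matrix(y ~ group.f, data = toy)
print(mm)

# Intercept = mean of the baseline group; other coefficients = differences from it.
fit <- lm(y ~ group.f, data = toy)
print(coef(fit))
print(tapply(toy$y, toy$group.f, mean))
```

Comparing the coefficients with the group means makes the interpretation concrete: the intercept matches the mean of group 1, and each remaining coefficient is that group's mean minus the mean of group 1.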

2.3.4.8 Using factor() inside the formula instead of creating a separate factor variable

hsb2 <- read.csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")

summary(lm(write ~ factor(race), data = hsb2))

## 
## Call:
## lm(formula = write ~ factor(race), data = hsb2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0552  -5.4583   0.9724   7.0000  18.8000 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     46.458      1.842  25.218  < 2e-16 ***
## factor(race)2   11.542      3.286   3.512 0.000552 ***
## factor(race)3    1.742      2.732   0.637 0.524613    
## factor(race)4    7.597      1.989   3.820 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.025 on 196 degrees of freedom
## Multiple R-squared:  0.1071, Adjusted R-squared:  0.0934 
## F-statistic: 7.833 on 3 and 196 DF,  p-value: 5.785e-05

# ggplot(hsb2, aes(race, write)) + geom_point() + stat_smooth(aes(race, write), method=lm, level = 0.95)
