r - Removing all words except for words in a vector


It's common to remove stopwords from a text or character vector. I use the function removeWords from the tm package to do so.

However, I'm trying to remove all words except for stopwords. I have a list of words that I made, called x. When I use

removeWords(text, x)

I get this error:

In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), PCRE pattern compilation error 'regular expression is too large'

I've tried using grep:

grep(x, text) 

But that won't work, because x is a vector, not a single character string.

So, how can I remove all words that aren't in that vector? Or alternatively, how can I select only the words in the vector?

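If you want x as a regex pattern for grep, just use x <- paste(x, collapse = "|"), which will allow you to look for those words in text. But keep in mind that the regex might still be too large. If you want to remove any word that is not a stopword(), you can create your own function: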

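library(tm)  # provides stopwords()

keep_stopwords <- function(text) {
  # Build one alternation in which every stop word is wrapped in \\b ... \\b
  stop_regex <- paste(stopwords(), collapse = "\\b|\\b")
  stop_regex <- paste("\\b", stop_regex, "\\b", sep = "")
  # Split on spaces and test each token against the pattern
  tmp <- strsplit(text, " ")[[1]]
  idx <- grepl(stop_regex, tmp)
  # Paste the matching tokens back into a single string
  txt <- paste(tmp[idx], collapse = " ")
  return(txt)
}

text <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_stopwords(text)
# [1] "would a if a could than most would if could but than other"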

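Basically, we just set up the stopwords() as a regex that will look for any of those words. We have to be careful about partial matches, so we wrap each stop word in \\b to ensure it's a full match. Then we split the string so that we match each word individually, and create an index of the words that are stop words. Finally, we paste those words together again and return them as a single string.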
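To see why the \\b anchors matter, here is a minimal base-R sketch. The three-word vector `stops` and the `tokens` vector are made up for illustration; `stops` only stands in for stopwords() and is not the tm list.

```r
# Hypothetical mini stop list, standing in for tm::stopwords()
stops  <- c("a", "if", "other")
tokens <- c("a", "creatures", "if", "wood", "other")

# Plain alternation: "a" also matches inside "creatures"
loose <- paste(stops, collapse = "|")
grepl(loose, tokens)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE

# Each alternative wrapped in word boundaries: whole-word matches only
strict <- paste0("\\b", stops, "\\b", collapse = "|")
grepl(strict, tokens)
# [1]  TRUE FALSE  TRUE FALSE  TRUE

kept <- tokens[grepl(strict, tokens)]
kept
# [1] "a"     "if"    "other"
```

With the loose pattern, "creatures" would wrongly be kept, which is the partial-match problem the boundaries avoid.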

Edit

Here's another approach that is simpler and easier to understand. It also doesn't rely on regular expressions, which can be expensive in large documents.

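library(tm)  # provides stopwords()

keep_words <- function(text, keep) {
  # Split on spaces and keep only the tokens that appear in `keep`
  words <- strsplit(text, " ")[[1]]
  txt <- paste(words[words %in% keep], collapse = " ")
  return(txt)
}

x <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_words(x, stopwords())
# [1] "would a if a could than most would if could but than other"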
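One thing to keep in mind is that %in% compares tokens exactly, so a token such as "wood," or a capitalized "If" never matches a lowercase entry in keep. If that matters for your text, one possible variant is sketched below. The name keep_words_normalized and the sample inputs are hypothetical, and the normalization (lowercasing, trimming punctuation at the token edges) is an assumption about what you want rather than anything tm does for you.

```r
# Sketch: like keep_words(), but ignores case and surrounding punctuation.
# keep_words_normalized() is a hypothetical helper, not part of tm.
keep_words_normalized <- function(text, keep) {
  words <- strsplit(text, "\\s+")[[1]]
  # Lowercase and trim leading/trailing punctuation, for comparison only
  bare <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tolower(words))
  # Return the original tokens whose normalized form is in `keep`
  paste(words[bare %in% keep], collapse = " ")
}

result <- keep_words_normalized("If wood, then other.", c("if", "then", "other"))
result
# [1] "If then other."
```

The original tokens are returned unchanged, so punctuation and capitalization in the output are whatever the input had.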
