r - Removing all words except for words in a vector
It's common to remove stopwords from a text or character vector. I use the function removeWords from the tm package.
However, I'm trying to remove all words except for stopwords. I have a list of words I made called x. When I use
    removeWords(text, x)
I get this error:
    In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
      PCRE pattern compilation error 'regular expression is too large'
I've tried using grep:
    grep(x, text)
but that won't work, because x is a vector and not a single character string.
So, how can I remove all the words that aren't in that vector? Or alternatively, how can I select only the words in the vector?
If you want x as a regex pattern for grep, just use x <- paste(x, collapse = "|"), which will allow you to look for those words in text. Keep in mind that the regex might still be too large. If you want to remove any word that is not in stopwords(), you can create your own function:
    library(tm)  # provides stopwords()

    keep_stopwords <- function(text) {
      # Wrap every stop word in \\b so that only whole words match
      stop_regex <- paste(stopwords(), collapse = "\\b|\\b")
      stop_regex <- paste("\\b", stop_regex, "\\b", sep = "")
      # Split on spaces and test each token against the regex
      tmp <- strsplit(text, " ")[[1]]
      idx <- grepl(stop_regex, tmp)
      # Paste the matching tokens back into one string
      txt <- paste(tmp[idx], collapse = " ")
      return(txt)
    }

    text <- "how wood woodchuck if woodchuck chuck wood? more wood woodchucks chuck if woodchucks chuck wood, less wood other creatures termites."
    keep_stopwords(text)
    # [1] "how if more if other"
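Basically, we just set up stopwords() as a regex that will look for any of those words. We have to be careful about partial matches, so we wrap each stop word in \\b to ensure it's a full match. Then we split the string so that we match each word individually, and create an index of the words that are stop words. Finally, we paste those words together again and return them as a single string.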
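To see why the \\b wrapping matters, here is a minimal sketch that compares a pattern with and without word boundaries. The three-word keep vector is a hypothetical stand-in for stopwords() so that the snippet runs without tm, and the sample sentence is made up for illustration.

```r
# Hypothetical stand-in for stopwords(); the real tm list is much longer.
keep <- c("a", "if", "would")

# Without word boundaries, "a" also matches inside longer words.
loose_regex <- paste(keep, collapse = "|")

# With \\b on both sides of every alternative, only whole words match.
strict_regex <- paste0("\\b", keep, "\\b", collapse = "|")

# Split an illustrative sentence into tokens, one per word.
tokens <- strsplit("a woodchuck would chuck if able", " ")[[1]]

loose_hits  <- tokens[grepl(loose_regex, tokens)]
strict_hits <- tokens[grepl(strict_regex, tokens)]

print(loose_hits)   # includes "able", because it contains the letter "a"
print(strict_hits)  # only the tokens that are whole keep-words
```

The loose pattern keeps "able" by accident, while the bounded pattern keeps only "a", "would" and "if".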
Edit
Here's another approach, which is simpler and easier to understand. It also doesn't rely on regular expressions, which can be expensive in large documents.
    library(tm)  # provides stopwords()

    keep_words <- function(text, keep) {
      # Split on spaces and keep only the tokens found in the keep vector
      words <- strsplit(text, " ")[[1]]
      txt <- paste(words[words %in% keep], collapse = " ")
      return(txt)
    }

    x <- "how wood woodchuck chuck if woodchuck chuck wood? more wood woodchucks chuck if woodchucks chuck wood, less wood other creatures termites."
    keep_words(x, stopwords())
    # [1] "how if more if other"
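One thing to watch with the %in% version is that it compares tokens exactly, so a capitalised "How" or a trailing "wood?" will not match "how" or "wood". Below is a rough sketch of a variant that does the lookup on a cleaned copy of each token. The name keep_words_normalized and the my_words vector are made up for illustration and are not part of tm.

```r
# Hypothetical variant of keep_words(): match on lower-cased,
# punctuation-free copies of the tokens, but return the original tokens.
keep_words_normalized <- function(text, keep) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  cleaned <- tolower(gsub("[[:punct:]]", "", words))
  paste(words[cleaned %in% tolower(keep)], collapse = " ")
}

# Hypothetical keep list standing in for the asker's own vector.
my_words <- c("how", "if", "wood")

result <- keep_words_normalized(
  "How much wood would a woodchuck chuck if it could chuck wood?",
  my_words
)
print(result)
```

Note that stripping punctuation also removes apostrophes, so a contraction such as "don't" is compared as "dont". If the keep list contains contractions, it would need the same cleaning applied.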