Title: | Alt String Implementation |
---|---|
Description: | Provides an extendable, performant and multithreaded 'alt-string' implementation backed by 'C++' vectors and strings. |
Authors: | Travers Ching [aut, cre, cph], Phillip Hazel [ctb] (Bundled PCRE2 code), Zoltan Herczeg [ctb, cph] (Bundled PCRE2 code), University of Cambridge [cph] (Bundled PCRE2 code), Tilera Corporation [cph] (Stack-less Just-In-Time compiler bundled with PCRE2), Yann Collet [ctb, cph] (Yann Collet is the author of the bundled xxHash code) |
Maintainer: | Travers Ching <[email protected]> |
License: | GPL-3 |
Version: | 0.16.0 |
Built: | 2024-11-24 03:05:18 UTC |
Source: | https://github.com/traversc/stringfish |
Converts a character vector to a stringfish vector
convert_to_sf(x)
sf_convert(x)
x | A character vector |
Converts a character vector to a stringfish vector. The opposite of 'materialize'.
The converted character vector
if(getRversion() >= "3.5.0") { x <- convert_to_sf(letters) }
Returns the type of the character vector
get_string_type(x)
x | The vector |
A function that returns the type of character vector. Possible values are "normal vector", "stringfish vector", "stringfish vector (materialized)" or "other alt-rep vector"
The type of vector
if(getRversion() >= "3.5.0") {
  x <- sf_vector(10)
  get_string_type(x) # returns "stringfish vector"
  x <- character(10)
  get_string_type(x) # returns "normal vector"
}
Materializes an alt-rep object
materialize(x)
x | An alt-rep object |
Materializes any alt-rep object and then returns it. Note: the object is materialized regardless of whether the return value is assigned to a variable.
x
if(getRversion() >= "3.5.0") {
  x <- sf_vector(10)
  sf_assign(x, 1, "hello world")
  sf_assign(x, 2, "another string")
  x <- materialize(x)
}
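As a rough illustration of how materialization shows up in practice, the sketch below records the reported vector type before and after calling materialize. It assumes the stringfish package is installed and R >= 3.5.0; the variable names are illustrative only.

```r
library(stringfish)

# Build a stringfish (alt-rep) vector from an ordinary character vector
x <- convert_to_sf(c("alpha", "beta"))
type_before <- get_string_type(x)

# Force materialization; the string contents should be unchanged
x <- materialize(x)
type_after <- get_string_type(x)

print(c(before = type_before, after = type_after))
```

Comparing the two reported types is a quick way to check whether an earlier operation has already materialized a vector.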
A function that generates random strings
random_strings(N, string_size = 50, charset = "abcdefghijklmnopqrstuvwxyz", vector_mode = "stringfish")
N | The number of strings to generate |
string_size | The length of the strings |
charset | The characters used to generate the random strings (default: abcdefghijklmnopqrstuvwxyz) |
vector_mode | The type of character vector to generate (either stringfish or normal, default: stringfish) |
Generates N random strings, each string_size characters long, built from the characters in charset. The result is a stringfish vector or a normal character vector, depending on vector_mode.
A character vector of the random strings
if(getRversion() >= "3.5.0") {
  set.seed(1)
  x <- random_strings(1e6, 80, "ACGT", vector_mode = "stringfish")
}
Assigns a new string to a stringfish vector or any other character vector
sf_assign(x, i, e)
x | The vector |
i | The index to assign to |
e | The new string to place at index i in x |
A function to assign a new element to an existing character vector. If the vector is a stringfish vector, it does so without materialization.
No return value, the function assigns an element to an existing stringfish vector
if(getRversion() >= "3.5.0") {
  x <- sf_vector(10)
  sf_assign(x, 1, "hello world")
  sf_assign(x, 2, "another string")
}
Pastes a series of strings together separated by the 'collapse' parameter
sf_collapse(x, collapse)
x | A character vector |
collapse | A single string |
This works the same way as 'paste0(x, collapse=collapse)'
A single string with all values in 'x' pasted together, separated by 'collapse'.
paste0, paste
if(getRversion() >= "3.5.0") {
  x <- c("hello", "\xe4\xb8\x96\xe7\x95\x8c")
  Encoding(x) <- "UTF-8"
  sf_collapse(x, " ") # "hello world" in Japanese
  sf_collapse(letters, "") # returns the alphabet
}
Returns a logical vector testing equality of strings from two string vectors
sf_compare(x, y, nthreads = getOption("stringfish.nthreads", 1L))
sf_equals(x, y, nthreads = getOption("stringfish.nthreads", 1L))
x | A character vector of length 1 or the same non-zero length as y |
y | Another character vector of length 1 or the same non-zero length as x |
nthreads | Number of threads to use |
Note: the function tests for both string and encoding equality
A logical vector
if(getRversion() >= "3.5.0") { sf_compare(letters, "a") }
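The nthreads argument defaults to getOption("stringfish.nthreads", 1L), so the thread count can be set either per call or once through options. The following is a minimal sketch of both routes, assuming stringfish is installed and that two threads are available on the machine; the input vectors are made up for illustration.

```r
library(stringfish)

x <- c("a", "b", "c")
y <- c("a", "x", "c")

# Uses getOption("stringfish.nthreads", 1L) to choose the thread count
res_default <- sf_compare(x, y)

# Change the default for later stringfish calls, remembering the old value
old <- options(stringfish.nthreads = 2L)
res_option <- sf_compare(x, y)   # picks up nthreads = 2L from the option
options(old)                     # restore the previous setting

# nthreads can also be passed explicitly for a single call
res_explicit <- sf_equals(x, y, nthreads = 2L)

print(res_default)
```

The thread count should only affect speed, so all three results are expected to agree.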
Appends vectors together
sf_concat(...)
sfc(...)
... | Any number of vectors, coerced to character vector if necessary |
A concatenated stringfish vector
if(getRversion() >= "3.5.0") { sf_concat(letters, 1:5) }
A function for detecting a pattern at the end of a string
sf_ends(subject, pattern, ...)
subject | A character vector |
pattern | A string to look for at the end |
... | Parameters passed to sf_grepl |
A logical vector: TRUE if there is a match, FALSE if there is no match, and NA if the subject was NA
endsWith, sf_starts
if(getRversion() >= "3.5.0") {
  x <- c("alpha", "beta", "gamma", "delta", "epsilon")
  sf_ends(x, "a")
}
A function that matches patterns and returns a logical vector
sf_grepl(subject, pattern, encode_mode = "auto", fixed = FALSE, nthreads = getOption("stringfish.nthreads", 1L))
subject | The subject character vector to search |
pattern | The pattern to search for |
encode_mode | "auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed | Determines whether the pattern parameter should be interpreted literally or as a regular expression |
nthreads | Number of threads to use |
The function uses the PCRE2 library, which is also used internally by R. The encoding is based on the pattern string (or forced via the encode_mode parameter). Note: the order of parameters is switched compared to the base R 'grepl' function, with subject being first. See also: https://www.pcre.org/current/doc/html/pcre2api.html for more documentation on match syntax.
A logical vector with the same length as subject
grepl
if(getRversion() >= "3.5.0") {
  x <- sf_vector(10)
  sf_assign(x, 1, "hello world")
  pattern <- "^hello"
  sf_grepl(x, pattern)
}
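To make the effect of the fixed parameter and the switched argument order concrete, here is a small sketch comparing regular-expression and literal matching, alongside the equivalent base R 'grepl' call. It assumes stringfish is installed; the subject strings are invented for illustration.

```r
library(stringfish)

subject <- c("a.c", "abc", "xyz")

# As a regular expression, "." matches any single character
as_regex <- sf_grepl(subject, "a.c")

# With fixed = TRUE, "." must match a literal dot
as_literal <- sf_grepl(subject, "a.c", fixed = TRUE)

# Base R takes the pattern first and the subject second
base_regex <- grepl("a.c", subject)

print(rbind(as_regex, as_literal, base_regex))
```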
A function that performs pattern substitution
sf_gsub(subject, pattern, replacement, encode_mode = "auto", fixed = FALSE, nthreads = getOption("stringfish.nthreads", 1L))
subject | The subject character vector to search |
pattern | The pattern to search for |
replacement | The replacement string |
encode_mode | "auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed | Determines whether the pattern parameter should be interpreted literally or as a regular expression |
nthreads | Number of threads to use |
The function uses the PCRE2 library, which is also used internally by R. However, the syntax may differ slightly; for example, capture groups are written "\1" in R but "$1" in PCRE2 (as in Perl). The encoding of the output is determined by the pattern (or forced using the encode_mode parameter), and encodings should be compatible: mixing ASCII and UTF-8 is fine, but mixing UTF-8 and latin1 is not. Note: the order of parameters is switched compared to the base R 'gsub' function, with subject being first. See also: https://www.pcre.org/current/doc/html/pcre2api.html for more documentation on match syntax.
A stringfish vector of the replacement string
gsub
if(getRversion() >= "3.5.0") {
  x <- "hello world"
  pattern <- "^hello (.+)"
  replacement <- "goodbye $1"
  sf_gsub(x, pattern, replacement)
}
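The difference in capture-group syntax is easiest to see side by side. The sketch below runs the same substitution through sf_gsub (with "$1") and base R 'gsub' (with "\\1" in R source); it assumes stringfish is installed, and the input strings are illustrative.

```r
library(stringfish)

x <- c("hello world", "hello there")

# PCRE2-style backreference in the replacement: $1
sf_result <- sf_gsub(x, "^hello (.+)", "goodbye $1")

# Base R backreference: \1, written "\\1" in R source; pattern comes first
base_result <- gsub("^hello (.+)", "goodbye \\1", x)

print(sf_result)
print(base_result)
```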
Converts encoding of one character vector to another
sf_iconv(x, from, to, nthreads = getOption("stringfish.nthreads", 1L))
x | An alt-rep object |
from | The encoding to assume of 'x' |
to | The new encoding |
nthreads | Number of threads to use |
This is an analogue to the base R function 'iconv'. It converts a string from one encoding (e.g. latin1 or UTF-8) to another.
the converted character vector as a stringfish vector
iconv
if(getRversion() >= "3.5.0") {
  x <- "fa\xE7ile"
  Encoding(x) <- "latin1"
  sf_iconv(x, "latin1", "UTF-8")
}
Returns a vector of the positions of x in table
sf_match(x, table, nthreads = getOption("stringfish.nthreads", 1L))
x | A character vector to search for in table |
table | A character vector to be matched against x |
nthreads | Number of threads to use |
Note: similarly to the base R function, long "table" vectors are not supported. This is due to the maximum integer value that can be returned ('.Machine$integer.max')
An integer vector of the indices of each x element's position in table
match
if(getRversion() >= "3.5.0") { sf_match("c", letters) }
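A short sketch of looking up several values at once, compared against base R 'match'. It assumes stringfish is installed and that, like the base function, sf_match reports NA for values not found in table; that last point is an assumption rather than something stated above.

```r
library(stringfish)

x <- c("c", "z", "not-a-letter")

# Positions of each element of x within the lookup table
sf_pos <- sf_match(x, letters)

# Base R equivalent for comparison
base_pos <- match(x, letters)

print(sf_pos)
print(base_pos)
```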
Counts the number of characters in a character vector
sf_nchar(x, type = "chars", nthreads = getOption("stringfish.nthreads", 1L))
x | A character vector |
type | The type of counting to perform ("chars" or "bytes", default: "chars") |
nthreads | Number of threads to use |
Returns the number of characters per string. The type of counting only matters for UTF-8 strings, where a character can be represented by multiple bytes.
An integer vector of the number of characters
nchar
if(getRversion() >= "3.5.0") {
  x <- "fa\xE7ile"
  Encoding(x) <- "latin1"
  x <- sf_iconv(x, "latin1", "UTF-8")
}
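To show where the two counting types diverge, the sketch below counts a UTF-8 string containing one two-byte character and compares the results with base R 'nchar'. It assumes stringfish is installed; the string is built with a Unicode escape purely for illustration.

```r
library(stringfish)

# "facile" with a c-cedilla; in UTF-8 that one character occupies two bytes
x <- enc2utf8("fa\u00e7ile")

n_chars <- sf_nchar(x, type = "chars")
n_bytes <- sf_nchar(x, type = "bytes")

# Base R counts for comparison
base_chars <- nchar(x, type = "chars")
base_bytes <- nchar(x, type = "bytes")

print(c(chars = n_chars, bytes = n_bytes))
```

For pure ASCII input the two counts coincide, so the type argument only matters once multi-byte characters are present.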
Pastes a series of strings together
sf_paste(..., sep = "", nthreads = getOption("stringfish.nthreads", 1L))
... | Any number of character vector strings |
sep | The separating string between strings |
nthreads | Number of threads to use |
This works the same way as 'paste(..., sep = sep)'. With the default sep = "", it behaves like 'paste0(...)'.
A character vector where elements of the arguments are pasted together
paste0, paste
if(getRversion() >= "3.5.0") {
  x <- letters
  y <- LETTERS
  sf_paste(x, y, sep = ":")
}
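The correspondence with base R pasting can be checked directly. This sketch compares sf_paste against base 'paste' with the same separator; it assumes stringfish is installed, and the input vectors are made up for illustration.

```r
library(stringfish)

a <- c("x", "y", "z")
b <- c("1", "2", "3")

# Element-wise pasting with an explicit separator
sf_result <- sf_paste(a, b, sep = "-")
base_result <- paste(a, b, sep = "-")

# With the default sep = "", the result should line up with paste0()
sf_nosep <- sf_paste(a, b)
base_nosep <- paste0(a, b)

print(sf_result)
```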
A function that reads a file line by line
sf_readLines(file, encoding = "UTF-8")
file | The file name |
encoding | The encoding to use (default: UTF-8) |
A function for reading in text data using 'std::ifstream'.
A stringfish vector of the lines in a file
readLines
if(getRversion() >= "3.5.0") {
  file <- tempfile()
  sf_writeLines(letters, file)
  sf_readLines(file)
}
A function to split strings by a delimiter
sf_split(subject, split, encode_mode = "auto", fixed = FALSE, nthreads = getOption("stringfish.nthreads", 1L))
subject | A character vector |
split | A delimiter to split the string by |
encode_mode | "auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed | Determines whether the split parameter should be interpreted literally or as a regular expression |
nthreads | Number of threads to use |
A list of stringfish character vectors
strsplit
if(getRversion() >= "3.5.0") {
  sf_split(datasets::state.name, "\\s") # split U.S. state names by any space character
}
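As a complement to the regular-expression example, the sketch below splits on a literal delimiter with fixed = TRUE and compares the result with base R 'strsplit'. It assumes stringfish is installed; the input strings are illustrative.

```r
library(stringfish)

x <- c("a,b,c", "d")

# Treat "," as a literal delimiter rather than a regular expression
parts <- sf_split(x, ",", fixed = TRUE)

# Base R equivalent for comparison
base_parts <- strsplit(x, ",", fixed = TRUE)

# One list element per input string
print(parts)
```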
A function for detecting a pattern at the start of a string
sf_starts(subject, pattern, ...)
subject | A character vector |
pattern | A string to look for at the start |
... | Parameters passed to sf_grepl |
A logical vector: TRUE if there is a match, FALSE if there is no match, and NA if the subject was NA
startsWith, sf_ends
if(getRversion() >= "3.5.0") {
  x <- c("alpha", "beta", "gamma", "delta", "epsilon")
  sf_starts(x, "a")
}
Extracts substrings from a character vector
sf_substr(x, start, stop, nthreads = getOption("stringfish.nthreads", 1L))
x | A character vector |
start | The position at which extraction begins |
stop | The position at which extraction ends |
nthreads | Number of threads to use |
This works the same way as 'substr', but in addition allows negative indexing. Negative indices count backwards from the end of the string, with -1 being the last character.
A stringfish vector of substrings
substr
if(getRversion() >= "3.5.0") {
  x <- c("fa\xE7ile", "hello world")
  Encoding(x) <- "latin1"
  x <- sf_iconv(x, "latin1", "UTF-8")
  sf_substr(x, 4, -1) # extracts from the 4th character to the last
  ## [1] "ile" "lo world"
}
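Negative indexing can be expressed in base R only by computing positions from the string length. The sketch below extracts the last five characters both ways; it assumes stringfish is installed, and the strings are made up for illustration.

```r
library(stringfish)

x <- c("hello world", "stringfish")

# -5 is the fifth character from the end, -1 is the last character
last_five <- sf_substr(x, -5, -1)

# Base R equivalent: derive the start position from the string length
base_equiv <- substr(x, nchar(x) - 4, nchar(x))

print(last_five)
print(base_equiv)
```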
A function converting a string to all lowercase
sf_tolower(x)
x | A character vector |
Note: the function only converts ASCII characters.
A stringfish vector where all uppercase is converted to lowercase
tolower
if(getRversion() >= "3.5.0") {
  x <- LETTERS
  sf_tolower(x)
}
A function converting a string to all uppercase
sf_toupper(x)
x | A character vector |
Note: the function only converts ASCII characters.
A stringfish vector where all lowercase is converted to uppercase
toupper
if(getRversion() >= "3.5.0") {
  x <- letters
  sf_toupper(x)
}
A function to remove leading/trailing whitespace
sf_trim(subject, which = c("both", "left", "right"), whitespace = "[ \\t\\r\\n]", ...)
subject | A character vector |
which | "both", "left", or "right"; determines which side's whitespace is removed |
whitespace | Whitespace characters (default: "[ \\t\\r\\n]") |
... | Parameters passed to sf_gsub |
A stringfish vector with the whitespace trimmed
trimws
if(getRversion() >= "3.5.0") {
  x <- c(" alpha ", " beta", " gamma ", "delta ", "epsilon ")
  sf_trim(x)
}
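The which parameter is easiest to understand by trimming the same input three ways. This sketch compares each mode with base R 'trimws', which uses the same default whitespace class; it assumes stringfish is installed, and the input strings are illustrative.

```r
library(stringfish)

x <- c("  alpha  ", "\tbeta\n")

left  <- sf_trim(x, which = "left")    # strip leading whitespace only
right <- sf_trim(x, which = "right")   # strip trailing whitespace only
both  <- sf_trim(x)                    # default: strip both sides

# Base R equivalents for comparison
base_left  <- trimws(x, which = "left")
base_right <- trimws(x, which = "right")
base_both  <- trimws(x)

print(both)
```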
Creates a new stringfish vector
sf_vector(len)
len | Length of the new vector |
This function creates a new stringfish vector, an alt-rep character vector backed by a C++ "std::vector" as the internal memory representation. The vector type is "sfstring", which is a simple C++ class containing a "std::string" and a single byte (uint8_t) representing the encoding.
A new (empty) stringfish vector
if(getRversion() >= "3.5.0") {
  x <- sf_vector(10)
  sf_assign(x, 1, "hello world")
  sf_assign(x, 2, "another string")
}
A function that writes a character vector to a file line by line
sf_writeLines(text, file, sep = "\n", na_value = "NA", encode_mode = "UTF-8")
text | A character vector to write to file |
file | Name of the file to write to |
sep | The line separator character(s) |
na_value | What to write in case of an NA string |
encode_mode | "UTF-8" or "byte". If "UTF-8", all strings are re-encoded as UTF-8. |
A function for writing text data using 'std::ofstream'.
writeLines
if(getRversion() >= "3.5.0") {
  file <- tempfile()
  sf_writeLines(letters, file)
  sf_readLines(file)
}
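The na_value parameter controls what ends up in the file for missing strings. The sketch below writes a vector containing an NA with a custom placeholder and reads it back; it assumes stringfish is installed, and the placeholder text and variable names are illustrative.

```r
library(stringfish)

path <- tempfile()
text <- c("first", NA, "third")

# Write NA entries as a visible placeholder instead of the default "NA"
sf_writeLines(text, path, na_value = "<missing>")

# Read the file back; the placeholder comes back as an ordinary string
lines_read <- sf_readLines(path)
print(lines_read)

unlink(path)  # clean up the temporary file
```

Because the placeholder is written as plain text, the round trip does not restore the original NA; choose a value that cannot be confused with real data.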
A stricter comparison of string equality
string_identical(x, y)
x | A character vector |
y | Another character vector to compare to x |
TRUE if strings are identical, including encoding
identical
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
y <- iconv(x, "latin1", "UTF-8")
identical(x, y) # TRUE
string_identical(x, y) # FALSE