R’s character vectors do not allow embedded nulls. Using R to encode and decode raw bytes requires some trickery.
Language Veneer
Underneath the language veneer, character vectors in R are null-terminated strings. Hence the data scientist cannot conveniently encode arbitrary raw strings that contain nulls. See the following example. R throws an error.
"\0"
Merely trying to evaluate this character vector gives the following error in R. Note that this is not an ordinary run-time condition: the parser rejects the literal before any code runs, so a tryCatch wrapped around it in the same source cannot intercept it.
Error: nul character not allowed (<input>:1:12)
So let’s say that you want to encode and decode strings to URL-encoded bytes, both with and without embedded nulls.
R provides URLencode and URLdecode, but these carry the same limitation: no nulls allowed. The solution below encodes directly from raw vectors. Raw vectors represent vectors of byte values, any byte value, including nulls. So encoding and decoding directly to and from raw vectors overcomes the limitations of character vectors.
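As a minimal sketch of that difference, the snippet below builds a raw vector with an embedded null and shows what happens when it is pushed back into a character vector. The variable names (bytes, msg, chars) are illustrative only and do not appear elsewhere in this post.

```r
# A raw vector holds arbitrary byte values, including an embedded null.
bytes <- as.raw(c(0x68, 0x00, 0x69)) # "h", NUL, "i"
n_bytes <- length(bytes)

# Collapsing the bytes into one string fails because of the embedded null;
# tryCatch captures the error message rather than stopping the script.
msg <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))

# Per-byte conversion succeeds and maps the null byte to an empty string.
chars <- rawToChar(bytes, multiple = TRUE)

print(n_bytes)
print(msg)
print(chars)
```

The raw vector keeps all three bytes, while the single-string conversion is refused. That refusal is the gap the raw-based encoder and decoder are meant to cover.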
Raw URL Encoding
The following adapts the utils package’s URLencode, albeit not vectorised, meaning that it operates on just one raw vector. The input \(x\) is a single raw vector, i.e. type raw, of any length. It answers a length-one character vector representing the input, with the naughty bytes escaped.
#' URL-encodes a raw vector.
#' @param x Raw vector to escape.
#' @param reserved If TRUE, also escapes the reserved symbols.
#' @returns Escaped character vector.
#' @examples
#' # One raw null to escaped "%00" character vector.
#' rawURLencode(as.raw(0L))
#'
#' # Encodes reserved symbols. Answers "%23%21".
#' rawURLencode(charToRaw("#!"), reserved = TRUE)
#' @export
rawURLencode <- \(x, reserved = FALSE) {
  if (length(x) == 0L) {
    return("")
  }
  # Bytes that pass through unescaped; reserved symbols only if !reserved.
  ok <- charToRaw(paste0(
    if (!reserved) {
      "][!$&'()*+,;=:/?@#"
    },
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "abcdefghijklmnopqrstuvwxyz0123456789._~-"
  ))
  # Map each byte to itself or to its upper-case "%XX" escape.
  paste(vapply(x, \(xx) {
    if (xx %in% ok) {
      rawToChar(xx)
    } else {
      paste0("%", toupper(as.character(xx)), collapse = "")
    }
  }, character(1L)), collapse = "")
}
The implementation forms a vector of the non-escaped raw byte values, ok, which depends on reserved.
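The membership test and the hexadecimal escape can also be tried in isolation. The sketch below is not the function above but a vectorised variant of the same idea, using a deliberately shortened set of safe bytes. The names keep, hex and out are assumptions made for illustration.

```r
# Input bytes: "a", NUL, space, "~".
x <- as.raw(c(0x61, 0x00, 0x20, 0x7e))

# A shortened set of unescaped bytes, for illustration only.
ok <- charToRaw("abcdefghijklmnopqrstuvwxyz._~-")

# TRUE where the byte may pass through unescaped.
keep <- x %in% ok

# as.character() on raw gives two lower-case hex digits per byte.
hex <- paste0("%", toupper(as.character(x)))

# Start from the escapes, then put the safe bytes back as characters.
out <- hex
out[keep] <- rawToChar(x[keep], multiple = TRUE)
encoded <- paste(out, collapse = "")
print(encoded) # "a%00%20~"
```

Working on the whole vector at once avoids the per-byte closure call, at the cost of building the escape for every byte, including those that are later discarded.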
Raw URL Decoding
The operation in reverse appears below and applies a somewhat less functional approach along with a dash of hexadecimal jiggery-pokery.
#' URL-decodes a character vector to a raw vector.
#' @param x Escaped character vector to decode.
#' @returns Non-escaped raw vector.
#' @examples
#' # No characters, no bytes.
#' rawURLdecode(character(0L))
#'
#' # Embedded null to raw vector with embedded 00 byte.
#' rawURLdecode("hello%00world")
#' @export
rawURLdecode <- \(x) {
  if (length(x) == 0L) {
    return(raw(0L))
  }
  x <- charToRaw(x)
  pc <- charToRaw("%")
  out <- raw(0L)
  i <- 1L
  while (i <= length(x)) {
    if (x[i] != pc) {
      # Ordinary byte: copy it through.
      out <- c(out, x[i])
      i <- i + 1L
    } else {
      # Escape: read the next two bytes as a "0x"-prefixed hexadecimal pair.
      out <- c(out, as.raw(paste0("0x", rawToChar(x[i + 1:2L]), collapse = "")))
      i <- i + 3L
    }
  }
  out
}
Decoding translates any %-escape string liberally and raises coercion warnings, not errors, if the escapes fail to decode to valid hexadecimal digit pairs. One edge case exists where the decoder accepts a final single-hexadecimal escape, e.g.
rawURLdecode("%f")
## [1] 0f
The answer is 0F even though the character vector truncates the final escape triple.
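The truncated escape appears to decode because of two R behaviours: indexing past the end of a raw vector yields a 00 byte, and rawToChar drops trailing nulls. The sketch below walks through those steps by hand. It uses strtoi for the hexadecimal conversion, which is a substitute chosen for clarity here rather than the "0x" coercion used in the decoder, and the names pair, digits and byte are illustrative.

```r
# The escaped input, truncated after a single hexadecimal digit.
x <- charToRaw("%f")

# Positions 2 and 3: the second lies past the end and reads as 00.
pair <- x[2:3]
print(pair)

# The trailing null is dropped, leaving the lone digit "f".
digits <- rawToChar(pair)
print(digits)

# Parse the digit string as base 16; "f" is 15, i.e. byte 0f.
byte <- as.raw(strtoi(digits, base = 16L))
print(byte)
```

A stricter decoder could check that two hexadecimal digits really follow each % before converting, and stop with an error otherwise.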
Why Useful?
“Raw” coding of URLs has its use cases for arbitrary stretches of bytes that need to pass through communication transport layers that restrict the encoding. URLs might be one example, though typically not over HTTP. There could be others. Redis clients sometimes restrict the encoding for example. Redis itself does not. Redis strings are arbitrary byte sequences without encoding limitations. Clients that connect to Redis, however, may require UTF-8 character strings without embedded nulls or any other Unicode complications.
For such scenarios, URL encoding is a simple solution. Encoding directly from raw bytes to encoded escaped character sequences helps to bypass the limitations. The encoding becomes straightforward ASCII characters. The same escaping also facilitates the reverse operation when decoding from escaped characters to raw bytes with nothing in between.
Pros and cons exist, it goes without saying. Sizes expand for largely escaped vectors. Escaping "\1\2\3" yields "%01%02%03", that is \(3\times3=9\) characters, for instance; that amounts to a three-fold increase in encoded size for escaped bytes, although not all bytes need escaping. Still, it’s not the best; other encodings could accomplish the same goal more efficiently. Nevertheless, it is quite simple and portable. Everything is a tradeoff.
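The size arithmetic can be checked directly. The sketch below escapes three bytes that all need escaping and, as one assumed point of comparison, writes the same bytes as plain hexadecimal, which costs two characters per byte regardless of content. The names escaped and hex are illustrative.

```r
# Three bytes, none of which is a safe URL character.
x <- as.raw(1:3)

# Percent-escaping: three characters per escaped byte.
escaped <- paste0("%", toupper(as.character(x)), collapse = "")
print(escaped)        # "%01%02%03"
print(nchar(escaped)) # 9

# Plain hexadecimal: always two characters per byte.
hex <- paste(as.character(x), collapse = "")
print(hex)            # "010203"
print(nchar(hex))     # 6
```

Plain hexadecimal wins when most bytes need escaping, whereas percent-escaping wins when most bytes are safe ASCII and pass through at one character each.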