Raw URL Coding in R - Roy Ratcliffe

R’s character vectors do not allow embedded nulls. Using R to encode and decode raw bytes requires some trickery.

Language Veneer

Underneath the language veneer, character vectors in R are null-terminated strings. Hence R cannot conveniently represent arbitrary byte strings that contain nulls. Evaluating the following literal throws an error.

"\0"

Merely trying to evaluate this character vector gives the following error in R. Note that this is not an ordinary exception that the programmer can catch around the literal; R raises it while parsing the source, before any code runs, so it exists as a language-level error.

Error: nul character not allowed (<input>:1:12)
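The error arises while R reads the source, so the literal cannot sit inside the same code that hopes to handle it. As a minimal sketch, assuming only base R’s parse and tryCatch, deferring the parse to run time makes the failure observable; the names src and msg are illustrative.

```r
# The R string below holds four characters: quote, backslash, zero, quote.
src <- "\"\\0\""

# Parsing at run time turns the parse failure into a condition that
# tryCatch can intercept.
msg <- tryCatch(
  {
    parse(text = src)
    "parsed without error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```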

So let’s say that you want to URL-encode and URL-decode byte strings, both with and without embedded nulls.

R provides URLencode and URLdecode but these carry the same limitation: no nulls allowed. The solution below encodes directly from raw vectors. Raw vectors represent vectors of byte values, any byte value, including nulls. So encoding and decoding directly to and from raw vectors overcomes the limitations of character vectors.
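The difference between the two vector types shows up in a few lines. The following is a small sketch using only base R; the byte values are arbitrary illustrations.

```r
# Four bytes: "h", "i", a null, and "!".
bytes <- as.raw(c(0x68, 0x69, 0x00, 0x21))
print(bytes)

# The raw vector keeps the null as an ordinary element.
n <- length(bytes)

# Converting to a single string fails because of the embedded null.
res <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
print(res)
```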

Raw URL Encoding

The following adapts base R’s URLencode (from the utils package), albeit not vectorised, meaning that it operates on just one raw vector. The input \(x\) is a single raw vector, i.e. type raw, of any length. It answers one character string representing the input, escaping the naughty bytes as %XX hexadecimal triples.

#' URL-encodes a raw vector.
#' @param x Raw vector to escape.
#' @param reserved Escapes reserved symbols too if TRUE.
#' @returns Escaped character vector.
#' @examples
#' # One raw null to escaped "%00" character vector.
#' rawURLencode(as.raw(0L))
#'
#' # Encodes reserved symbols. Answers "%23%21".
#' rawURLencode(charToRaw("#!"), reserved = TRUE)
#' @export
rawURLencode <- \(x, reserved = FALSE) {
  if (length(x) == 0L) {
    return("")
  }
  ok <- charToRaw(paste0(
    if (!reserved) {
      "][!$&'()*+,;=:/?@#"
    },
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "abcdefghijklmnopqrstuvwxyz0123456789._~-"
  ))
  paste(vapply(x, \(xx) {
    if (xx %in% ok) {
      rawToChar(xx)
    } else {
      paste0("%", toupper(as.character(xx)), collapse = "")
    }
  }, character(1L)), collapse = "")
}

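The implementation forms a vector of the non-escaped raw byte values, ok, which depends on reserved. When reserved is FALSE, the reserved symbols join the letters, digits and unreserved marks in ok and pass through unescaped; when TRUE, they are escaped like any other byte outside ok.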
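The same idea can be expressed without the per-byte vapply. The following is a sketch only, not the implementation above: encode_sketch and its safe argument are hypothetical names, and the safe set here is deliberately tiny for illustration.

```r
# Hypothetical helper, not part of the original code: encodes every byte
# to "%XX" first, then restores the bytes listed in `safe`.
encode_sketch <- function(x, safe) {
  if (length(x) == 0L) {
    return("")
  }
  out <- paste0("%", toupper(as.character(x)))
  keep <- x %in% safe
  # multiple = TRUE converts byte by byte; this assumes `safe` holds no null.
  out[keep] <- rawToChar(x[keep], multiple = TRUE)
  paste(out, collapse = "")
}

# Lower-case letters only, purely for demonstration.
safe <- charToRaw("abcdefghijklmnopqrstuvwxyz")
encoded <- encode_sketch(as.raw(c(0x68, 0x69, 0x00, 0x2f)), safe)
print(encoded)
```

With these inputs the two letters pass through and the null and slash bytes become escape triples.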

Raw URL Decoding

The operation in reverse appears below and applies a somewhat less functional approach along with a dash of hexadecimal jiggery-pokery.

#' URL-decodes an escaped string to a raw vector.
#' @param x Escaped character vector to decode.
#' @returns Non-escaped raw vector.
#' @examples
#' # No characters, no bytes.
#' rawURLdecode(character(0L))
#'
#' # Embedded null to raw vector with embedded 00 byte.
#' rawURLdecode("hello%00world")
#' @export
rawURLdecode <- \(x) {
  if (length(x) == 0L) {
    return(raw(0L))
  }
  x <- charToRaw(x)
  pc <- charToRaw("%")
  out <- raw(0L)
  i <- 1L
  while (i <= length(x)) {
    if (x[i] != pc) {
      out <- c(out, x[i])
      i <- i + 1L
    } else {
      out <- c(out, as.raw(paste0("0x", rawToChar(x[i + 1:2L]), collapse = "")))
      i <- i + 3L
    }
  }
  out
}

Decoding translates any %-escape liberally and emits coercion warnings, rather than stopping, if an escape fails to decode to a valid pair of hexadecimal digits. One edge case stands out: the decoder accepts a final escape with a single hexadecimal digit, e.g.

rawURLdecode("%f")

## [1] 0f

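The answer is 0F even though the character vector truncates the final escape triple. Indexing past the end of a raw vector yields a 00 byte, and rawToChar drops trailing nulls, so the lone digit still parses as hexadecimal.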
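If that leniency is unwanted, the hexadecimal conversion could be made strict. The following is a sketch under that assumption, using base R’s strtoi; hex_pair_to_raw is a hypothetical helper and not part of the decoder above.

```r
# Hypothetical helper: converts exactly two hexadecimal digits to one byte
# and stops on anything else, including a truncated escape.
hex_pair_to_raw <- function(pair) {
  if (!grepl("^[0-9A-Fa-f]{2}$", pair)) {
    stop("invalid hexadecimal pair: ", pair)
  }
  as.raw(strtoi(pair, base = 16L))
}

ok_byte <- hex_pair_to_raw("2F")
print(ok_byte)

# A single digit, as in the truncated "%f" escape, is rejected.
rejected <- tryCatch(hex_pair_to_raw("f"), error = function(e) "rejected")
print(rejected)
```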

Why Useful?

“Raw” coding of URLs has its use cases for arbitrary stretches of bytes that need to pass through communication transport layers that restrict the encoding. URLs might be one example, though typically not over HTTP. There could be others. Some Redis clients restrict the encoding, for example, although Redis itself does not: Redis strings are arbitrary byte sequences without encoding limitations. Clients that connect to Redis, however, may require UTF-8 character strings without embedded nulls or any other Unicode complications.

For such scenarios, URL encoding is a simple solution. Encoding directly from raw bytes to encoded escaped character sequences helps to bypass the limitations. The encoding becomes straightforward ASCII characters. The same escaping also facilitates the reverse operation when decoding from escaped characters to raw bytes with nothing in between.

Pros and cons exist; it goes without saying. Sizes expand for largely escaped vectors. Escaped "\1\2\3" encodes to \(3\times3=9\) characters, for instance; that amounts to a three-fold increase in encoded size for escaped bytes, although not all bytes need escaping. Still, it’s not the most compact option; other encodings could accomplish the same goal more efficiently. Nevertheless, it is quite simple and portable. Everything is a tradeoff.
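The size arithmetic is easy to check. The following sketch reproduces the escaping step with plain base R calls rather than the functions above; the variable names are illustrative.

```r
# Worst case: every byte needs escaping, so each byte becomes three characters.
x <- as.raw(1:3)
escaped <- paste0("%", toupper(as.character(x)), collapse = "")
print(escaped)
worst <- nchar(escaped)

# Best case: unreserved ASCII bytes pass through unchanged, one character each.
plain <- rawToChar(charToRaw("abc"))
best <- nchar(plain)
print(c(worst = worst, best = best))
```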