Strings

Strings are finite sequences of characters. Of course, the real trouble comes when one asks what a character is. The characters that English speakers are familiar with are the letters A, B, C, etc., together with numerals and common punctuation symbols. These characters are standardized together with a mapping to integer values between 0 and 127 by the ASCII standard. Julia's built-in strings use the UTF-8 encoding, which represents ASCII text unchanged. (A transcode function is provided to convert to/from other Unicode encodings.)

  • All string types are subtypes of the abstract type AbstractString, and external packages define additional AbstractString subtypes (e.g. for other encodings). If you define a function expecting a string argument, you should declare the type as AbstractString in order to accept any string type.
  • Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing a single character, called AbstractChar. The built-in Char subtype of AbstractChar is a 32-bit primitive type that can represent any Unicode character (and which is based on the UTF-8 encoding).
  • As in Java, strings are immutable: the value of an AbstractString object cannot be changed. To construct a different string value, you construct a new string from parts of other strings.
  • Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
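
As a minimal sketch of the first point above, the hypothetical function shout below (it is not part of Julia) is declared on AbstractString, so it accepts both a String and a SubString:

```julia
# Hypothetical helper: accepts any string type, not just String.
shout(s::AbstractString) = uppercase(s) * "!"

owned = "hello"                          # a String
viewed = SubString("hello world", 1, 5)  # a SubString{String}

println(shout(owned))   # HELLO!
println(shout(viewed))  # HELLO!
```
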

    Characters

    A Char value represents a single character: it is just a 32-bit primitive type with a special literal representation and appropriate arithmetic behaviors, and which can be converted to a numeric value representing a Unicode code point. (Julia packages may define other subtypes of AbstractChar, e.g. to optimize operations for other text encodings.) Here is how Char values are input and shown (note that character literals are delimited with single quotes, not double quotes):

    julia> c = 'x'
    'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
    
    julia> typeof(c)
    Char

    You can easily convert a Char to its integer value, i.e. code point:

    julia> c = Int('x')
    120
    
    julia> typeof(c)
    Int64

    On 32-bit architectures, typeof(c) will be Int32. You can convert an integer value back to a Char just as easily:

    julia> Char(120)
    'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

    Not all integer values are valid Unicode code points, but for performance, the Char conversion does not check that every character value is valid. If you want to check that each converted value is a valid code point, use the isvalid function:

    julia> Char(0x110000)
    '\U110000': Unicode U+110000 (category In: Invalid, too high)
    
    julia> isvalid(Char, 0x110000)
    false

    As of this writing, the valid Unicode code points are U+0000 through U+D7FF and U+E000 through U+10FFFF. These have not all been assigned intelligible meanings yet, nor are they necessarily interpretable by applications, but all of these values are considered to be valid Unicode characters.
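
    Because Char conversion does not validate its argument, one possible pattern is to check with isvalid first. The sketch below assumes a hypothetical helper named safe_char (not part of Julia) that returns nothing for values outside the valid ranges listed above:

```julia
# Hypothetical helper: convert an unsigned integer to a Char only if it is
# a valid Unicode code point; otherwise return nothing.
safe_char(n::Unsigned) = isvalid(Char, n) ? Char(n) : nothing

println(safe_char(0x78) == 'x')            # true
println(safe_char(0xd800) === nothing)     # true: surrogate range
println(safe_char(0x110000) === nothing)   # true: above U+10FFFF
```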

    You can input any Unicode character in single quotes using \u followed by up to four hexadecimal digits or \U followed by up to eight hexadecimal digits (the longest valid value only requires six):

    julia> '\u0'
    '\0': ASCII/Unicode U+0000 (category Cc: Other, control)
    
    julia> '\u78'
    'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
    
    julia> '\u2200'
    '∀': Unicode U+2200 (category Sm: Symbol, math)
    
    julia> '\U10ffff'
    '\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned)

    Julia uses your system's locale and language settings to determine which characters can be printed as-is and which must be output using the generic, escaped \u or \U input forms. In addition to these Unicode escape forms, all of C's traditional escaped input forms can also be used:

    julia> Int('\0')
    0
    
    julia> Int('\t')
    9
    
    julia> Int('\n')
    10
    
    julia> Int('\e')
    27
    
    julia> Int('\x7f')
    127
    
    julia> Int('\177')
    127

    You can do comparisons and a limited amount of arithmetic with Char values:

    julia> 'A' < 'a'
    true
    
    julia> 'A' <= 'a' <= 'Z'
    false
    
    julia> 'A' <= 'X' <= 'Z'
    true
    
    julia> 'x' - 'a'
    23
    
    julia> 'A' + 1
    'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
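
    The comparison and arithmetic operations above are enough for small character utilities. The helper names below (digit_value, nth_letter, is_ascii_upper) are hypothetical and only illustrate the idea:

```julia
# Subtracting two Chars yields the integer distance between their code points.
digit_value(c::Char) = c - '0'

# Adding an integer to a Char yields another Char.
nth_letter(n::Integer) = 'a' + (n - 1)

# Chained comparisons test whether a character lies in a range.
is_ascii_upper(c::Char) = 'A' <= c <= 'Z'

println(digit_value('7'))      # 7
println(nth_letter(3))         # c
println(is_ascii_upper('x'))   # false
```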

    String Basics

    String literals are delimited by double quotes or triple double quotes (not single quotes):

    julia> str = "Hello, world.\n"
    "Hello, world.\n"
    
    julia> """Contains "quote" characters"""
    "Contains \"quote\" characters"

    Long lines in strings can be broken up by preceding the newline with a backslash (\):

    julia> "This is a long \
           line"
    "This is a long line"

    If you want to extract a character from a string, you index into it:

    julia> str[begin]
    'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
    
    julia> str[1]
    'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
    
    julia> str[6]
    ',': ASCII/Unicode U+002C (category Po: Punctuation, other)
    
    julia> str[end]
    '\n': ASCII/Unicode U+000A (category Cc: Other, control)

    Many Julia objects, including strings, can be indexed with integers. The index of the first element (the first character of a string) is returned by firstindex(str), and the index of the last element (character) with lastindex(str). The keywords begin and end can be used inside an indexing operation as shorthand for the first and last indices, respectively, along the given dimension. String indexing, like most indexing in Julia, is 1-based: firstindex always returns 1 for any AbstractString. As we will see below, however, lastindex(str) is not in general the same as length(str) for a string, because some Unicode characters can occupy multiple "code units".

    You can perform arithmetic and other operations with end, just like a normal value:

    julia> str[end-1]
    '.': ASCII/Unicode U+002E (category Po: Punctuation, other)
    
    julia> str[end÷2]
    ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

    Using an index less than begin (1) or greater than end raises an error:

    julia> str[begin-1]
    ERROR: BoundsError: attempt to access 14-codeunit String at index [0]
    [...]
    
    julia> str[end+1]
    ERROR: BoundsError: attempt to access 14-codeunit String at index [15]
    [...]

    You can also extract a substring using range indexing:

    julia> str[4:9]
    "lo, wo"

    Notice that the expressions str[k] and str[k:k] do not give the same result:

    julia> str[6]
    ',': ASCII/Unicode U+002C (category Po: Punctuation, other)
    
    julia> str[6:6]
    ","

    The former is a single character value of type Char, while the latter is a string value that happens to contain only a single character. In Julia these are very different things.

    Range indexing makes a copy of the selected part of the original string. Alternatively, it is possible to create a view into a string using the type SubString. More simply, using the @views macro on a block of code converts all string slices into substrings. For example:

    julia> str = "long string"
    "long string"
    
    julia> substr = SubString(str, 1, 4)
    "long"
    
    julia> typeof(substr)
    SubString{String}
    
    julia> @views typeof(str[1:4]) # @views converts slices to SubStrings
    SubString{String}

    Several standard functions like chop, chomp or strip return a SubString.
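
    The difference between a copied slice and a view can be checked directly. This is only an illustrative sketch; the variable names are arbitrary:

```julia
str = "long string"

copied = str[1:4]               # range indexing allocates a new String
viewed = SubString(str, 1, 4)   # a view into str, no copy of the data

println(copied == viewed)       # true: same contents
println(copied isa String)      # true
println(viewed isa SubString)   # true

# Functions such as strip return a SubString; String(...) makes an
# independent String when one is needed.
stripped = strip("  padded  ")
owned = String(stripped)
println(stripped isa SubString) # true
println(owned isa String)       # true
```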

    Unicode and UTF-8

    Julia fully supports Unicode characters and strings. As discussed above, in character literals, Unicode code points can be represented using Unicode \u and \U escape sequences, as well as all the standard C escape sequences. These can likewise be used to write string literals:

    julia> s = "\u2200 x \u2203 y"
    "∀ x ∃ y"

    Whether these Unicode characters are displayed as escapes or shown as special characters depends on your terminal's locale settings and its support for Unicode. String literals are encoded using the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes ("code units"). In UTF-8, ASCII characters (i.e. those with code points less than 0x80 (128)) are encoded as they are in ASCII, using a single byte, while code points 0x80 and above are encoded using multiple bytes, up to four per character.

    String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that are used to encode arbitrary characters (code points). This means that not every index into a String is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown:

    julia> s[1]
    '∀': Unicode U+2200 (category Sm: Symbol, math)
    
    julia> s[2]
    ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
    Stacktrace:
    [...]
    
    julia> s[3]
    ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'∀', [4]=>' '
    Stacktrace:
    [...]
    
    julia> s[4]
    ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

    In this case, the character is a three-byte character, so the indices 2 and 3 are invalid and the next character's index is 4; this next valid index can be computed by nextind(s,1), and the next index after that by nextind(s,4) and so on.
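
    Stepping with nextind can be wrapped in a loop. The following is a sketch, with char_indices as a hypothetical helper name, that collects every valid character index of a string and uses isvalid(s, i) to test individual indices:

```julia
s = "∀ x ∃ y"

# Hypothetical helper: collect every valid character index by stepping
# from one character start to the next with nextind.
function char_indices(str::AbstractString)
    indices = Int[]
    i = firstindex(str)
    while i <= lastindex(str)
        push!(indices, i)
        i = nextind(str, i)
    end
    return indices
end

println(char_indices(s))   # [1, 4, 5, 6, 7, 10, 11]
println(isvalid(s, 2))     # false: inside the encoding of '∀'
println(isvalid(s, 4))     # true
```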

    Since end is always the last valid index into a collection, end-1 references an invalid byte index if the second-to-last character is multibyte.

    julia> s[end-1]
    ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
    
    julia> s[end-2]
    ERROR: StringIndexError: invalid index [9], valid nearby indices [7]=>'∃', [10]=>' '
    Stacktrace:
    [...]
    
    julia> s[prevind(s, end, 2)]
    '∃': Unicode U+2203 (category Sm: Symbol, math)

    The first case works, because the last character y and the space are one-byte characters, whereas end-2 indexes into the middle of the multibyte representation of ∃. The correct approach in this case is to use prevind(s, lastindex(s), 2) or, if you are using that value to index into s, to write s[prevind(s, end, 2)], where end expands to lastindex(s).
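
    The same idea generalizes to counting any number of characters back from the end. The helper nth_from_end below is hypothetical and assumes n is no larger than the number of characters before the last one:

```julia
s = "∀ x ∃ y"

# Hypothetical helper: the character n positions before the last one
# (n = 0 gives the last character itself).
nth_from_end(str::AbstractString, n::Integer) = str[prevind(str, lastindex(str), n)]

println(nth_from_end(s, 0))   # y
println(nth_from_end(s, 2))   # ∃
```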

    Extraction of a substring using range indexing also expects valid byte indices or an error is thrown:

    julia> s[1:1]
    "∀"
    
    julia> s[1:2]
    ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
    Stacktrace:
    [...]
    
    julia> s[1:4]
    "∀ "

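    When it is more natural to think in character positions than in byte indices, nextind(s, 0, n) gives the byte index of the n-th character. The sketch below uses a hypothetical helper charslice and assumes 1 <= a <= b <= length(s):

```julia
s = "∀ x ∃ y"

# Hypothetical helper: substring from the a-th to the b-th character,
# translating character positions into valid byte indices first.
charslice(str::AbstractString, a::Integer, b::Integer) =
    str[nextind(str, 0, a):nextind(str, 0, b)]

println(charslice(s, 1, 2) == "∀ ")    # true
println(charslice(s, 5, 7) == "∃ y")   # true
```
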
    Because of variable-length encodings, the number of characters in a string (given by length(s)) is not always the same as the last index. If you iterate through the indices 1 through lastindex(s) and index into s, the sequence of characters returned when errors aren't thrown is the sequence of characters comprising the string s. Thus length(s) <= lastindex(s), since each character in a string must have its own index. The following is an inefficient and verbose way to iterate through the characters of s:

    julia> for i = firstindex(s):lastindex(s)
               try
                   println(s[i])
               catch
                   # ignore the index error
               end
           end
    ∀
    
    x
    
    ∃
    
    y

    The blank lines actually have spaces on them. Fortunately, the above awkward idiom is unnecessary for iterating through the characters in a string, since you can just use the string as an iterable object, no exception handling required:

    julia> for c in s
               println(c)
           end
    ∀
    
    x
    
    ∃
    
    y

    If you need to obtain valid indices for a string, you can use the nextind and prevind functions to increment/decrement to the next/previous valid index, as mentioned above. You can also use the eachindex function to iterate over the valid character indices:

    julia> collect(eachindex(s))
    7-element Vector{Int64}:
      1
      4
      5
      6
      7
     10
     11
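
    To see indices and characters side by side, pairs(s) can be iterated as well. This short sketch also contrasts length, lastindex and ncodeunits for the same string:

```julia
s = "∀ x ∃ y"

# Collect (byte index, character) tuples for every character of s.
positions = Tuple{Int,Char}[]
for (i, c) in pairs(s)
    push!(positions, (i, c))
end

println(first(positions))   # (1, '∀')
println(last(positions))    # (11, 'y')

println(length(s))          # 7 characters
println(lastindex(s))       # 11, byte index of the last character
println(ncodeunits(s))      # 11 bytes in total
```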

    To access the raw code units (bytes for UTF-8) of the encoding, you can use the codeunit(s,i) function, where the index i runs consecutively from 1 to ncodeunits(s). The codeunits(s) function returns an AbstractVector{UInt8} wrapper that lets you access these raw codeunits (bytes) as an array.
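
    A brief sketch of working with the raw bytes of a short string follows; the byte values are those of the UTF-8 encoding of "∀ x":

```julia
s = "∀ x"

bytes = codeunits(s)            # array-like wrapper, no copy of the data
println(ncodeunits(s))          # 5
println(codeunit(s, 1) == 0xe2) # true: first byte of the encoding of '∀'
println(collect(bytes) == UInt8[0xe2, 0x88, 0x80, 0x20, 0x78])  # true

# A byte vector can be turned back into a String.
rebuilt = String(collect(bytes))
println(rebuilt == s)           # true
```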

    Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows any byte sequence to be treated as a String. In such situations, the rule is that when parsing a sequence of code units from left to right, characters are formed by the longest sequence of 8-bit code units that matches the start of one of the following bit patterns (each x can be 0 or 1):

      • 0xxxxxxx;
      • 110xxxxx 10xxxxxx;
      • 1110xxxx 10xxxxxx 10xxxxxx;
      • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx;
      • 10xxxxxx;
      • 11111xxx.

    In particular this means that overlong and too-high code unit sequences and prefixes thereof are treated as a single invalid character rather than multiple invalid characters. This rule may be best explained with an example:

    julia> s = "\xc0\xa0\xe2\x88\xe2|"
    "\xc0\xa0\xe2\x88\xe2|"
    
    julia> foreach(display, s)
    '\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
    '\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
    '\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
    '|': ASCII/Unicode U+007C (category Sm: Symbol, math)
    
    julia> isvalid.(collect(s))
    4-element BitVector:
     0
     0
     0
     1
    
    julia> s2 = "\xf7\xbf\xbf\xbf"
    "\U1fffff"
    
    julia> foreach(display, s2)
    '\U1fffff': Unicode U+1FFFFF (category In: Invalid, too high)

    We can see that the first two code units in the string s form an overlong encoding of the space character. It is invalid, but is accepted in a string as a single character. The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth code unit \xe2 is not a valid continuation of it. Therefore code units 3 and 4 are also interpreted as a single malformed character in this string. Similarly, code unit 5 forms a malformed character because | is not a valid continuation of it. Finally, the string s2 contains one code point that is too high.
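
    One way to deal with such data is to test validity first and, if desired, substitute a replacement character for each invalid one. The sketch below is only one possible approach; sanitize is a hypothetical helper, and the choice of U+FFFD as the replacement is an assumption:

```julia
s = "\xc0\xa0\xe2\x88\xe2|"

println(isvalid(s))   # false: the string contains invalid sequences

# Hypothetical helper: replace every invalid character with U+FFFD.
sanitize(str::AbstractString) = map(c -> isvalid(c) ? c : '\ufffd', str)

clean = sanitize(s)
println(isvalid(clean))   # true
println(length(clean))    # 4: one replacement per invalid character, plus '|'
```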

    Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages. For example, the LegacyStrings.jl package implements UTF16String and UTF32String types. Additional discussion of other encodings and how to implement support for them is beyond the scope of this document for the time being. For further discussion of UTF-8 encoding issues, see the section below on byte array literals. The transcode function is provided to convert data between the various UTF-xx encodings, primarily for working with external data and libraries.
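
    As a small illustration of transcode, the sketch below converts a UTF-8 string to UTF-16 code units and back:

```julia
s = "∀ x"

# UTF-8 String -> vector of UTF-16 code units
utf16 = transcode(UInt16, s)
println(utf16 == UInt16[0x2200, 0x0020, 0x0078])   # true

# UTF-16 code units -> UTF-8 String
roundtrip = transcode(String, utf16)
println(roundtrip == s)                            # true
```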

    Concatenation

    One of the most common and useful string operations is concatenation:

    julia> greet = "Hello"
    "Hello"
    
    julia> whom = "world"
    "world"
    
    julia> string(greet, ", ", whom, ".\n")
    "Hello, world.\n"

    It's important to be aware of potentially dangerous situations such as concatenation of invalid UTF-8 strings. The resulting string may contain different characters than the input strings, and its number of characters may be lower than the sum of the numbers of characters of the concatenated strings, e.g.:

    julia> a, b = "\xe2\x88", "\x80"
    ("\xe2\x88", "\x80")
    
    julia> c = string(a, b)
    "∀"
    
    julia> collect.([a, b, c])
    3-element Vector{Vector{Char}}:
     ['\xe2\x88']
     ['\x80']
     ['∀']
    
    julia> length.([a, b, c])
    3-element Vector{Int64}:
     1
     1
     1

    This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings concatenation preserves all characters in strings and additivity of string lengths.

    Julia also provides * for string concatenation:

    julia> greet * ", " * whom * ".\n"
    "Hello, world.\n"

    While * may seem like a surprising choice to users of languages that provide + for string concatenation, this use of * has precedent in mathematics, particularly in abstract algebra.

    In mathematics, + usually denotes a commutative operation, where the order of the operands does not matter. An example of this is matrix addition, where A + B == B + A for any matrices A and B that have the same shape. In contrast, * typically denotes a noncommutative operation, where the order of the operands does matter. An example of this is matrix multiplication, where in general A * B != B * A. As with matrix multiplication, string concatenation is noncommutative: greet * whom != whom * greet. As such, * is a more natural choice for an infix string concatenation operator, consistent with common mathematical use.

    More precisely, the set of all finite-length strings S together with the string concatenation operator * forms a free monoid (S, *). The identity element of this set is the empty string, "". Whenever a free monoid is not commutative, the operation is typically represented as \cdot, *, or a similar symbol, rather than +, which as stated usually implies commutativity.
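
    The monoid properties can be checked on concrete values. The sketch below also uses ^ on strings, which repeats a string in the same way that integer powers repeat multiplication:

```julia
greet, whom = "Hello", "world"

# Concatenation is not commutative.
println(greet * whom != whom * greet)                         # true

# The empty string is the identity element.
println("" * greet == greet && greet * "" == greet)           # true

# Concatenation is associative.
println((greet * ", ") * whom == greet * (", " * whom))       # true

# Powers repeat a string.
println(greet^2 == greet * greet)                             # true
```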

    Interpolation

    Constructing strings using concatenation can become a bit cumbersome, however. To reduce the need for these verbose calls to string or repeated multiplications, Julia allows interpolation into string literals using $, as in Perl:

    julia> greet = "Hello"; whom = "world";
    
    julia> "$greet, $whom.\n"
    "Hello, world.\n"

    This is more readable and convenient and equivalent to the above string concatenation – the system rewrites this apparent single string literal into the call string(greet, ", ", whom, ".\n").

    The shortest complete expression after the $ is taken as the expression whose value is to be interpolated into the string. Thus, you can interpolate any expression into a string using parentheses:

    julia> "1 + 2 = $(1 + 2)"
    "1 + 2 = 3"

    Both concatenation and string interpolation call string to convert objects into string form. However, string actually just returns the output of print, so new types should add methods to print or show instead of string.
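
    The following is a minimal sketch of customizing how a value appears in interpolation, assuming a hypothetical Celsius type. Defining a show method is sufficient here because print falls back to show:

```julia
# Hypothetical type used only for illustration.
struct Celsius
    degrees::Float64
end

# Customize the textual form; string and interpolation pick this up.
Base.show(io::IO, t::Celsius) = print(io, t.degrees, "°C")

t = Celsius(21.5)
println(string(t))         # 21.5°C
println("It is $t today")  # It is 21.5°C today
```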

    Most non-AbstractString objects are converted to strings closely corresponding to how they are entered as literal expressions:

    julia> v = [1,2,3]
    3-element Vector{Int64}:
     1
     2
     3
    
    julia> "v: $v"
    "v: [1, 2, 3]"

    string is the identity for AbstractString and AbstractChar values, so these are interpolated into strings as themselves, unquoted and unescaped:

    julia> c = 'x'
    'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
    
    julia> "hi, $c"
    "hi, x"

    To include a literal $ in a string literal, escape it with a backslash:

    julia> print("I have \$100 in my account.\n")
    I have $100 in my account.

    Triple-Quoted String Literals

    When strings are created using triple-quotes ("""...""") they have some special behavior that can be useful for creating longer blocks of text.

    First, triple-quoted strings are also dedented to the level of the least-indented line. This is useful for defining strings within code that is indented. For example:

    julia> str = """
               Hello,
               world.
             """
    "  Hello,\n  world.\n"

    In this case the final (empty) line before the closing """ sets the indentation level.

    The dedentation level is determined as the longest common starting sequence of spaces or tabs in all lines, excluding the line following the opening """ and lines containing only spaces or tabs (the line containing the closing """ is always included). Then for all lines, excluding the text following the opening """, the common starting sequence is removed (including lines containing only spaces and tabs if they start with this sequence), e.g.:

    julia> """    This
             is
               a test"""
    "    This\nis\n  a test"

    Next, if the opening """ is followed by a newline, the newline is stripped from the resulting string.

    """hello"""

    is equivalent to

    """
    hello"""

    but

    """
    
    hello"""

    will contain a literal newline at the beginning.

    Stripping of the newline is performed after the dedentation. For example:

    julia> """
             Hello,
             world."""
    "Hello,\nworld."

    If the newline is removed using a backslash, dedentation will be respected as well:

    julia> """
             Averylong\
             word"""
    "Averylongword"

    Trailing whitespace is left unaltered.

    Triple-quoted string literals can contain " characters without escaping.
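
    These rules combine naturally inside indented code. In the sketch below (usage_text is a hypothetical function), the closing """ sets the dedentation level, the newline after the opening """ is stripped, and the inner quotes need no escaping:

```julia
function usage_text()
    return """
        Usage: tool [options]
          --name "value"   set the name
        """
end

print(usage_text())
```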

    Note that line breaks in literal strings, whether single- or triple-quoted, result in a newline (LF) character \n in the string, even if your editor uses a carriage return \r (CR) or CRLF combination to end lines. To include a CR in a string, use an explicit escape \r; for example, you can enter the literal string "a CRLF line ending\r\n".

    Common Operations

    You can lexicographically compare strings using the standard comparison operators:

    julia> "abracadabra" < "xylophone"
    true
    
    julia> "abracadabra" == "xylophone"
    false
    
    julia> "Hello, world." != "Goodbye, world."
    true
    
    julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
    true
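
    Lexicographic comparison works on code points, so uppercase ASCII letters sort before lowercase ones. The sketch below shows the effect on sort and one possible way to obtain a case-insensitive ordering:

```julia
words = ["banana", "Cherry", "apple"]

println("Z" < "a")                  # true: 'Z' has a smaller code point than 'a'
println("apple" < "apples")         # true: a proper prefix sorts first
println(cmp("abc", "abd"))          # -1

println(sort(words))                # ["Cherry", "apple", "banana"]
println(sort(words, by=lowercase))  # ["apple", "banana", "Cherry"]
```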

    You can search for the index of a particular character using the findfirst and findlast functions:

    julia> findfirst('o', "xylophone")
    4
    
    julia> findlast('o', "xylophone")
    7
    
    julia> findfirst('z', "xylophone")

    You can start the search for a character at a given offset by using the functions findnext and findprev:

    julia> findnext('o', "xylophone", 1)
    4
    
    julia> findnext('o', "xylophone", 5)
    7
    
    julia> findprev('o', "xylophone", 5)
    4
    
    julia> findnext('o', "xylophone", 8)
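
    When nothing is found, these functions return nothing, which the REPL does not print. The sketch below handles that case explicitly and chains findnext to collect every match; all_positions is a hypothetical helper:

```julia
# Hypothetical helper: indices of every occurrence of c in s.
function all_positions(c::Char, s::AbstractString)
    found = Int[]
    i = findfirst(c, s)
    while i !== nothing
        push!(found, i)
        # Resume the search at the character after the current match.
        i = findnext(c, s, nextind(s, i))
    end
    return found
end

println(all_positions('o', "xylophone"))            # [4, 7]
println(findfirst('z', "xylophone") === nothing)    # true
println(something(findfirst('z', "xylophone"), 0))  # 0 as a fallback value
```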

    You can use the occursin function to check if a substring is found within a string:

    julia> occursin("world", "Hello, world.")
    true
    
    julia> occursin("o", "Xylophon")
    true
    
    julia> occursin("a", "Xylophon")
    false
    
    julia> occursin('o', "Xylophon")
    true

    The last example shows that occursin can also look for a character literal.

    Two other handy string functions are repeat and join:

    julia> repeat(".:Z:.", 10)
    ".:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:."
    
    julia> join(["apples", "bananas", "pineapples"], ", ", " and ")
    "apples, bananas and pineapples"

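    As a small combined illustration of repeat and join (the variable names are arbitrary):

```julia
header = "Fruit"
items = ["apples", "bananas", "pineapples"]

# Underline the header with as many '=' as it has characters.
underline = repeat("=", length(header))

# join also accepts non-string elements and an optional last delimiter.
numbered = join(1:3, ", ")
listing = join(items, ", ", " and ")

println(header)
println(underline)   # =====
println(numbered)    # 1, 2, 3
println(listing)     # apples, bananas and pineapples
```
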
    Some other useful functions include: