116

In Compiler Construction by Aho, Ullman, and Sethi, it is stated that the input string of characters of the source program is divided into sequences of characters that have a logical meaning, known as tokens, and that lexemes are the sequences of characters that make up a token. So what is the basic difference?

3
  • 2
    > Compiler Construction by Aho Ullman and Sethi - Is this the same book as the dragon book? I'm having trouble locating a book by this name?
    – Drazisil
    Commented Apr 21, 2022 at 2:09
  • 1
    The "Dragon Book" is Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Commented Feb 27, 2023 at 15:37
  • It was probably a sloppy reference to the "Dragon Book" (e.g., the authors' names were not capitalised in the first revision). It also left out crucial punctuation. Commented Feb 27, 2023 at 15:39

14 Answers

154

Using "Purple Dragon Book,

Lexeme pg. 111

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Token pg. 111

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes.

Pattern pg. 111

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

Figure 3.2: Examples of tokens, pg. 112

[Token]       [Informal Description]                  [Sample Lexemes]
if            characters i, f                         if
else          characters e, l, s, e                   else
comparison    < or > or <= or >= or == or !=          <=, !=
id            letter followed by letters and digits   pi, score, D2
number        any numeric constant                    3.14159, 0, 6.02e23
literal       anything but ", surrounded by "'s       "core dumped"

To better understand how this relates to the lexer and the parser, we will start with the parser and work backwards to the input.

To make a parser easier to design, it does not work with the input directly but takes in a list of tokens generated by a lexer. Looking at the token column in Figure 3.2, we see tokens such as if, else, comparison, id, number, and literal; these are the names of tokens. Typically with a lexer/parser, a token is a structure that holds not only the name of the token but also the characters/symbols that make up the token and their start and end positions, which are used for error reporting, highlighting, etc.

The lexer takes the input characters/symbols and, using its rules, converts them into tokens. People who work with lexers/parsers have their own words for the things they use often. What you think of as the sequence of characters/symbols that makes up a token is what they call a lexeme. So when you see lexeme, just think of a sequence of characters/symbols representing a token. In the comparison example, that sequence can match different patterns, such as < or > or <= or >=, while other tokens have lexemes such as else or 3.14.

Another way to think of the relation between the two is that a token is a programming structure used by the parser that has a property called lexeme, which holds the characters/symbols from the input. If you look at most definitions of a token in code, though, you may not see lexeme among its properties. That is because a token will more likely hold the start and end positions of the characters/symbols that represent it; since the input is static, the lexeme can be derived from those positions whenever it is needed.
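
As a rough sketch of that idea (the class layout and field names are hypothetical, not from the book), a token that stores only positions might look like this in Python, with the lexeme derived on demand:

    from dataclasses import dataclass

    @dataclass
    class Token:
        name: str     # token name, e.g. "id", "number", "comparison"
        start: int    # offset of the first character of the lexeme
        end: int      # offset one past the last character

        def lexeme(self, source: str) -> str:
            # The input is static, so the lexeme can be sliced out
            # of the source text whenever it is needed.
            return source[self.start:self.end]

    source = "score <= 100"
    tok = Token("id", 0, 5)
    print(tok.name, tok.lexeme(source))   # id score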

6
  • 16
    In colloquial compiler usage, people tend to use the two terms interchangeably. The precise distinction is nice, if and when you need it.
    – Ira Baxter
    Commented Feb 19, 2013 at 13:59
  • 1
    While not a purely computer science definition, here is one of relevance from natural language processing, from Introduction to Lexical Semantics: a lexeme is "an individual entry in the lexicon".
    – Guy Coder
    Commented Mar 3, 2017 at 15:47
  • 1
    Absolutely clear explanation. This is how things should be explained in heaven. Commented Feb 19, 2020 at 10:09
  • 1
    Great explanation. I have one more doubt: I also read that during the parsing stage the parser asks the lexical analyzer for tokens, since the parser cannot validate tokens itself. Can you please explain, with a simple input at the parser stage, when the parser asks the lexer for tokens?
    – Prasanna
    Commented Sep 9, 2020 at 20:11
  • 1
    @GuyCoder, I have posted my question where I want some information about the symbol table. Can you please look into it: https://stackoverflow.com/questions/63820620/what-values-are-stored-into-symbol-table-in-compiler-construction. I will also post the above question in a different thread. Thank you.
    – Prasanna
    Commented Sep 9, 2020 at 23:10
45

When a source program is fed into the lexical analyzer, it begins by breaking up the characters into sequences of lexemes. The lexemes are then used in the construction of tokens, in which the lexemes are mapped into tokens. A variable called myVar would be mapped into a token stating <id, "num">, where "num" should point to the variable's location in the symbol table.

Shortly put:

  • Lexemes are the words derived from the character input stream.
  • Tokens are lexemes mapped into a token-name and an attribute-value.


An example includes:
x = a + b * 2
Which yields the lexemes: {x, =, a, +, b, *, 2}
With corresponding tokens: {<id, 0>, <=>, <id, 1>, <+>, <id, 2>, <*>, <id, 3>}
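
Here is a minimal sketch of how those pairs could be produced (it follows this answer in treating every name and constant as an id pointing into the symbol table; the regular expression is illustrative):

    import re

    def tokenize(text):
        symtab = []   # symbol table: one entry per distinct lexeme
        tokens = []
        for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", text):
            if lexeme in "=+*":
                tokens.append((lexeme,))             # operators carry no attribute
            else:
                if lexeme not in symtab:
                    symtab.append(lexeme)
                tokens.append(("id", symtab.index(lexeme)))
        return tokens, symtab

    tokens, symtab = tokenize("x = a + b * 2")
    print(tokens)   # [('id', 0), ('=',), ('id', 1), ('+',), ('id', 2), ('*',), ('id', 3)]
    print(symtab)   # ['x', 'a', 'b', '2']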

2
  • 3
    Is it supposed to be <id, 3>? Because 2 is not an identifier.
    – Aditya
    Commented Dec 5, 2018 at 11:24
  • But where does it say that x is an identifier? Does that mean the symbol table is a three-column table having 'name' = x, 'type' = 'identifier (id)', 'pointer' = '0' as a particular entry? Then it must have some other entry like 'name' = while, 'type' = 'keyword', 'pointer' = '21'?
    – user4831795
    Commented Feb 22, 2019 at 23:52
11

Lexeme - A lexeme is a string of characters that is the lowest-level syntactic unit in the programming language.

Token - A token is a syntactic category that forms a class of lexemes; it tells which class a lexeme belongs to: is it a keyword, an identifier, or anything else? One of the major tasks of the lexical analyzer is to create pairs of lexemes and tokens, that is, to collect all the characters.

Let us take an example:

if(y<= t)

y=y-3;

Lexeme    Token
------    -----------------
if        KEYWORD
(         LEFT PARENTHESIS
y         IDENTIFIER
<=        COMPARISON
t         IDENTIFIER
)         RIGHT PARENTHESIS
y         IDENTIFIER
=         ASSIGNMENT
y         IDENTIFIER
-         ARITHMETIC
3         INTEGER
;         SEMICOLON

Relation between Lexeme and Token
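
Here is a rough sketch of a lexer that reproduces this table (the named-group regular expressions are illustrative, just one common way to write such rules):

    import re

    # One pattern per token, tried left to right (keyword before identifier,
    # two-character comparisons before single "=").
    TOKEN_SPEC = [
        ("KEYWORD",           r"\bif\b"),
        ("IDENTIFIER",        r"[A-Za-z_]\w*"),
        ("COMPARISON",        r"<=|>=|==|!=|<|>"),
        ("ASSIGNMENT",        r"="),
        ("ARITHMETIC",        r"[-+*/]"),
        ("INTEGER",           r"\d+"),
        ("LEFT_PARENTHESIS",  r"\("),
        ("RIGHT_PARENTHESIS", r"\)"),
        ("SEMICOLON",         r";"),
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    for m in MASTER.finditer("if(y<=t) y=y-3;"):
        print(f"{m.group():4} {m.lastgroup}")   # lexeme, token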

1
  • This explanation is so straightforward! Thanks for the sharing
    – LomoY
    Commented Jun 22, 2022 at 2:25
8

LEXEME - Sequence of characters matched by PATTERN forming the TOKEN

PATTERN - The set of rules that defines a TOKEN

TOKEN - A meaningful collection of characters over the character set of the programming language, e.g. ID, constant, keyword, operator, punctuation, literal string

7

(a) Tokens are symbolic names for the entities that make up the text of the program; e.g. if for the keyword if, and id for any identifier. These make up the output of the lexical analyser.

(b) A pattern is a rule that specifies when a sequence of characters from the input constitutes a token; e.g. the sequence i, f for the token if, and any sequence of alphanumerics starting with a letter for the token id.

(c) A lexeme is a sequence of characters from the input that matches a pattern (and hence constitutes an instance of a token); for example, if matches the pattern for if, and foo123bar matches the pattern for id.

7

Lexeme - A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Token - A token is a pair consisting of a token name and an optional token value. The token name is a category of a lexical unit. Common token names are:

  • identifiers: names the programmer chooses
  • keywords: names already in the programming language
  • separators (also known as punctuators): punctuation characters and paired-delimiters
  • operators: symbols that operate on arguments and produce results
  • literals: numeric, logical, textual, reference literals

Consider this expression in the programming language C:

sum = 3 + 2;

Tokenized, it is represented by the following table:

Lexeme    Token category
------    -------------------
sum       Identifier
=         Assignment operator
3         Integer literal
+         Addition operator
2         Integer literal
;         End of statement

5

Let's see the working of a lexical analyser (also called a scanner).

Let's take an example expression:

Input: cout << 3+2+3;

Formatting performed by the scanner: {cout}|space|{<<}|space|{3}{+}{2}{+}{3}{;}

It is not the actual output though.

The scanner simply looks repeatedly for a lexeme in the source-program text until the input is exhausted.

A lexeme is a substring of the input that forms a valid string-of-terminals present in the grammar. Every lexeme follows a pattern, which is explained at the end (in the part the reader may skip).

(An important rule is to look for the longest possible prefix forming a valid string-of-terminals until the next whitespace is encountered... explained below.)

Lexemes:

  1. cout
  2. <<

(although "<" is also a valid terminal-string, the above-mentioned rule selects the pattern for the lexeme "<<" in order to generate a token returned by the scanner)

  3. 3
  4. +
  5. 2
  6. ;

Tokens: Tokens are returned one at a time by the scanner (when requested by the parser) each time the scanner finds a (valid) lexeme. When it finds a lexeme, the scanner creates a symbol-table entry for it (having attributes: mainly the token-category and a few others), if one is not already present, in order to generate its token.

'#' denotes a symbol-table entry. I have pointed to the lexeme number in the above list for ease of understanding, but technically it should be the actual index of a record in the symbol table.

The following tokens are returned by the scanner to the parser in the specified order for the above example.

  1. < identifier, #1 >

  2. < Operator, #2 >

  3. < Literal, #3 >

  4. < Operator, #4 >

  5. < Literal, #5 >

  6. < Operator, #4 >

  7. < Literal, #3 >

  8. < Punctuator, #6 >

As you can see, the difference is that a token is a pair, unlike a lexeme, which is a substring of the input.

And the first element of the pair is the token-class/category.

Token classes are listed below:

  • KEYWORDS
  • IDENTIFIERS
  • LITERALS
  • PUNCTUATORS
  • OPERATORS

And one more thing: the scanner detects whitespace, ignores it, and does not form any token for whitespace at all. Not all delimiters are whitespace; whitespace is just one form of delimiter used by scanners. Tabs, newlines, spaces, and escaped characters in the input are all collectively called whitespace delimiters. A few other delimiters are ';', ',', ':', etc., which are widely recognised as lexemes that form tokens.

The total number of tokens returned here is 8. However, only 6 symbol-table entries are made for the lexemes. Lexemes are also 8 in total (see the definition of a lexeme).
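
Here is a rough sketch of such a scanner in Python (the categories and symbol-table numbering follow this answer; the regular expressions are illustrative). A parser would pull one token at a time with next() rather than looping:

    import re

    SPEC = [("identifier", r"[A-Za-z_]\w*"),
            ("literal",    r"\d+"),
            ("operator",   r"<<|<|[+\-*/]"),   # "<<" before "<": longest match wins
            ("punctuator", r";")]
    MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

    def scanner(source):
        symtab = {}                         # lexeme -> symbol-table entry number
        for m in MASTER.finditer(source):   # whitespace falls between matches: no token
            entry = symtab.setdefault(m.group(), len(symtab) + 1)
            yield m.lastgroup, entry        # one token per request from the parser

    for category, entry in scanner("cout << 3+2+3;"):
        print(f"<{category}, #{entry}>")    # 8 tokens, 6 symbol-table entries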

You can skip this part

A pattern is a rule (say, a regular expression) that is used to check if a string-of-terminals is valid or not.

If a substring of the input composed only of grammar terminals follows the rule specified by any of the listed patterns, it is validated as a lexeme, and the selected pattern identifies the category of the lexeme; otherwise a lexical error is reported, due either to (i) not following any of the rules or (ii) the input containing a bad terminal-character not present in the grammar itself.

For example:

  1. No pattern exists: In C++, "99Id_Var" is a grammar-supported string-of-terminals, but it is not recognised by any of the patterns; hence a lexical error is reported.

  2. Bad input character: $, @, and Unicode characters may not be supported as valid characters in a few programming languages.

1
  • "the scanner detects whitespaces. It ignores them and does not form any token for a whitespace at all." — this is a great example of tokens not always representing input 1:1 but being what is convenient to feed into the parser. Commented Jun 21, 2023 at 10:38
4

Token: The kind (keyword, identifier, punctuation character, multi-character operator) is, simply, a token.

Pattern: A rule for the formation of a token from input characters.

Lexeme: A sequence of characters in the SOURCE PROGRAM matched by the pattern for a token. Basically, it is an instance of a token.

4

Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
1) Identifiers
2) Keywords
3) Operators
4) Special symbols
5) Constants

Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

2

Computer science researchers, as those from mathematics, are fond of creating "new" terms.

The previous answers are all nice, but apparently there isn't any great need to distinguish tokens and lexemes, IMHO. They are like two ways to represent the same thing.

A lexeme is concrete—here is a set of characters; a token, on the other hand, is abstract—usually referring to the type of a lexeme together with its semantic value if that makes sense. Just my two cents.

1

It's useful to remember why the tokenizer/parser separation is widely practiced. I'm over-simplifying, but:

  • Many of the efficient (= time proportional to input length) parsing algorithms can't deal with arbitrary context-free grammars but require some bounded lookahead, e.g. LR(4) may only peek ahead 4 characters.

  • Alas, "4 characters" is useless. How much lookahead do you need for, say, an addition ::= variable '+' variable rule? That is already a problem: very_looooong_variable_name+n requires you to look ahead until you find the + operator — but we really want to allow arbitrarily long variable names. And arbitrarily long string literals, etc.

  • The solution was to not run the parser on characters! First we pre-process conceptually simple but long sequences, like identifiers and strings, into single "tokens"; then we can get away with limits like "2 tokens of lookahead" in many languages.

It happens that this idea roughly corresponds to the word/grammar levels in natural languages, so the term "lexeme" was borrowed; but at least for the term "token", the intent is really "whatever is convenient to feed into the parser stage".

For example, Python's indentation-sensitive grammar is not formally context-free at all, but that's alright: the tokenizer converts indentation changes into INDENT/DEDENT tokens.
You can't exactly point to which characters form a "DEDENT lexeme" (it's actually the absence of space characters :-), but for the parser DEDENT is a real token, just as a Java parser would treat a CLOSING_BRACE token!
After such a tokenizer, everything was not just context-free but parsable by the LL(1) algorithm [well, until recently].
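
You can see this with Python's standard-library tokenize module; note that the DEDENT token's string is empty, because no input characters correspond to it:

    import io, tokenize

    src = "if x:\n    y = 1\nz = 2\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
    # ...prints NAME 'if', ..., INDENT '    ', ..., DEDENT '', NAME 'z', ...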

0

Lexeme - Lexemes are said to be a sequence of characters (alphanumeric) in a token.

Token - A token is a sequence of characters that can be identified as a single logical entity. Typically, tokens are keywords, identifiers, constants, strings, punctuation symbols, operators, and numbers.

Pattern - A set of strings described by a rule called a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions that are associated with the token.

-1

The lexical analyzer takes a sequence of characters, identifies a lexeme that matches the regular expression, and further categorizes it as a token. Thus, a lexeme is a matched string and a token name is the category of that lexeme.

For example, consider the regular expression below for an identifier, with the input "int foo, bar;":

letter(letter|digit|_)*

Here, foo and bar match the regular expression and are thus both lexemes, but they are categorized as one token, ID, i.e. identifier.

Also note that the next phase, i.e. the syntax analyzer, does not need to know about the lexeme, only the token.
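
A quick sketch of that match-then-categorize step (the keyword check is an illustrative addition, since int matches the same pattern but is a keyword in C):

    import re

    KEYWORDS = {"int"}
    for lexeme in re.findall(r"[A-Za-z][A-Za-z0-9_]*", "int foo, bar;"):
        token = "KEYWORD" if lexeme in KEYWORDS else "ID"
        print(lexeme, "->", token)   # int -> KEYWORD, foo -> ID, bar -> ID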

-2

A lexeme is basically the unit of a token: a sequence of characters that matches the token's pattern and helps break the source code into tokens.

For example: If the source is x=b, then the lexemes would be x, =, b and the tokens would be <id, 0>, <=>, <id, 1>.

1
  • An answer should be more specific. An example could be useful. Commented May 15, 2016 at 6:22
