JavaScript Lexer Explained: How Your Code Becomes Tokens

Every developer has seen this at least once:
SyntaxError: Unexpected token at line 32, column 2

JavaScript isn’t running your code yet.
So how does it already know exactly where things went wrong?

The answer is the lexer, the first stage your code passes through every time the engine loads it.

How does JavaScript make sense of raw text?

It doesn’t, at least not directly.

Hit save on your code file and your IDE writes raw bytes to disk. Those bytes are decoded into a character stream by the JavaScript engine.
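As a rough illustration of that decoding step, the sketch below turns UTF-8 bytes into characters with the standard TextDecoder API. The byte values are just the encoding of "let x", and a real engine does this internally rather than through this API, so treat it as a picture of the idea only.

```javascript
// Bytes as they might sit on disk for the text "let x" (UTF-8 / ASCII).
const bytes = new Uint8Array([0x6c, 0x65, 0x74, 0x20, 0x78]);

// TextDecoder is a standard Web API, also available as a global in Node.js.
const sourceText = new TextDecoder("utf-8").decode(bytes);

// Spreading a string iterates it code point by code point.
const characters = [...sourceText];
console.log(characters.join("|")); // l|e|t| |x
```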

Consider an example: let x = 10 + 5;

Right now, this is still just raw text.

Before any programming language can understand your code, something has to give it structure - that's the lexer.

What does a lexer actually do?

A lexer reads each character and groups those characters into tokens - labelled meaningful units.

The same line, after lexing:

Value	Type
`let`	Keyword
`x`	Identifier
`=`	Operator
`10`	Number
`+`	Operator
`5`	Number
`;`	Punctuation

Each token isn't just a string value — it's a structured object with metadata:

{
  type:  "Keyword",
  value: "let",
  start: 0,       // character offset where this token starts
  end:   3,       // character offset just past its last character
  line:  1,
  col:   1
}

Those line and col fields are exactly what ends up in your error message.
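Here is a minimal sketch of how such an object might be built and turned into a position string. Both makeToken and describePosition are hypothetical helper names invented for this example; real parsers each have their own shapes.

```javascript
// Hypothetical helper: builds a token object shaped like the one above.
function makeToken(type, value, start, line, col) {
  return { type, value, start, end: start + value.length, line, col };
}

// Hypothetical helper: formats the position the way an error message would.
function describePosition(token) {
  return `line ${token.line}, column ${token.col}`;
}

const letToken = makeToken("Keyword", "let", 0, 1, 1);
console.log(letToken.end);               // 3
console.log(describePosition(letToken)); // line 1, column 1
```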

How does a lexer work?

Phase 1: Scanning

The lexer maintains a cursor - a pointer to its current position in the source string. It reads one character at a time and advances the cursor. No decisions yet, just reading and moving.
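The cursor idea can be sketched in a few lines. The function and method names below (createCursor, peek, advance, isAtEnd) are illustrative choices, not taken from any particular engine.

```javascript
// A minimal cursor over a source string. All names here are illustrative.
function createCursor(source) {
  let pos = 0;
  return {
    peek: () => source[pos],             // look at the current character without moving
    advance: () => source[pos++],        // return the current character, then move forward
    isAtEnd: () => pos >= source.length, // true once every character has been read
    position: () => pos,
  };
}

const cursor = createCursor("let x");
const seen = [];
while (!cursor.isAtEnd()) {
  seen.push(cursor.advance());
}
console.log(seen.join("|")); // l|e|t| |x
```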

Phase 2: Pattern matching

Once the lexer reads a character, it needs to decide: what kind of token am I looking at? It does this using a state machine, a set of rules like "if you see a letter, start reading a word", "if you see a digit, start reading a number", and "if you see a quote, start reading a string".

The table below is an example of what those rules look like in practice. The decision is made on the first character:

First character	What the lexer does
`a–z`, `A–Z`, `_`, `$`	Start reading an identifier or keyword
`0–9`	Start reading a number
`"` or `'`	Start reading a string
`+`, `=`, `*`...	Start reading an operator
`(`, `)`, `;`...	Emit a punctuation token immediately
` ` (space), `\t`, `\n`	Skip: whitespace is discarded and not included in the tokens
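The table above could be written as a small dispatch function. This is a simplified sketch: classifyFirstChar is a made-up name, the operator and punctuation sets are partial, and real JavaScript also has to handle Unicode identifiers, template literals, and comments.

```javascript
// Maps the first character of a token to the kind of token to start reading.
// Assumes `ch` is exactly one character.
function classifyFirstChar(ch) {
  if (/[A-Za-z_$]/.test(ch)) return "identifier-or-keyword";
  if (/[0-9]/.test(ch)) return "number";
  if (ch === '"' || ch === "'") return "string";
  if ("+-*/=<>!".includes(ch)) return "operator";
  if ("(){};,".includes(ch)) return "punctuation";
  if (ch === " " || ch === "\t" || ch === "\n") return "whitespace";
  return "unknown"; // a real lexer would report an error here
}

console.log(classifyFirstChar("l")); // identifier-or-keyword
console.log(classifyFirstChar("1")); // number
console.log(classifyFirstChar("@")); // unknown
```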

The lexer then keeps consuming characters as long as they still fit the current pattern. This is called the maximal munch rule: always take the longest possible match.

A practical example with ===:

a === b

When the lexer hits the first =, it doesn't emit a token immediately. It peeks ahead:

Next char is = → still valid, keep going → ==
Next char is = → still valid, keep going → ===
Next char is a space → doesn't fit, stop → emit ===

Without this rule, === could be misread as = then ==. Maximal munch eliminates that ambiguity.
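One simple way to get maximal munch for operators is to try every known operator at the current position and keep the longest one that matches. The sketch below assumes a small, incomplete operator list, and readOperator is a name made up for this example; production lexers usually peek character by character instead.

```javascript
// Illustrative operator list, not the full set from the specification.
const OPERATORS = ["===", "!==", "==", "!=", "=", "+", "-", "*"];

// Returns the longest operator that starts at `pos`, or null if none does.
function readOperator(source, pos) {
  let best = null;
  for (const op of OPERATORS) {
    if (source.startsWith(op, pos) && (best === null || op.length > best.length)) {
      best = op;
    }
  }
  return best;
}

console.log(readOperator("a === b", 2)); // ===
console.log(readOperator("a = b", 2));   // =
```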

The same rule is why this JavaScript breaks:

1.toString()   // SyntaxError! This isn’t a runtime error — it fails during tokenization itself.

The lexer sees 1. and greedily reads it as a complete number (1. is a valid way to write 1.0). The next character is t, and JavaScript does not allow an identifier to start immediately after a numeric literal, so tokenization fails. The fix:

(1).toString()  // works - the ) ends the number before the dot, so the dot is member access
1..toString()   // works - two dots: 1. is the float, second . is member access
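You can check this without breaking your own file by compiling each snippet from a string. The sketch below uses the built-in Function constructor for that, and tryCompile is just a helper name invented here. It assumes an environment such as Node.js where compiling code from strings is allowed.

```javascript
// Returns the error name if the source fails to compile, or "ok" if it compiles.
// new Function only compiles the text here; the resulting function is never called.
function tryCompile(sourceText) {
  try {
    new Function(sourceText);
    return "ok";
  } catch (err) {
    return err.name;
  }
}

console.log(tryCompile("1.toString()"));   // SyntaxError
console.log(tryCompile("(1).toString()")); // ok
console.log(tryCompile("1..toString()"));  // ok
```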

Phase 3: Emission

When the lexer reaches a character that doesn't fit the current pattern, the current token is complete. It emits the token, resets its internal state, and loops back to Phase 1, ready for the next token.

Input:  l  e  t     x
State:  START → IDENT → IDENT → IDENT → EMIT → START
Buffer: ""  →  "l" →  "le" →  "let" →  emit {type:"Keyword", value:"let"}
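The trace above can be turned into a tiny two-state loop. This sketch only recognises words and silently skips everything else. The state names follow the trace, while emitWords is a hypothetical function name.

```javascript
// Splits source into words using two states, START and IDENT.
function emitWords(source) {
  const emitted = [];
  let state = "START";
  let buffer = "";
  // The extra iteration past the end flushes a word that ends at end of input.
  for (let i = 0; i <= source.length; i++) {
    const ch = source[i];
    const fits = ch !== undefined && /[A-Za-z_$]/.test(ch);
    if (fits) {
      state = "IDENT";      // keep reading the current word
      buffer += ch;
    } else if (state === "IDENT") {
      emitted.push(buffer); // EMIT: the word is complete
      buffer = "";
      state = "START";      // reset and wait for the next token
    }
  }
  return emitted;
}

console.log(emitWords("let x").join(",")); // let,x
```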

Who defines the rules of the lexer?

There are two things that need to be defined before a lexer can work:

1. The token types themselves. Someone has to decide that Keyword, Identifier, Operator, Number, String, and Punctuation are the categories that exist. These aren't discovered automatically; a lexer author declares them, usually as a simple enum or set of constants:

const TokenType = {
  Keyword:     'Keyword',
  Identifier:  'Identifier',
  Number:      'Number',
  String:      'String',
  Operator:    'Operator',
  Punctuation: 'Punctuation',
}

2. What belongs in each type. Someone has to decide that let is a keyword, that identifiers start with a letter, underscore, or dollar sign, and that === is a valid operator. That someone is the language specification: ECMAScript, in JavaScript's case.

The lexer author reads the spec and translates it into code. The keyword list, for example, is literally a hardcoded set:

const KEYWORDS = new Set([
  'let', 'const', 'var', 'if', 'else',
  'return', 'function', 'for', 'while', 'class', // ...
])

Because the type names are the author's choice while the contents come from the spec, different JavaScript parsers (Babel, TypeScript, Acorn) can have slightly different internal token type names while still parsing the same language correctly.

Let's walk through one example.

For the let keyword, the scan goes as follows:

l - starts an identifier
e - still an identifier
t - still an identifier
(space) - whitespace ends the word; the scanned text let is in the keyword list, so it is emitted as a Keyword
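That last step, the keyword lookup, is small enough to sketch directly. The keyword set below is only a sample and classifyWord is an illustrative name. Note that the lookup happens on the whole word, which is why letter is not mistaken for let.

```javascript
// A small sample of reserved words; the full list lives in the ECMAScript spec.
const SAMPLE_KEYWORDS = new Set(["let", "const", "var", "if", "else", "return"]);

// Decides the token type once the whole word has been scanned.
function classifyWord(word) {
  return SAMPLE_KEYWORDS.has(word) ? "Keyword" : "Identifier";
}

console.log(classifyWord("let"));    // Keyword
console.log(classifyWord("letter")); // Identifier
console.log(classifyWord("x"));      // Identifier
```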

So how does it know it's line 32, column 2?

The lexer tracks line and col as it scans. Every time it sees a newline character (\n), it increments the line counter and resets the column counter to 1. Every other character increments the column. It does this continuously, for every single character, across the entire file, not just when something goes wrong.

So when it hits a character it doesn't recognise, say, an unexpected @ on line 32, it already knows exactly where it is:

line 31: const total = price + tax;\n   ← lexer sees \n, line becomes 32, col resets to 1
line 32:  @invalid                      ← lexer skips one space at col 1, reads @ at col 2, doesn't recognise it
          ^
          SyntaxError: Unexpected token at line 32, column 2
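The counting rule is easy to sketch. A real lexer updates these counters as it scans rather than recomputing them, so the locate function below (a made-up name) is only a compact way to show the same arithmetic on a two-line sample.

```javascript
// Returns the 1-based line and column of the character at `offset`.
function locate(source, offset) {
  let line = 1;
  let col = 1;
  for (let i = 0; i < offset; i++) {
    if (source[i] === "\n") {
      line++;  // a newline moves to the next line...
      col = 1; // ...and resets the column
    } else {
      col++;   // every other character moves one column right
    }
  }
  return { line, col };
}

const sample = "const a = 1;\n @invalid";
const at = locate(sample, sample.indexOf("@"));
console.log(`line ${at.line}, column ${at.col}`); // line 2, column 2
```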

Every tool you use (Babel, TypeScript, ESLint) starts with this exact step. They all depend on tokenizing your code correctly before doing anything else.

At this point, your code is no longer raw text.
It’s structured, labeled, and fully traceable.
Now it’s ready for the next step, the parser.

— Abhigna
Console Diaries — a developer’s notes
Connect with me on LinkedIn: a6h1gna
