Python lexical analysis - logical line & compound statements

Python's grammar

Fortunately there is a Full Grammar specification in the Python documentation.

A statement is defined in that specification as:

stmt: simple_stmt | compound_stmt

And a logical line is terminated by a NEWLINE token (that's not stated in the grammar specification; I'm taking it from your question).
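To make the NEWLINE point concrete, here is a small sketch using the standard tokenize module. It counts NEWLINE tokens (which end a logical line) separately from NL tokens (line breaks that do not end one). The helper name count_line_tokens and the sample strings are only for illustration.

```python
import io
import tokenize

def count_line_tokens(source):
    """Return (NEWLINE count, NL count) for a piece of source code."""
    newline = nl = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NEWLINE:
            newline += 1
        elif tok.type == tokenize.NL:
            nl += 1
    return newline, nl

# Two physical lines joined by parentheses: the inner break is only an NL.
print(count_line_tokens("x = (1 +\n     2)\n"))
# Two small statements separated by ';' share a single NEWLINE.
print(count_line_tokens("a = 1; b = 2\n"))
```

In both cases there is exactly one NEWLINE token, so each sample is one logical line regardless of how many physical lines or small statements it contains.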

Step-by-step

Okay, let's go through this. What's the specification for a

simple_stmt:

simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt)

From here it branches into several different paths, and it probably doesn't make sense to go through each of them separately. Based on the specification, though, a simple_stmt could only cross logical line boundaries if one of the small_stmts contained a NEWLINE (currently none of them do, but they could).

Apart from that purely theoretical possibility, there is the

compound_stmt:

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
[...]
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
[...]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

I picked only the if statement and suite because they already suffice. The if statement, including elif and else and everything contained in them, is one statement (a compound statement). And because it may contain NEWLINEs (if the suite isn't just a simple_stmt), it already fulfills the requirement of "a statement that crosses logical line boundaries".

An example if (schematic):

if 1:
    100
    200

would be:

if_stmt
|---> test        --> 1
|---> NEWLINE
|---> INDENT
|---> expr_stmt   --> 100
|---> NEWLINE
|---> expr_stmt   --> 200
|---> NEWLINE
|---> DEDENT

And all of this belongs to the if statement (and it's not just a block "controlled" by the if or while, ...).
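The schematic can be checked against the real token stream. This is a minimal sketch using the standard tokenize and token modules; note that tokenize reports the colon with the generic type OP, and that the variable names are mine.

```python
import io
import token
import tokenize

source = "if 1:\n    100\n    200\n"

# Collect just the token type names, in order, for the whole snippet.
names = [
    token.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(names)

# Three NEWLINE tokens appear, all of them inside the single if statement.
print(names.count("NEWLINE"))
```

The INDENT and DEDENT tokens bracket the two expression statements, just like in the schematic, and every NEWLINE comes before the DEDENT that closes the suite.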

The same if with parser, symbol and token

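A way to visualize that is to use the built-in parser, token and symbol modules (honestly, I didn't know about these modules before I wrote this answer). Note that parser and symbol were deprecated in Python 3.9 and removed in Python 3.10, so the following only runs on older versions: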

import symbol
import parser
import token

s = """
if 1:
    100
    200
"""
st = parser.suite(s)

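def recursive_print(inp, level=0):
    # Each node is a nested list: [symbol_or_token_id, child_or_string, ...]
    for item in inp:
        if isinstance(item, int):
            # Translate numeric ids into grammar symbol or token names.
            print('.'*level, symbol.sym_name.get(item, token.tok_name.get(item, item)), sep="")
        elif isinstance(item, list):
            recursive_print(item, level+1)
        else:
            # Leaf text of a token, e.g. 'if' or '100'.
            print('.'*level, repr(item), sep="")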

recursive_print(st.tolist())

I can't explain most of the parser output, but it shows (once you remove a lot of unnecessary lines) that the suite, including its newlines, really belongs to the if_stmt. The number of leading dots represents the "depth" of the parser at that point.

file_input
.stmt
..compound_stmt
...if_stmt
....NAME
....'if'
....test
.........expr
...................NUMBER
...................'1'
....COLON
....suite
.....NEWLINE
.....INDENT
.....stmt
...............expr
.........................NUMBER
.........................'100'
.......NEWLINE
.....stmt
...............expr
.........................NUMBER
.........................'200'
.......NEWLINE
.....DEDENT
.NEWLINE
.ENDMARKER

That could probably be made much prettier, but I hope it serves as an illustration even in its current form.
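On Python 3.10 and later, where parser and symbol no longer exist, a similar picture can be sketched with the ast module. This is a substitute technique, not the same thing: ast produces an abstract syntax tree, so there are no NEWLINE, INDENT or DEDENT entries, but the nesting still shows that both expression statements live inside the if. The helper name tree_lines is my own, and the Constant node name assumes Python 3.8 or newer.

```python
import ast

source = "if 1:\n    100\n    200\n"
tree = ast.parse(source)

def tree_lines(node, level=0):
    """Return node class names, one per entry, with dots marking depth."""
    lines = ["." * level + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(tree_lines(child, level + 1))
    return lines

lines = tree_lines(tree)
print("\n".join(lines))
```

The two Expr entries appear one level below If, which is the abstract-tree counterpart of the suite belonging to the if_stmt.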


It's simpler than you think. A compound statement is considered a single statement, even though it may have other statements inside. Quoting the docs:

Compound statements contain (groups of) other statements; they affect or control the execution of those other statements in some way. In general, compound statements span multiple lines, although in simple incarnations a whole compound statement may be contained in one line.

For example,

if a < b:
    do_thing()
    do_other_thing()

is a single if statement occupying 3 logical lines. That's how a statement can cross logical line boundaries.
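This can be verified with the ast module. It's a small sketch assuming Python 3.8 or newer (for end_lineno). The names a, b, do_thing and do_other_thing are never looked up because the code is only parsed, not executed. The line numbers reported are physical lines, which here coincide with the logical lines.

```python
import ast

source = (
    "if a < b:\n"
    "    do_thing()\n"
    "    do_other_thing()\n"
)
module = ast.parse(source)

# The module body holds exactly one top-level statement: the whole if.
(if_node,) = module.body
print(type(if_node).__name__, if_node.lineno, if_node.end_lineno)

# The two calls are nested inside the if rather than being siblings of it.
print(len(if_node.body))
```

The parser reports one If statement starting on line 1 and ending on line 3, with the two call statements contained in its body.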