Un analizador sintáctico en informática y linguística es un proceso que analiza secuencias de tokens para determinar su estructura gramatical respecto a una gramática formal dada.
Un parser es así mismo un programa que reconoce si una o varias cadenas de caracteres forman parte de un determinado lenguaje, es utilizado, por ejemplo, en la construcción de compiladores.
El análisis sintáctico convierte el texto de entrada en otras estructuras (comunmente árboles), que son mas útiles para el posterior análisis y capturan la jerarquía implícita de la entrada. Los analizadores léxicos crean tokens de una secuencia de caracteres de entrada y son estos tokens los que son procesados por el analizador para construir la estructura de datos, por ejemplo un árbol de análisis o árboles abstractos de sintaxis.
El análisis sintáctico también es un estado inicial del análisis de frases de lenguaje natural. Es usado para generar diagramas de lenguajes que usan flexión gramatical, como los idiomas romances o el latín. Los lenguajes habitualmente reconocidos por los analizadores sintácticos son los lenguajes libres de contexto. Cabe notar que existe una justificación formal que establece que los lenguajes libres de contexto son aquellos reconocibles por un autómata de pila, de modo que todo analizador sintáctico que reconozca un lenguaje libre de contexto es equivalente en capacidad computacional a un autómata de pila.
El compilador de compiladores que utilicé fue yacc.py, eneguida su descripción:
yacc.py is used to parse language syntax. Before showing an example, there are a few important bits of background that must be mentioned. First, syntax is usually specified in terms of a BNF grammar. For example, if you wanted to parse simple arithmetic expressions, you might first write an unambiguous grammar specification like this:
In the grammar, symbols such as NUMBER, +, -, *, and / are known as terminals and correspond to raw input tokens. Identifiers such as term and factor refer to more complex rules, typically comprised of a collection of tokens. These identifiers are known as non-terminals.
expression : expression + term
| expression - term
| term
term : term * factor
| term / factor
| factor
factor : NUMBER
| ( expression )
The semantic behavior of a language is often specified using a technique known as syntax directed translation. In syntax directed translation, attributes are attached to each symbol in a given grammar rule along with an action. Whenever a particular grammar rule is recognized, the action describes what to do. For example, given the expression grammar above, you might write the specification for a simple calculator like this:
A good way to think about syntax directed translation is to simply think of each symbol in the grammar as some kind of object. The semantics of the language are then expressed as a collection of methods/operations on these objects.
Grammar
--------------------------------
expression0 : expression1 + term
| expression1 - term
| term
term0 : term1 * factor
| term1 / factor
| factor
factor : NUMBER
| ( expression )
Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a bottom up technique that tries to recognize the right-hand-side of various grammar rules. Whenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the grammar symbols are replaced by the grammar symbol on the left-hand-side.
LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next input token for patterns. The details of the algorithm can be found in a compiler text, but the following example illustrates the steps that are performed if you wanted to parse the expression 3 + 5 * (10 - 20) using the grammar defined above:
Step Symbol Stack Input Tokens
---- --------------------- ---------------------
1 $ 3 + 5 * ( 10 - 20 )$
2 $ 3 + 5 * ( 10 - 20 )$
3 $ factor + 5 * ( 10 - 20 )$
4 $ term + 5 * ( 10 - 20 )$
5 $ expr + 5 * ( 10 - 20 )$
6 $ expr + 5 * ( 10 - 20 )$
7 $ expr + 5 * ( 10 - 20 )$
8 $ expr + factor * ( 10 - 20 )$
9 $ expr + term * ( 10 - 20 )$
10 $ expr + term * ( 10 - 20 )$
11 $ expr + term * ( 10 - 20 )$
12 $ expr + term * ( 10 - 20 )$
13 $ expr + term * ( factor - 20 )$
14 $ expr + term * ( term - 20 )$
15 $ expr + term * ( expr - 20 )$
16 $ expr + term * ( expr - 20 )$
17 $ expr + term * ( expr - 20 )$
18 $ expr + term * ( expr - factor )$
19 $ expr + term * ( expr - term )$
20 $ expr + term * ( expr )$
21 $ expr + term * ( expr ) $
22 $ expr + term * factor $
23 $ expr + term $
24 $ expr $
25 $ $
Ejemplo de un analizador sintactico de una
calculadora en PLY:
import ply.yacc as yacc
# Get the token map from the lexer.
This is required.
from calclex import tokens
def p_expression_plus(p):
'expression : expression PLUS term'
p[0] = p[1] + p[3]
def p_expression_minus(p):
'expression : expression MINUS term'
p[0] = p[1] - p[3]
def p_expression_term(p):
'expression : term'
p[0] = p[1]
def p_term_times(p):
'term : term TIMES factor'
p[0] = p[1] * p[3]
def p_term_div(p):
'term : term DIVIDE factor'
p[0] = p[1] / p[3]
def p_term_factor(p):
'term : factor'
p[0] = p[1]
def p_factor_num(p):
'factor : NUMBER'
p[0] = p[1]
def p_factor_expr(p):
'factor : LPAREN expression RPAREN'
p[0] = p[2]
# Error rule for syntax errors
def p_error(p):
print "Syntax error in input!"
# Build the parser
yacc.yacc()
# Use this if you want to build the parser using SLR instead of LALR
# yacc.yacc(method="SLR")
while 1:
try:
s = raw_input('calc > ')
except EOFError:
break
if not s: continue
result = yacc.parse(s)
print result