Build SysY Lex Analyzer With FLex

Tip

通过本文，你将了解：

FLex基本操作
如何通过FLex编写一个简单的词法分析器

Note

此词法分析器将作为一个语法分析器的部分，因此输出仅为一系列的Token与值，并不执行其他操作。
此外，本词法分析器基于SysY语法，它是c语言的一个子集。
本文将收录于杭电编译原理实验报告

前言

GNU FLex 介绍

点击查看FLex介绍

Flex的前身是Lex。Lex是1975年由Mike Lesk和当时还在AT&T做暑期实习的Eric Schmidt，共同完成的一款基于Unix环境的词法分析程序生成工具。虽然Lex很出名并被广泛使用，但它的低效和诸多问题也使其颇受诟病。后来伯克利实验室的Vern Paxson使用C语言重写Lex，并将这个新的程序命名为Flex(意为Fast Lexical Analyzer Generator)。无论在效率上还是在稳定性上，Flex都远远好于它的前辈Lex。我们在Linux下使用的是Flex在GNU License下的版本，称作GNU Flex。

FLex安装与初步使用

在Linux中，安装FLex十分简单，只需要sudo apt-get install flex即可。

然后我们创建一个新文件scanner.l，并在其中输入以下内容：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
%%
[0-9]+  printf("?");
#       return 0;
.       ECHO;
%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}

int yywrap() { 
    return 1;
}

使用指令flex scanner.l编译程序，可以看到目录下出现了一个名为 lex.yy.c的文件，我们再使用gcc -o scanner lex.yy.c，即可得到可执行文件./scanner。
我们试着运行一下./scanner并输入：

1
2
3
4
5
6
7
$ ./scanner
afeafewa
afeafewa
1233
?
aaaa111
aaaa?

可以观察到这段代码将数字部分都替换为了?，其他部分不变。

FLex语法浅析

在介绍FLex语法前，先解释上文程序工作原理。
观察代码，可以看到[0-9]+ printf("?");，这行代码匹配输入中包含1个或多个数字，接收到数字串后输出一个?。
程序执行时，逐字读取输入，并自上而下尝试匹配定义的规则，并执行匹配规则后方的代码。

在本程序中，如果是数字，会被[0-9]+匹配，并输出?，而其它字符将被. ECHO;匹配，输出该字符本身。因此上文的代码才能实现将数字部分都替换为了?，其他部分不变的功能。

FLex语法可以分为三部分：

1
2
3
4
5
{definitions}
%%
{rules}
%%
{user subroutines}

其中definitions部分定义了一系列别名，格式为name definitions。例如br [ \t]+就是定义br为连续的一个或多个空格或制表符，在rules中，直接可以使用br代替[ \t]+进行匹配。
这就使规则定义更加灵活，例如：

1
2
3
identifier {identifier-nondigit}({identifier-nondigit}|{identifier-digit})*
identifier-digit [0-9]
identifier-nondigit [_a-zA-Z]

这就定义了一个identifier应该有的格式：一个非数字符号开头、有0个或多个字符、下划线重复组成的字符串。

rules部分格式为re/name actions，例如[0-9]+ printf("?");，FLex将自上而下对输入字符进行匹配，并执行对应的代码。

而user subroutines将会被原封不动的复制进入lex.yy.c中，你可以在这里定义一系列自定义函数与main函数。

SysY介绍

你可以在此处找到详细的SysY 2022文法的定义。

实现

Note

完整代码在完整代码小节中。

词法分析器接收字符串组序列，输出字符符号序列并识别可能的词法错误，将其交给语法分析器。
因此，我们在词法分析部分只需简单的用正则表达式匹配字符集中的各种模式和关键词即可。

由于FLex是自上而下匹配，因此我们可以在末尾使用.通配符匹配所有未在符号集中的单词，并抛出词法错误。
并且，因为我们还未接入语法分析器（语法分析器部分将收录于编译原理实验四），我们只需简单的打印匹配结果验证词法分析结果即可。

definitions部分

1
2
3
4
identifier {identifier-nondigit}({identifier-nondigit}|{identifier-digit})*
identifier-digit [0-9]
identifier-nondigit [_a-zA-Z]
br [ \t]+

rules部分

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34

{br} {;}
"/\*" {
    char curr,prev;
    prev = '\0';
    curr = input();
    while(prev!='*'||curr!='/') {prev = curr; curr = input();}
}
"//" {
    printf("<COMMENT>\n");
    char curr;
    curr = input();
    while(curr!='\n') curr = input();
}
"int" {printf("<INT>\n");}
/* 省略部分匹配代码，可在完整代码中查看 */
"}" {printf("<}>\n");}

\n      { ++yylineno; }

"0"[0-9a-zA-Z]* {
    int pos = 1;
    char c,buf[2];
    c = yytext[pos++];
    if (c!='x'&&c!='X'){
        int total=0;
/* 省略部分进制转换代码，可在完整代码中查看  */
}
{identifier} {return IDENTIFIER;}
[0-9]+ {printf("<INTNUM, %d>\n",atoi(yytext));}

. { printf("Error at Line %d: Invalid characters \'%s\'\n", yylineno, yytext);

 }

可以看到，对于大部分内容，仅仅是简单的匹配并输出，而对于整型数字，则需要将其解析为10进制值并输出。

值得注意的是，为了实现词法分析以及之后工作的结果与行号对应，我们在匹配到回车符时将yylineno加一\n { ++yylineno; }。yylineno是FLex定义的一个全局变量，代表当前处理行号。

user subrouting部分

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

int main(int argc, char** argv) {
    if (argc > 1) {
        yyin = fopen(argv[1], "r");
        if (!yyin) {
            perror("Error opening file");
            return 1;
        }
    } else {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }

    while (yylex() != 0);
    
    fclose(yyin); 

    return 0;
}

int yywrap() { 
    return 1;
}

这部分实现了从文件读取代码。

完整代码

点击此处查看完整代码

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
identifier {identifier-nondigit}({identifier-nondigit}|{identifier-digit})*
identifier-digit [0-9]
identifier-nondigit [_a-zA-Z]
br [ \t]+


%%
{br} {;}
"/\*" {
    printf("<COMMENT>\n");
    char curr,prev;
    prev = '\0';
    curr = input();
    while(prev!='*'||curr!='/') {prev = curr; curr = input();}
}
"//" {
    printf("<COMMENT>\n");
    char curr;
    curr = input();
    while(curr!='\n') curr = input();
}
"int" {printf("<INT>\n");}
"float" {printf("<FLOAT>\n");}
"void" {printf("<VOID>\n");}

"if" {printf("<IF>\n");}
"else" {printf("<ELSE>\n");}
"while" {printf("<WHILE>\n");}
"break"  {printf("<BREAK>\n");}
"return" {printf("<RETURN>\n");}


"+"    {printf("<ADD>\n");}
"-"    {printf("<SUB>\n");}
"*"    {printf("<MUL>\n");}
"/"    {printf("<DIV>\n");}
"%"    {printf("<MOD>\n");}

"!"    {printf("<NOT>\n");}

","    {printf("<,>\n");}
"="    {printf("<=>\n");}
";"    {printf("<;>\n");}

">=" {printf("<GE>\n");}
"<=" {printf("<LE>\n");}
"==" {printf("<EQ>\n");}
"!=" {printf("<NE>\n");}
">" {printf("<GT>\n");}
"<" {printf("<LT>\n");}

"||" {printf("<OR>\n");}
"&&" {printf("<AND>\n");}

"(" {printf("<(>\n");}
")" {printf("<)>\n");}
"[" {printf("<[>\n");}
"]" {printf("<]>\n");}
"{" {printf("<{>\n");}
"}" {printf("<}>\n");}

\n      { ++yylineno; }

"0"[0-9a-zA-Z]* {
    int pos = 1;
    char c,buf[2];
    c = yytext[pos++];
    if (c!='x'&&c!='X'){
        int total=0;
        while (c>='0'&&c<='7'){
            if (c>'9'||c<'0'){
                printf("Error at Line %d: Invalid characters \'%s\'\n", yylineno, yytext);
                return -1;
            }
            buf[0]=c;
            buf[1]='\0';
            total = total*8+atoi(buf);
            c = yytext[pos++];
        }
        if (c!=' '&&c!='\t'&&c!='\n'&&c!='\0'){
            printf("Error at Line %d: Invalid octalnumber \'%s\'\n", yylineno, yytext);
            return -1;
        }
        printf("<INTNUM, %d>\n", total);
        return 1;
    } else {
        c = yytext[pos++];
        int total = 0;
        while ((c>='0'&&c<='9')||(c>='a'&&c<='f')||(c>='A'&&c<='F')){
            if (c>='0'&&c<='9'){
                total = total*16 + c-'0';
            }
            if (c>='a'&&c<='f'){
                total = total*16 + c-'a'+10;
            }
            if (c>='A'&&c<='F'){
                total = total*16 + c-'A'+10;
            }
            c = yytext[pos++];
        }
        if (c!=' '&&c!='\t'&&c!='\n'&&c!='\0'){
            printf("Error at Line %d: Invalid hex number \'%s\'\n", yylineno, yytext);
            return -1;
        }
        printf("<INTNUM, %d>\n", total);
    }
}
{identifier} {return IDENTIFIER;}
[0-9]+ {printf("<INTNUM, %d>\n",atoi(yytext));}

. { printf("Error at Line %d: Invalid characters \'%s\'\n", yylineno, yytext);

 }

%%

int main(int argc, char** argv) {
    if (argc > 1) {
        yyin = fopen(argv[1], "r");
        if (!yyin) {
            perror("Error opening file");
            return 1;
        }
    } else {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }

    while (yylex() != 0);
    
    fclose(yyin);  // Remember to close the file when done

    return 0;
}

int yywrap() { 
    return 1;
}

总结

本文实现了一个简单的SysY词法分析器，并且将在下一篇文章中利用Bison+FLex结合此词法分析器进行语法分析器的制作。

使用FLex编写SysY词法分析器

本文介绍了FLex基本语法并提供了一个最小实践。之后提供了FLex搭建SysY词法分析器的具体方法。

前言