yijun@cs.toronto.edu
edh@elis.rug.ac.be
jlu@cs.uwindsor.ca
This document describes the YAXX project version 0.10.
The latest version of this manual can be found at http://yaxx.sourceforge.net.
The YAXX tool is an XML extension to YACC, a well-known compiler-compiler. Given a program and its YACC grammar, YAXX outputs the internal parse tree of the program as an XML document. The generated XML conforms to the DTD that is also generated from the YACC grammar and it can be transformed back to a legal program through XSLT.
What YAXX includes
Extension to Bison, a variant of YACC.
Example grammars for ANSI C, Fortran 77 (from the Fortran parallelizing transformer FPT), SQL (from John R. Levine, et al book of Lex and Yacc, 2nd Edition), Java 1.1 and XML itself.
The input to LEX/YACC are two files, one for the lexical analysis (LEX), the other for the syntax analysis (YACC). LEX is responsible for converting the program into a stream of tokens, and YACC is responsible for generating a parser from a set of production rules to construct a parse tree with both terminals and non-terminals. A terminal is a leaf of the parse tree which corresponds to the LEX tokens, while a non-terminal is the intermediate node of the parse tree that corresponds to one of left hand side (LHS) of a production rule, and it often has children which correspond to the right hand sides (RHS) of the rule. The root of the parse tree is the LHS of the start rule. The parse tree is built from-bottom-to-up by reducing the symbols popped from a stack into a single non-terminal and pushing it back to the symbol stack. Once the symbol stack is empty, the parse tree has been constructed. The decisions of when to do the reduction depends on the grammar. YACC uses LALR(1) grammar which can peek one symbol further in order to resolve conflicts by an finite state automata (FSA). Most programming languages such as ANSI C, Java, Fortran, etc. conform to the LALR(1) grammar, thus YACC has been widely used to construct context-free grammar parsers from these languages.
There are good reasons to do the extension:
Visualize the actions of a YACC parser The parse tree is an important concept, however, YAXX does not output it so one can not see the construction of it. Therefore it is useful to write the parse tree out for visualizing the work of the YACC.
Abstract Syntax Tree (AST) generation using SAX Most compilers use actions on the event of production rule reduction to construct their own Abstract Syntax Tree. Indeed not every intermediate state or non-terminals needs to be represented inside the compiler. However, since the construction of abstract syntax tree happens always at a reduction, one can use the simple API for XML (SAX) to parse the YACC parse tree when it is an XML document. Parse tree in XML enables the programming language parser to be used together with the XML parser or vice versa.
Source code extraction from XML document using XPath/XSLT Having the parse tree which contains all the syntax information, the AST can be queried using XPath expressions. The completeness of the syntax information on YAXX document can be exemplified by a simple XSLT yaxx.xsl that transforms it back the original source program. So the XML document is syntectically equivalent to the original program.
Conversions between a DTD and the YACC grammar The Document Type Definitions (DTD) declares the structure of a valid XML document, because of the similarity mentioned above, one can use YACC to generate an equivalent parser in order to verify the equivalent language of the XML document.
Extensibility made by orthogonal name spaces The YAXX document is self-contained in the name space of syntax yaxx:, while the semantic information can be added to the document using a different name space without interferring the syntax name space. It allows program analysis and transformation to bookkeep on the same document without losing information when passing on to another program analysis or transformation. It also permits clearly interfacing different programming tools in a defined way.
There are a number of decisions made in the implementation:
An element tag stack is used to record the lexical tokens and the reduced non-terminals bottom up. The elements are marked up using the tags from their role names in the production rules. Attributes can be used only for the grammar which uses the attributes, they should be avoided if it is not necessary.
During the parser generation, the DTD is output as the side-effect using the generated tables. The separate production rules are merged into one element definition using "|" and terminals are defined using the rules <!element foo (#PCDATA)> and the lexicon value of the terminals are saved as the content inside the pair of start/end tags. Non-terminals do not have text values, they contains the children correspond to the right hand side of the applied production rule.
Spatial performance is an issue only when the source program is huge. Normally the generated XML document can be 10 times larger than the original program because of the additional tags and intermediate non-terminal symbols. Since the parse tree YAXX document is equivalent to the original source code, we can pipe it to an XSLT transformer to query information and discard it once the information is obtained. If the XML document needs to be stored, one can use a compressor which significantly compress the size.
Modularity is achieved by changing the skeleton or template for the parser generation. The changes on the skeleton will not affect the normal work of YAXX.
Requirements
The version of bison on your system can be shown by:
bison --vesion
If you have not the version 1.875 or above installed, download bison-1.875 and compile it:
tar xvfz bison-1.875.tar.gz cd bison* ./configure make install
For yaxx-0.10 to work, you just need to copy the yaxx.c skeleton to the system path where bison is installed, for example /usr/local/bin/bison.
tar xvfz yaxx-0.10.tar.gz cp yaxx*/bison/data/yaxx.c /usr/share/bison
A working YACC parser has a grammar file with extension .y. In the makefile of the parser, the bison needs to be used with the new skeleton.
bison -S yaxx.c foo.y
We use an ANSI C example to show the usage of YAXX. This document is not indended to show all the examples included in the release, such as Fortran 77, Java, Ada, SQL, XML, Telos, NFR, etc., not to mention hundreds of other YACC grammars.
Input
Command
make
Output
An XML document which corresponds to the syntax tree (main.xml) and a DTD of the above XML document (ansic.dtd).
In main.xml, the name space is defined as yaxx, the root element is file which is the RHS of the starting rule in the YACC grammar.
In ansic.dtd, the non-terminals are generated as composite elements <!ELEMENT lhs (rhs...)>; the terminals are generated as leaf element with string value <!ELEMENT term (#PCDATA)>.
In this way, the main.xml can be verified against the ansic.dtd by any XML parser.
All you really need is a version of bison-1.875 and a GNU C compiler on a Unix platform.
To ease the use of YAXX, a Windows user can load the project into an IDE to view/edit/compile/test the examples. Unfortunately, Microsoft Visual C++ was not supported by the Bison developers. Therefore we suggest to use the Eclipse 2.0.2 IDE sponsored by IBM for the open source community. It supports C/C++ developments by CDT toolkit. For CDT to work, you need to install CygWin and make sure the cygwin/bin path is set to your PATH environment variable. Since CDT uses Make mechanism instead of Ant builder for the Java development, it is rather simple to integrate the legacy project makefiles.
The GNU/CC 3.4.0 grammar has been included. To get it work, an extension has been done to the yyoverflow macro in the c-parse.in source, which is used to generate the real YACC grammar file c-parse.y. In addition, a bug fix for NULL pointer test is made in the skeleton file yaxx.c.
For example, the crtstuff.c has been parsed and a 2684305 byte XML file is generated by the yaxx extended GCC, when GCC is compiled using xgcc during the second compilation phase. However, we still need to find out why the generated DTD is not completely right yet.
Installation: since GCC 3.4.0 is a big tarball (36M), we do not include it in the yaxx release. To make it work, you need to download it from the official GCC web site: http://mirrors.rcn.net/pub/sourceware/gcc/releases/gcc-3.4.0/gcc-3.4.0.tar.gz. Then replace the gcc-3.4.0/gcc/Makefile and the gcc-3.4.0/gcc/c-parse.in after configuring.
Sebastien Fricker contributed a feature to prefix the variables/functions in the generated code. To use the feature, the grammar file (.y) needs to add a statement %option prefix="foo" And the lexical file (.l) need to add some options accordingly: %option prefix="foo" %option outfile="lex.yy.c" %option noyywrap Also, be careful to rename yyparse() with "foo"parse() and rename yylex with "foo"lex. The grammar examples have been updated accordingly.
The following bugs are fixed by Sebastien Fricker: (1) Memory allocation from stack does not work for large files. (2) The replace_entity process for & is not terminating.
A complete list of grammars supported in YAXX are: Ada 95 (written by Intermetrics, Inc., contributed by Myo Thein); ANSI C (contributed by Yijun Yu); BibTeX (used by bibtex2-0.95 written by Henk Muller, contributed by Kristof Beyls); Fortran 77 (used by F2C/FPT team, contributed by Erik H. D'Hollander); Java (written by Dmitri Bronnikov, contributed by Yijun Yu); NFR (written by Yijun Yu, contributed by Yijun Yu); SQL/embedded SQL (written by J.Levine, contributed by Yijun Yu); Telos (used by Yijun Yu based on ConceptBase V5.2.3, contributed by Yijun Yu); XML(written by Jean-Jacques Girardot, contributed by Yijun Yu).
We plan to add support to Bison grammar and GCC grammars in the next release. Any help/contribution is appreciated.
The Ada95 support is added according to the user request. Two bugs have been fixed, one related to the DTD generation array bound error, the other one is enlarging the dynamic allocation size of temporary buffer.
Adapt YAXX based on Bison-1.875 instead of on Bison-1.35.
Bison-1.875 reorganized the bison.simple as a skeleton, i.e. a template that is expanded by the M4 macro languages.
We adapt the changes on the bison.simple to the corresponding skeleton yacc.c to obtain a new skeleton yaxx.c. Besides, there are also a few changes to the internal tables generated by the newer Bison version.
Such a change makes it easy to call bison using the following command bison -S yaxx.c foo.y The DTD rules and a XML-generating parser is generated. Invoking the yyparse() will generate the XML parse tree of the program.
All the examples are tested to ensure they work with the new skeleton.
The DTD element rules with the same left hand side are merged to a single element rule in order to satisfy the new XSLTPROC engine when the XSL stylesheet is applied.
Because the bison invocation is simplified (not necessary to specify the environment variable for the bison.simple template), the compile.sh are removed and it's function is replaced by Makefile.
The Windows platform is supported through the CYGWIN. The Eclipse provides an IDE for the development using CDT project. In the experiments, the Eclipse SDK 2.0.2 and CDT for Windows are first installed. Then the CYGWIN with GCC 3.2 should also be installed to make the IDE work on Windows platform.
The XML document includes a processing instruction to invoke an XSLT stylesheet yaxx.xsl that transforms it back into a valid correct program.
The Fortran lex analyzer is modified to report the value of terminals. The SLABEL terminal is treated specially to add a new line before each statement.
The text that contains XML special entities such as '&', '>' and '<' are replaced with "&", ">" and "<" respectively.
The XML grammar from Lex and Yacc course is added.
Solved problem for '+' like terminals in both XML and DTD.
The SQL grammar from J.Levine et al Lex and Yacc, 2nd edition is added.
The DTD output is produced.
The XML output is adjusted to include the DTD and name space declaration.