Edgar Huckert: Computational Linguistics


A context free parser for natural languages

This is based on very old code that I wrote in the early 80s when I left the University of Heidelberg after having worked for Prof. Klaus Brockhaus (see my remarks below). The first C compiler I used then was Ron Cain's "Small C" for 8080-based processors. The program is completely command-line oriented - no fancy graphics! Here is a description in PDF (by the way, generated via the program txtpdf mentioned above). Features of this NL parser include:

The performance of this small parser is not bad: on a vanilla Intel-based Linux system I analyzed a small French corpus (860 short sentences with up to 12 words per sentence) in 70 ms! This was done using my small French sample grammar with 40 rules and 120 lexicon entries. The number of lexicon entries hardly affects the performance of the parser: I got similar results with large lexicons (more than 9000 entries).
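To give an idea of what a chart-based context-free parser does internally, here is a minimal C++ sketch of a CYK-style recognizer for a toy French grammar in Chomsky normal form. This is only an illustration and not the code of my parser: the toy grammar, the lexicon and the category names are invented for this example, and a real grammar in complex notation is considerably richer.

// Minimal CYK recognizer for a toy grammar in Chomsky normal form.
// Illustration only - not the actual parser described on this page.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // Toy lexicon: word -> preterminal category
    std::multimap<std::string, std::string> lexicon = {
        {"le", "DET"}, {"la", "DET"},
        {"chat", "N"}, {"souris", "N"},
        {"voit", "V"}
    };
    // Toy binary rules: (B, C) -> A, to be read as A -> B C
    std::multimap<std::pair<std::string, std::string>, std::string> rules = {
        {{"DET", "N"}, "NP"},
        {{"V", "NP"}, "VP"},
        {{"NP", "VP"}, "S"}
    };

    std::vector<std::string> words = {"le", "chat", "voit", "la", "souris"};
    size_t n = words.size();
    // chart[i][j] holds the categories covering words i .. i+j
    std::vector<std::vector<std::set<std::string>>> chart(
        n, std::vector<std::set<std::string>>(n));

    // Fill the length-1 spans with the lexical categories
    for (size_t i = 0; i < n; ++i) {
        auto range = lexicon.equal_range(words[i]);
        for (auto it = range.first; it != range.second; ++it)
            chart[i][0].insert(it->second);
    }
    // Combine shorter spans into longer ones
    for (size_t len = 1; len < n; ++len)            // span length minus 1
        for (size_t i = 0; i + len < n; ++i)        // span start
            for (size_t k = 0; k < len; ++k)        // split point
                for (const auto& b : chart[i][k])
                    for (const auto& c : chart[i + k + 1][len - k - 1]) {
                        auto r = rules.equal_range({b, c});
                        for (auto it = r.first; it != r.second; ++it)
                            chart[i][len].insert(it->second);
                    }

    bool ok = chart[0][n - 1].count("S") > 0;
    std::cout << (ok ? "sentence accepted" : "sentence rejected") << std::endl;
    return 0;
}

The triple loop over span length, start position and split point is what keeps the run time polynomial in the sentence length, independently of the size of the lexicon.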

A large part of the theoretical background behind the above-mentioned parser was developed by Prof. Klaus Brockhaus (1933-2011). He also wrote the very first version of this parser in PL/I - I later rewrote it in C. He worked at the universities of Münster, Heidelberg and Berlin. He was the first to write practical and readable grammars for substantial fragments of German and English - and by "practical" I mean that the number of rules was kept within reasonable bounds thanks to his "complex notation" for categories and rule constraints.

A typical parsing result (subcategories not shown)

The parsing result shown here - a bracket notation and a parse tree - is based on the small French grammar (40 rules) contained in the zip file. The output of the parser can be internationalized by loading language-specific message files and filters. If the input sentence is ambiguous (with respect to the given grammar), multiple bracket notations and multiple parse trees are produced.
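The bracket notation itself is easy to generate once a parse tree is available. The following C++ sketch shows the principle; the Node structure and the example tree are invented for this illustration and do not reflect the internal data structures of my parser.

// Minimal sketch: printing a parse tree in bracket notation.
// The Node structure and the example tree are invented for illustration.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string category;                        // e.g. "NP", "DET", or a word
    std::vector<std::shared_ptr<Node>> children; // empty for terminal nodes
};

// Recursively emit "[CAT child1 child2 ...]" or just the terminal itself.
void printBrackets(const std::shared_ptr<Node>& n, std::ostream& out) {
    if (n->children.empty()) {
        out << n->category;
        return;
    }
    out << "[" << n->category;
    for (const auto& child : n->children) {
        out << " ";
        printBrackets(child, out);
    }
    out << "]";
}

int main() {
    auto leaf = [](const std::string& w) {
        return std::make_shared<Node>(Node{w, {}});
    };
    auto node = [](const std::string& cat,
                   std::vector<std::shared_ptr<Node>> ch) {
        return std::make_shared<Node>(Node{cat, std::move(ch)});
    };
    // Prints: [S [NP [DET le] [N chat]] [VP [V voit] [NP [DET la] [N souris]]]]
    auto tree = node("S", {
        node("NP", {node("DET", {leaf("le")}), node("N", {leaf("chat")})}),
        node("VP", {node("V", {leaf("voit")}),
                    node("NP", {node("DET", {leaf("la")}),
                                node("N", {leaf("souris")})})})
    });
    printBrackets(tree, std::cout);
    std::cout << std::endl;
    return 0;
}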


Dictionary maintenance

This dictionary maintenance program can be used under Windows to maintain the dictionaries used in my context-free parser for natural languages (see above). It assumes that the entries are written in complex notation (see the parser above). It can probably also be used in other linguistic contexts. The entries must have the general form key=value (see the sample lexicon and the sketch below). This program is very basic: for more sophisticated purposes you may have to modify and enhance it.
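The following short C++ sketch shows how such key=value entries can be loaded and looked up. It is only an illustration under simple assumptions: the file name lexicon.txt is hypothetical, several entries per key (homographs) are allowed, and the complex notation on the value side is not interpreted here.

// Minimal sketch: loading a key=value lexicon and looking up an entry.
// The file name and the handling of the value side are assumptions.
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main(int argc, char* argv[]) {
    const char* fileName = (argc > 1) ? argv[1] : "lexicon.txt"; // hypothetical name
    std::ifstream in(fileName);
    if (!in) {
        std::cerr << "cannot open " << fileName << std::endl;
        return 1;
    }

    // A word may have several entries (homographs), hence a multimap.
    std::multimap<std::string, std::string> lexicon;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;   // skip blank/comment lines
        std::string::size_type pos = line.find('=');
        if (pos == std::string::npos) continue;         // ignore malformed lines
        lexicon.insert({line.substr(0, pos), line.substr(pos + 1)});
    }

    // Look up a key and print all its values.
    std::string key = "chat";
    auto range = lexicon.equal_range(key);
    if (range.first == range.second) {
        std::cout << key << ": not found" << std::endl;
    } else {
        for (auto it = range.first; it != range.second; ++it)
            std::cout << it->first << " = " << it->second << std::endl;
    }
    return 0;
}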

Screen dump for program Dictmant

In my chapter "Programming in D" I have included a rather short D program (in the subchapter "Input, output, directories, Unicode") that can be used as a basis for a similar program. That program assumes UTF-8 encoding. It has, however, no GUI component.


Tokenizer

A tokenizer is normally the first step in linguistic applications such as parsers or spell checkers. A tokenizer analyzes text and produces isolated words (tokens) from it. My version adds some additional information. Note that this is a batch-like command-line version written in C++ that can/must be enhanced for the respective purpose. The zip archive contains the source code and an executable for Windows. The program can also be built under Linux/Unix. There is no makefile in the archive as the build process is trivial.
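As an illustration of the basic idea - not the code contained in the archive - here is a minimal C++ sketch that isolates words from a text and, as one example of such additional information, records the line and column where each word starts. A real tokenizer must of course also deal with accented characters and UTF-8 input, which this sketch ignores.

// Minimal tokenizer sketch: isolates words and records where they start.
// Illustration only - not the C++ tokenizer contained in the zip archive.
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // In a real application the text would come from a file or from stdin.
    std::string text = "Le chat voit la souris.\nLa souris ne voit rien!";
    std::istringstream in(text);

    std::string line;
    int lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        std::size_t i = 0;
        while (i < line.size()) {
            // Skip everything that is not a letter (spaces, punctuation, digits).
            if (!std::isalpha(static_cast<unsigned char>(line[i]))) { ++i; continue; }
            std::size_t start = i;
            while (i < line.size() &&
                   std::isalpha(static_cast<unsigned char>(line[i])))
                ++i;
            // Emit the token together with its position (line:column).
            std::cout << lineNo << ":" << (start + 1) << "\t"
                      << line.substr(start, i - start) << std::endl;
        }
    }
    return 0;
}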

Sample input and output


Copyright for all images, texts and software on this page: Dr. E. Huckert

Contact

If you want to contact me: this is my mail address