EH Computational Linguistics

Edgar Huckert: Computational Linguistics

A context free parser for natural languages

This is based on very old code that I wrote around the early 80s when I left the university of Heidelberg after having worked for Prof. Klaus Brockhaus (see my remarks below). The first C compiler I used then was Ron Cains "Small C" for 8080-based processors. The program is totally command-line oriented - no fancy graphics! Here is a description in PDF (by the way generated via the program txtpdf mentioned above). Features of this NL-parser include:

the included source code files (Standard ANSI C) are for Windows and Linux. The executables are for Windows 32 Bit only. Under Linux the programs must be compiled to produce executables - this is a trivial task. The main component hupars.c is very short (not more than 70KB code) and very fast. Note that the code may generate warnings when compiled under Windows or UNIX - this is C standard from the eighties. Under UNIX/Linux you may have to translate umlauts and accents und convert capital letters to small letters.
Uses external grammars and lexica in ANSI format. The notation for the rules ressembles Backus-Naur notation (or classical Chomsky normal form) enhanced by "complex notation" to cope for congruence phenomena usually found in many indo-european languages.
Can be used for many languages
Offers a simple morphological component (endings only). A more powerful morphological component will soon be presented here.
A medium size German grammar and two simpler sample grammars (English and French) are included.
I include some utility functions like a syntax checker (syncheck.c), and a sort-and-merge program (HUWbsort.c) . The dictionary maintenance program presented below can be used to build and correct dictionaries

The performance of this small parser is not bad: on a vanilla INTEL based Linux system I analyzed a small french corpus (860 small sentences with up to 12 words per sentence) in 70 ms! This was done using my small french sample grammar with 40 rules and 120 lexicon entries. The number of lexicon entries is not important for the performance of the parser. I got similar achievements with large lexicons (more than 9000 entries).

A large part of the theoretical background presented in the above mentioned parsers has been developed by Prof. Klaus Brockhaus (1933-2011). He also wrote the very first version of this parser in PL/I - I later rewrote this in C. He worked at the universities of Münster, Heidelberg and Berlin. He was the first one to write practical and readable grammars for substantial extracts of German and English - and by "practical" I mean that the number of rules was held in a reasonable amount using his "complex notation" for categories and rule constraints.

A typical parsing result (subcategories not shown)

The parsing result shown here - a bracket notation and a parsing tree - is based on the small french grammar (40 rules) contained in the zip file. The output of the parser can be internationalized by loading language specific message files and filters. If the input sentence was ambiguous (on the base of the given grammar) then multiple bracket notations and multiple parsing trees are produced.

Dictionary maintenance

This dictionary maintenance program may to be used under Windows to maintain the dictionaries used in my context free parser for natural languages (see above). It assumes that that the entries are written in complex notation (see the parser above). It can probably used also in other contexts (linguistics). The entries must have the general form key=value (see the sample lexicon). The program includes:

a pure Windows program dictmant.exe and dictmant.cpp (the source: uses pure Windows SDK, nor MFC nor any other class library)
make files for Digital Mars C++ and Microsoft C++, a definition file, a ressource file
surrounding files: an ini file, a sample dictionary (German)

This program is very basic. For more sophisticated you may have to modify and enhance it.

Screen dump for program Dictmant

In my chapter "Programming in D" I have included a rather short D program (in subchapter "Input, output, directories, Unicode") that can be used as a base for a similar program. This program supposes UTF8 encoding. It has however no GUI component.

Tokenizer

A tokenizer is normally the first step in linguistic applications like parsers or spell checkers. A tokenizer analyzes text and produces isolated words from the texts. My version adds additional information:

token coordinates (line number, start, length)
simple categories (such as "/ca=DATE") if possible in this step.

Note that this is a batch like command line version written in C++ that can/must be enhanced for the respective purpose. The zip archive contains the source and an executable for Windows. The program can also be built under Linux/Unix. There is no make file in the archive as the build process is very trivial.

Sample input and output

Contact

If you want to contact me: this is my mail address