Edgar Huckert: Computational Linguistics
A context free parser for natural languages
This is based on very old code that I wrote around the early 80s when I left the
university of Heidelberg after having worked for Prof. Klaus Brockhaus (see my remarks below).
The first C compiler I used then was Ron Cains "Small C" for 8080-based processors.
The program is totally command-line oriented -
no fancy graphics! Here is a
description in PDF
(by the way generated via the program txtpdf mentioned above).
Features of this NL-parser include:
- the included source code files (Standard ANSI C) are for Windows and Linux.
The executables are for Windows 32 Bit only.
Under Linux the programs must be compiled to produce executables - this is a trivial task. The main component
hupars.c is very short (not more than 70KB code) and very fast. Note
that the code may generate warnings when compiled under Windows or UNIX - this
is C standard from the eighties. Under UNIX/Linux you may have to translate
umlauts and accents und convert capital letters to small letters.
- Uses external grammars and lexica in ANSI format. The notation for the rules
ressembles Backus-Naur
notation (or classical Chomsky normal form) enhanced by
"complex notation" to cope for congruence phenomena
usually found in many indo-european languages.
- Can be used for many languages
- Offers a simple morphological component (endings only). A more powerful
morphological component will soon be presented here.
- A medium size German grammar and two simpler sample grammars (English and
French) are included.
- I include some utility functions like a syntax checker (syncheck.c),
and a sort-and-merge program (HUWbsort.c) .
The dictionary maintenance program presented below can be used to
build and correct dictionaries
The performance of this small parser is not bad: on a vanilla INTEL based Linux system I
analyzed a small french corpus (860 small sentences with up to 12 words per sentence)
in 70 ms! This was done using my small french sample grammar with 40 rules and
120 lexicon entries. The number of lexicon entries
is not important for the performance of the parser. I got similar achievements with large lexicons
(more than 9000 entries).
A large part of the theoretical background presented in the above mentioned parsers has been
developed by Prof. Klaus Brockhaus (1933-2011). He also wrote the very first version
of this parser in PL/I - I later rewrote this in C. He worked at the universities of Münster,
Heidelberg and Berlin. He was the first one to write practical and readable
grammars for substantial extracts of German and English - and by
"practical" I mean that the number of rules was held in a reasonable amount using
his "complex notation" for categories and rule constraints.
A typical parsing result (subcategories not shown)
The parsing result shown here - a bracket notation and a parsing tree - is based on the
small french grammar (40 rules) contained in the zip file. The output of the parser can
be internationalized by loading language specific message files and filters. If the input sentence
was ambiguous (on the base of the given grammar) then multiple bracket notations and multiple parsing trees are produced.
Dictionary maintenance
This dictionary maintenance program
may to be used under Windows to maintain the dictionaries
used in my context free
parser for natural languages (see above). It assumes that that the entries are written in
complex notation (see the parser above). It can probably used
also in other contexts (linguistics). The entries must have the general form key=value (see
the sample lexicon). The program includes:
- a pure Windows program dictmant.exe and dictmant.cpp (the source: uses pure Windows SDK, nor MFC nor any other class library)
- make files for Digital Mars C++ and Microsoft C++, a definition file, a ressource file
- surrounding files: an ini file, a sample dictionary (German)
This program is very basic. For more sophisticated you may have to modify and enhance it.
Screen dump for program Dictmant
In my chapter "Programming in D" I have included a rather short D program
(in subchapter "Input, output, directories, Unicode")
that can be used as a base for a similar program.
This program supposes UTF8 encoding. It has however no GUI component.
Tokenizer
A tokenizer is normally the first step in linguistic applications like
parsers or spell checkers. A tokenizer analyzes text and produces isolated words
from the texts. My version adds additional information:
- token coordinates (line number, start, length)
- simple categories (such as "/ca=DATE") if possible in this step.
Note that this is a batch like command line version written in C++ that can/must be enhanced for
the respective purpose. The zip archive contains the source and an executable
for Windows. The program can also be built under Linux/Unix. There is no make file in the archive as
the build process is very trivial.
Sample input and output
Copyright for all images, texts and software on this page: Dr. E. Huckert
Contact
If you want to contact me: this is my
mail address