This is a small, hand-maintained list of automated text processing tools.

General-Purpose Preprocessors

  • m4 - a macro language with some open-source implementations, including GNU m4. (I personally find it very vile.)

  • GPP - a general-purpose preprocessor. Supports several alternative syntax modes. Open source (GPL).

  • filepp - an adaptation and extension of the C preprocessor for general-purpose use. Written in Perl. Open source (GPL-2-or-later).

  • chpp (Chakotay Preprocessor) - a powerful preprocessor that aims to be non-intrusive, and which can be considered a full-fledged programming system. Has been unmaintained since 1999. Open source (GPLv2).

General-Purpose Template Systems

  • Template Toolkit - a flexible and highly extensible template processing system for Perl. Open source (same terms as Perl).

  • ClearSilver - a language-agnostic and fast templating system written in C.

  • Jinja2 - a “full-featured” template engine for Python 2 and Python 3. Open source under a BSD-style licence.

  • Tenjin - “the fastest template engine in the world” - available for several dynamic languages.

  • eRuby - a Ruby-based template system with several implementations. Open source.

  • Smarty - a PHP Template Engine. Open Source.

  • HTML-Template and Text-Template - two other CPAN template systems popular in the Perl world. Open Source.

  • Cheetah - a Python-Powered Template Engine. “Fast, Flexible, Powerful”. Open Source. Has been unmaintained since 2010 and does not support Python 3.
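
For readers new to the idea, the core of any template system is substituting named values into placeholder slots. The following is a minimal sketch using string.Template from Python's standard library. It is not one of the engines listed above, which add loops, conditionals, includes and much more, and the placeholder names are made up for illustration.

```python
from string import Template

# A template with two named placeholders (hypothetical names).
template = Template("Hello, $name! You have $count new messages.")

# substitute() fills in every placeholder and raises KeyError if one is missing.
print(template.substitute(name="Alice", count=3))
# Hello, Alice! You have 3 new messages.

# safe_substitute() leaves unknown placeholders untouched instead of raising.
print(template.safe_substitute(name="Alice"))
# Hello, Alice! You have $count new messages.
```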

Parser Generators

  • Yacc - an LALR parser generator standard, with popular implementations such as Berkeley Yacc (byacc) (open source, public domain) and GNU Bison (open source, GPL).

  • ANTLR - “ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.” Open Source (3-clause BSD licence).

  • Parse-RecDescent - a parser-generator for Perl 5. Open source (same terms as Perl).

  • Marpa - a parser that aims to be able to parse everything in BNF. Open source (LGPL-version-3-or-later).

  • SGLR, the Scannerless Generalized LR Parser.

  • Regexp::Grammars - “Add grammatical parsing features to Perl 5.10 regexes”.

  • Parser::MGC - build simple Recursive-Descent parsers in Perl.

  • Lemon Parser Generator - an LALR parser generator for C that is maintained as part of the SQLite project. Open source (public domain).

Regular Expression Libraries

Diffing and Patching Tools

  • GNU Diffutils - an open source (GPLv3+) package which provides diff and other programs.

  • GNU patch - apply a patch/diff file. Open source (GPLv3+).

  • patchutils - a small collection of programs that operate on patch files. Open source.

  • comm - a UNIX command that compares two sorted files and reports the lines they share and the lines unique to each.

  • Meld - a GUI diff/merge tool for gtk+. Open source.

  • KDiff3 - a GUI diff/merge tool for KDE. Open source.

  • GNU wdiff - a front-end to GNU diff for comparing files on a word-by-word basis.
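
To illustrate the difference between a line-based diff and a word-based comparison, here is a small sketch using Python's standard difflib module. It only imitates the idea and does not call any of the tools above. The helper names line_diff and word_diff are hypothetical, and the [-...-] and {+...+} markers are chosen here to resemble wdiff's style; treat them as an assumption.

```python
import difflib


def line_diff(old, new):
    """Return a unified diff of two multi-line strings as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm=""))


def word_diff(old, new):
    """Compare two strings word by word, marking deletions and insertions."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


old = "the quick brown fox"
new = "the quick red fox"

# The line-based diff reports the whole line as removed and re-added.
print("\n".join(line_diff(old, new)))

# The word-based comparison pinpoints the single changed word.
print(word_diff(old, new))
# the quick [-brown-] {+red+} fox
```

The line-based view is what patch consumes, while the word-based view is usually easier for a human to read when prose has been edited.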

Specialised Processors

XML Processors

Standard UNIX Text Processing Tools

  • echo - output strings (with some possible transformations).

  • cat - output or concatenate files.

  • cut - extract sections from each line of input.

  • head - output the start of a stream.

  • tail - output the end of a stream.

  • paste - join multiple files horizontally.

  • sort - sorts input.

  • csplit - split files based on context lines.

  • join - merges lines of two files based on commonalities.

  • uniq - collapses adjacent duplicate lines; on sorted input this makes the output unique.

  • grep - search for lines matching regular expressions.

  • sed - stream editor - a mini programming language for text processing, based on the ed text editor.

  • AWK - an even more full-fledged programming language for text processing in UNIX (with some quirks and idiosyncrasies).
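
The point that uniq only collapses adjacent lines is easy to miss, so here is a minimal Python sketch of that behaviour. The function name uniq_adjacent is hypothetical, and the code mimics the default semantics of the tool rather than its actual implementation.

```python
from itertools import groupby


def uniq_adjacent(lines):
    """Collapse runs of identical adjacent lines, keeping one line per run."""
    return [key for key, _group in groupby(lines)]


lines = ["b", "a", "a", "b", "b", "a"]

# Without sorting, repeated lines that are not adjacent survive.
print(uniq_adjacent(lines))          # ['b', 'a', 'b', 'a']

# Sorting first makes duplicates adjacent, which is why the two tools
# are so often chained as a pipeline of sort followed by uniq.
print(uniq_adjacent(sorted(lines)))  # ['a', 'b']
```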

Some General-Purpose Programming Languages with Good Text Processing Support


