Automatic Syllabification

I am still sometimes working on [readable.js] and during that I learned many interesting things. The Flesch-Kincaid family of readability metrics involves the number of syllables in a amount of text. At the moment we use a very simple heuristic, found in [Talburt1985] (but, according to [Liang] is already described in a publication by Berg from 1975), which assumes that each non-consecutive vowel is one vowel, with a few exceptions. It is not very accurat, but for a larger segment of text, the errors probably cancel out and the relative errors get negligible. Readability analysis is not a exact science anyway.

Even though for [readable.js] the simple heuristic is accurate enough, I was getting interested in learning about ways how one can automatically syllabify words. I will try to summarize my findings in a series of posts. This post gives an overview over different approaches that have been published, and explains one general principle that is used with slight variations in many of these methods.

Trying to find what other people did

One frequent problem that I often have when I try to understand something from a field I don't know is that I have no idea what search terms to use. In this case it was even more difficult because people use different terms to refer to the same thing or the same terms to refer to different things. For example in some papers hyphenation and syllabification are used exchangeably, elsewhere syllabification happens in the phonetic domain and hyphenation in the orthographic domain and sometimes hyphenation refers to the task of identifying some, but not necessarily all syllable breaks.

But I managed to find out that there are two classes of approaches: The first one is to manually develop a set of rules for syllabification using linguistic insight. This leads to language specific algorithms, and the rules get quite complicated. [Tsylb] is an example that uses such rules.

The other class of approaches is to utilize machine learning, and let the computer algorithmically derive the rules from a set of training data. [NETtalk] seems to be one of the early and important projects and influenced a lot of research. It uses neural networks to carry out a very similar task (letter-to-phoneme mapping). Other machine learning techniques can also be used, like Support Vector Machines (SVM) [Bartlett] or Conditional Random Fields (CRF) [Trogkanis]. And there is syllabification-by-analogy, which apparently performes quite good, but I could not really grasp the idea of the algorithm.

The hyphenation algorithm of [Liang], that is used for example in TeX, is a bit different in that it only tries to give not all correct, but also no wrong hyphenation positions, as a few hyphenation points are usually enough for typesetting. It is also more to the machine learning class, because the used patterns are trained using a big dictionary of syllabified words. There exists a Python implementation [Batchelder].

A conceptually quite simple and nice algorithm is called [IGTree]. It builds a relatively compact tree structure from the training data and can then pretty quickly syllabify a given word.

Tagging problem

The task of syllabification is usually mapped on a tagging problem with two tags, where for each position between two letters in the word, the algorithm has to determine whether a syllable boundary is present at this position or not. In most cases this decision has to be made only from a certain amount of context.

For example, if we choose two letters of context before and after the position and consider the position exa-mple, we have the context "xa mp". If the context extends past the end or beginning of the word, we use the special character "_", so for the position ch-aracter and three letters of context in each direction, we have "_ch ara".

In the simplest case the amount of context is fixed. This is the case for the [IGTree] method. In this case we can pose the problem differently: we want to find a function that maps from the set of possible contexts to the set of tags {0,1}, where 0 indicates the absence and 1 the presence of a syllable break.

In [Liang] a pattern matching procedure is used where patterns can indicate both the absence and presence of a break, by using numerical markers. The patterns contain varying amounts of preceding and following letters, but to connect back to the fixed context approach, we can think of the patterns as equivalence classes of sufficiently large fixed contexts. For example the pattern "n1te" can be seen as the following set of fixed two letter contexts {"an te", "bn te", "cn te", ..., "zn te"}. Using such patterns allows to encode the tagging for a large number of contexts extremely space efficient.

In [Bartlett] more involved tagging schemes are used, where not only two tags (syllable break or no syllable break) are distinguished, but also additional tags encoding additional linguistic information. In [Trogkanis] Conditional Random Fileds are used, which consider also the correlation between tags, by operating on the whole word at once. This avoids for example that two consecutive positions get assigned a break tag.

In the next post I will try to explain Liangs Algorithm and the data structures that are used with it.

[readable.js]

(1, 2) https://github.com/jreinhardt/readable.js/

[Talburt1985]

Talburt, J. The Flesch index: An easily programmable readability analysis algorithm Proceedings of the 4th annual international conference on Systems documentation, ACM, 1985, 114-122

[Webster]

The Gutenberg Webster's Unabridged Dictionary by Project Gutenberg http://www.gutenberg.org/ebooks/673

[NETtalk]

http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

[Bartlett]

(1, 2) Bartlett, S.; Kondrak, G. & Cherry, C. Automatic Syllabification with Structured SVMs for Letter-to-Phoneme Conversion Proceedings of ACL-08: HLT, Association for Computational Linguistics, 2008, 568-576

[Sba]

Marchand, Y.; Adsett, C. R. & Damper, R. I. Automatic Syllabification in English: A Comparison of Different Algorithms Language and Speech, 2009, 52, 1-27

[IGTree]

(1, 2) Daelemans, Walter, A. v. d. B. & Weijters, T. IGTree: Using Trees for Compression and Classification in Lazy Learning Algorithms Artificial Intelligence Review, 1997, 11, 407-4023

[Batchelder]

http://nedbatchelder.com/code/modules/hyphenate.html

[Trogkanis]

(1, 2) Trogkanis, N. & Elkan, C. Conditional random fields for word hyphenation Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, 366-374

[Liang]

(1, 2, 3) Liang, F. M. Word Hy-phen-a-tion by Com-put-er Stanford University, Department of Computer Science, 1983

[Tsylb]

http://www.itl.nist.gov/iad/mig/tools/

The Pseudotron