In this bachelor’s thesis, we try to classify and identify written human
languages by studying the ordering of letters in text. Automatic
language identification is of interest in areas such as text indexing,
machine translation and natural language parsing.
Eleven written languages which use the Latin alphabet are considered
and modelled with a Markov chain on the letter level. Texts
from the New Testament and Wikipedia are used as training data.
The distances between the languages are then measured by using a
matrix-based metric on the transition matrices, and visualized in a
dendrogram. A probability-based distance measure is also used.
The matrix-based metric is then applied to language identification
by creating a transition matrix for the text whose language is to
be identified, and comparing the distances from this matrix to those
of the known languages; the shortest distance indicates the language
of the text. This is compared with maximum-likelihood classification.
We compare metrics based on different matrix norms, and also
study how the order of the Markov chains and the size of the training
data and sample texts for language identification influence the
results. The results indicate that the choice of matrix norm is important
and that the Frobenius norm and the 1-norm are the best norms
for language classification and language identification. Using these,
it is possible to generate satisfactory dendrograms, and accurately
identify the language of reasonably large texts. On the other hand,
1-norm cannot be recommended in this context; an explanation
is given for its bad performance.
Some languages are easier to classify correctly than others; the
Scandinavian languages are easy to group together, as are Spanish,
Portuguese and Italian. However, English, French, German and
Finnish are harder to classify correctly.
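
The two identification approaches described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function names, the 27-symbol alphabet (26 letters plus space), the add-one smoothing and the restriction to first-order chains are all assumptions made for illustration. The Frobenius norm is used because it is one of the norms named above.

```python
import math
from collections import Counter

# Assumed state space: 26 lowercase letters plus space (illustrative only).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def letter_pairs(text):
    """Return consecutive letter pairs, dropping characters outside ALPHABET.

    Simplification: dropped characters make their neighbours adjacent.
    """
    chars = [ch for ch in text.lower() if ch in INDEX]
    return list(zip(chars, chars[1:]))


def transition_matrix(text, smoothing=1.0):
    """Estimate a first-order transition matrix from letter-pair counts.

    The additive smoothing is an assumption; it keeps every row a valid
    probability distribution even for letters that never occur.
    """
    counts = Counter(letter_pairs(text))
    matrix = []
    for a in ALPHABET:
        row = [counts[(a, b)] + smoothing for b in ALPHABET]
        total = sum(row)
        matrix.append([c / total for c in row])
    return matrix


def frobenius_distance(p, q):
    """Frobenius norm of the difference of two transition matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_p, row_q in zip(p, q)
                         for x, y in zip(row_p, row_q)))


def log_likelihood(text, matrix):
    """Log-probability of the letter transitions in text under a model."""
    return sum(math.log(matrix[INDEX[a]][INDEX[b]])
               for a, b in letter_pairs(text))


def identify_by_distance(sample, models):
    """Pick the language whose matrix is closest to the sample's matrix."""
    sample_matrix = transition_matrix(sample)
    return min(models,
               key=lambda lang: frobenius_distance(sample_matrix, models[lang]))


def identify_by_likelihood(sample, models):
    """Pick the language under which the sample is most probable."""
    return max(models, key=lambda lang: log_likelihood(sample, models[lang]))
```

Here `models` is assumed to be a dictionary mapping a language name to the transition matrix estimated from that language's training text. The pairwise values of `frobenius_distance` between the trained matrices are the kind of input a hierarchical clustering routine would need to draw a dendrogram.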
Keywords: Written human languages, Language classification, Language
identification, Markov chain model, Matrix norms, Statistical
analysis of text.
2013. 70 p.