You are supplied with a number of representative files from a number of different languages. All languages use the Latin alphabet A to Z; case can be ignored and the only punctuation characters are . , ; : and single space characters separating words. For example, you might have the files FRENCH.1, FRENCH.2 to FRENCH.9, ENGLISH.1 to ENGLISH.6 and so on.
Given a file TEXT in the same format but of unknown origin, how would you go about programming a computer to identify its language?
This question was designed to be extremely open-ended to encourage students to think in broad terms about a range of possibilities, in the context of this (difficult) problem. It was taken from the 1991 IOI in Athens, where it was proposed (but not set) as a programming problem. It was rejected as the relative performance of different programs would be very hard to judge with realistic data. That is, you would have to be an expert in the field to produce a good algorithm!
Language recognition is a subset of pattern recognition, a large and rapidly growing field. The answer given below will only scratch its surface.
I'm hoping to get hold of the original program and test data, which will be made available here.
As a wide range of answers were acceptable, let us consider all the different factors and methods which characterise a language. We will comment in detail later. Bear in mind that the school students doing this problem will have almost no knowledge of statistics and will not know enough Maths to use the advanced classification techniques.
Any student who produced the above list, with short comments on each point (about as much as we could expect in half an hour), would receive very high marks. It should be stressed that the question is not looking for details of a specific algorithm, and it is possible for students to waste a lot of time on this.
As we did not expect students to know many different languages, we expected to find fairly general comments.
Words can be a good indicator, but on their own they have problems. Even fairly common words might not be found in some files. For a good example, "the" is a very common word in English, yet its appearance just now was its first in this text for several paragraphs! Different languages often have words in common (for example, "la" can be found in English (though not very often), French, Italian, and others), and this means that it is not possible simply to look for matching words.
That said, short words in general are quite a good measure. We could easily produce a dictionary of, perhaps, the 30 shortest words found in the training data (or maybe those up to 4 characters in length), together with the frequency of occurrence of each word. Long words are less likely to re-occur and will take too much storage space.
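A minimal sketch of such a dictionary builder in Python; the 4-character cutoff and 30-word limit are the illustrative figures mentioned above, not fixed requirements:

```python
from collections import Counter

def short_word_freqs(text, max_len=4, top_n=30):
    """Build a small dictionary of the most common short words.

    max_len and top_n mirror the illustrative figures above:
    words of up to 4 characters, keeping the 30 most frequent."""
    words = [w.strip(".,;:") for w in text.lower().split()]
    short = [w for w in words if w and len(w) <= max_len]
    return dict(Counter(short).most_common(top_n))

freqs = short_word_freqs("the cat and the dog sat on the mat.")
# "the" occurs three times in this tiny sample
```

In practice the training text for each language would be read from its files and the resulting dictionaries stored for later matching.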
Letters are good, as even relatively small files will contain large numbers of them. Unfortunately, it is often the least frequently found letters that are the best measures, and we will know the least about them! But we could easily build up a database of the relative frequency of occurrence of each letter for each language.
Combinations of letters are also a good indicator, and characterise a language better than individual letter frequency. For example, "sch" is uncommon in English (except "school" and related words), and could be a good way of telling English from German.
Pairs and triples of characters, therefore, could be a useful measure; perhaps including a "word end" symbol would allow us to consider endings as well, as these are extremely characteristic. But we must check their computational size: ignoring word endings, there are 26*26=676 different pairs (as we are told to ignore case) and 17576 different triples. Only some of these would of course be found in real data, but searching through a long list for every character in the unknown file could take a long time! So as with the words, we might choose only to store the more common sequences. Alternatively, we might use our knowledge of pronunciation to suggest alternative storage schemes, such as considering only individual vowels ("aeiou" and possibly "y"), diphthongs (vowel combinations such as "ou"), or all-consonant combinations ("sch", "thr"). But this risks losing information if our external assumptions are wrong.
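One way to sketch the pair and triple counting, using a made-up "#" symbol to stand in for the "word end" marker suggested above:

```python
from collections import Counter

END = "#"  # hypothetical word-end symbol; any non-letter character would do

def ngram_counts(text, n=2):
    """Count letter pairs (n=2) or triples (n=3) across a text,
    appending END to each word so that endings are counted too."""
    counts = Counter()
    for word in text.lower().split():
        word = word.strip(".,;:") + END
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

pairs = ngram_counts("the theory, the thing", n=2)
```

Keeping only the most common entries of such a Counter, as suggested above, would bound both the storage and the search time.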
Word Order and position in a sentence could be tricky to code, and their benefits are not clear. While there is potentially a lot of information here (after all, it is this that gives the text its meaning), we could easily generate a huge amount of data and still find it to be unreliable for characterising a language. Word Length is a good example of a very poor measure. It is far better for telling between different types of text (registers), such as spoken, scientific, or political, than for telling between languages.
Unless finding the correct register is what is required, we should avoid this option.
Given that we only have a limited amount of training data, we have to make some guesses. This means our model may be wrong. So what do we need to ensure that it is as correct as possible?
The training data files must be large enough to represent the whole range of registers (spoken, scientific, political). (If, conversely, we are required to distinguish between these, we will need even more data!)
We cannot estimate the probabilities of different features occurring (or create a dictionary) without a reasonable number of observations. This is of special importance for distinguishing between subtly different data (for instance, English and French).
But in many cases there is much "randomness" within a language, which shows as high variance between samples. When comparing languages, their differences must be much larger than this internal variation (standard deviation). If not, we will be unable to tell reliably between different languages (although we may have a good idea that our decision is likely to be wrong). Without checking specific and uncommon spellings (such as harbour/harbor, colour/color), a human can hardly tell written British English from American English, so how could a machine do so with far less knowledge?
As a "guesstimate", I suggest that we would need to consider 10-100kB of files for each language (about 1500-15,000 words) in order to have a reasonable chance of recognising them.
For this section we assume that we have a set of processed training data and an unknown sample, and a way of producing a list of characteristic numbers, such as frequency of letter occurrence, from each.
We will meet the subtle distinction between parametric methods (which produce a small "model" list of numbers from each file or set of files) and non-parametric methods (which don't). It is also the case that we can consider each data file individually (for example, to tell between registers) for a purely closest (or "nearest-neighbour") fit, or try to gain extra information from the difference between the files (producing a "variance"). This distinction could even be applied to individual sentences if our goal is to create a parametric model.
The final decision may require a combination of "closeness" values taken from different techniques.
Words (from whatever kind of dictionary is built up) will probably be used by simply finding the number of matches between words in the unknown file and the dictionary.
Letters need a more sophisticated treatment. A good method would be based on producing, for each file or language, a list of the number of occurrences of each letter divided by the total number in the file - that is, a value for the expected probability of occurrence of each.
We can then find a measure for the closeness of a match between the list taken from our unknown file and a list from one of the given files, by subtracting one from another, squaring the results (so that they are all positive) and adding these values. If the final sum is square rooted, we have the familiar Pythagorean (or Euclidean) "distance", allowing us to find the "nearest neighbour" file. This is a non-parametric method.
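The letter-frequency lists and the nearest-neighbour comparison just described might be sketched as follows:

```python
import math
import string
from collections import Counter

def letter_probs(text):
    """26-element list of relative letter frequencies,
    ignoring case and non-letter characters."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return [counts[c] / total for c in string.ascii_lowercase]

def distance(p, q):
    """Pythagorean (Euclidean) distance between two probability lists:
    subtract, square, sum, and square-root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbour(unknown, training):
    """training maps a label (file or language name) to its probability list;
    return the label of the closest match."""
    return min(training, key=lambda label: distance(unknown, training[label]))
```

Each training file contributes one probability list, and the unknown file is assigned the label of whichever list it is closest to.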
Alternatively, all the data files of a particular language can be grouped together. This provides a better estimate of the probabilities as there is more data, although it means our program is now unable to tell what register the text is in by matching an individual file. (In fact, its performance probably would never be that good, so this is not a disadvantage.) We can also average, over the files, the squared differences between each file's probabilities and the language's mean, to obtain the variance, and take the square root of this to get a standard deviation.
What this means is that we are extracting a set of parameters (mean and standard deviation) from the file.
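Extracting those two parameters from a language's grouped files might look like this (each file is assumed already reduced to a 26-element probability list, as above):

```python
import math

def language_parameters(prob_lists):
    """Extract per-letter mean and standard deviation from the
    training files of one language.

    prob_lists: one 26-element relative-frequency list per file."""
    n = len(prob_lists)
    means = [sum(f[i] for f in prob_lists) / n for i in range(26)]
    stds = [math.sqrt(sum((f[i] - means[i]) ** 2 for f in prob_lists) / n)
            for i in range(26)]
    return means, stds
```

The (means, stds) pair is the parametric "model" of the language; the individual files can then be discarded.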
Those familiar with statistics will see that, if we assume a Gaussian or Normal distribution for this data, we can scale the values of distance described in the previous paragraph by the standard deviations to get a better idea of how good each letter is at characterising the data. This method only works well if we have lots of files, however. Because of the advanced statistics involved in this discussion, we did not expect students at school level to make comments like this! What is done in practice is to take the covariance matrix from the data vectors (which in this case are the relative probabilities of occurrence), as this gives us more information. Taking the eigenvector decomposition of this (which is tricky for 26 by 26 matrices!) could allow feature extraction. Unfortunately, the size of the data here makes such analysis impossible.
We can deal with combinations in a similar way to individual letters. But now the amount of data is potentially quite large and so the more complicated parametric methods would not be very good. So we would use Pythagorean distance to compare the files.
Finally, it is necessary to produce an overall measure of which language is best. Probably the most effective way to do this is to produce a set of scores for each test, add these up, and then choose the best language (which most closely matches the unknown file) based on this.
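The combination step above can be sketched very simply; the convention that higher scores mean closer matches is an assumption here (distances would first need their sign flipped or be otherwise converted):

```python
def best_language(scores_per_test):
    """Combine per-test scores into a final decision.

    scores_per_test: a list of dicts, one per test (words, letters,
    pairs...), each mapping a language name to a score where higher
    means a closer match.  Sums the scores and picks the top language."""
    totals = {}
    for scores in scores_per_test:
        for language, score in scores.items():
            totals[language] = totals.get(language, 0.0) + score
    return max(totals, key=totals.get)
```

In a fuller program the individual tests might also be weighted according to how reliable each has proved on the training data.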
Write a program which characterises your language based on, for example, pairs or triples of letters or punctuation, together with word length, and give it some text files to learn from.
Now adapt it to produce random (but representative) text, and read it out. Is this how Lewis Carroll got his Jabberwock poem?
'Twas brillig and the slithey toves
Did gyre and gymbol in the wabe.
All mimsy were the borogroves
And the mome wraths outgrabe.
(Apologies for any Alice fans who remember this better than I do.)
If you are feeling mischievous and your program's "poetry" is good, you could really confuse someone who is trying to learn your language...!
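One minimal sketch of such a generator, sampling each next letter in proportion to how often it followed the current one in the training text (the fixed seed keeps the nonsense reproducible):

```python
import random
from collections import Counter, defaultdict

def train_pairs(text):
    """For each letter, count which letters follow it in the training text."""
    following = defaultdict(Counter)
    for word in text.lower().split():
        for a, b in zip(word, word[1:]):
            following[a][b] += 1
    return following

def babble(following, start, length, seed=0):
    """Generate 'representative' nonsense, one letter at a time."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        counts = following.get(out[-1])
        if not counts:
            break
        letters, weights = zip(*counts.items())
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)
```

Trained on enough English, this produces strings that look pronounceable without meaning anything, much as the Jabberwocky words do.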
For more examples and an introduction to this field, see Claude E. Shannon's book, The Mathematical Theory of Communication (1949).
Actually, the method described above won't work very well. A better way to analyse the problem uses conditional probability: consider the probability of a certain character appearing given the previous one or two characters. You will need quite a large amount of training data for this, and you are likely to run into the 64k data structure limit on the PC, if programming in Pascal. (You can get around this by using smaller elements and arrays of pointers to them.)
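A sketch of this conditional approach: estimate P(next letter | previous letter) for each language, then score the unknown text by summed log probabilities. The small "floor" probability for unseen pairs is an ad-hoc choice for this sketch, not part of the original description:

```python
import math
from collections import Counter, defaultdict

def train_conditional(text):
    """Estimate P(next letter | previous letter) from training text."""
    pair_counts = defaultdict(Counter)
    for word in text.lower().split():
        for a, b in zip(word, word[1:]):
            pair_counts[a][b] += 1
    return {a: {b: c / sum(counts.values()) for b, c in counts.items()}
            for a, counts in pair_counts.items()}

def log_likelihood(text, model, floor=1e-6):
    """Score text under one language's model; higher means a better fit.
    Pairs never seen in training get a small floor probability, not zero."""
    score = 0.0
    for word in text.lower().split():
        for a, b in zip(word, word[1:]):
            score += math.log(model.get(a, {}).get(b, floor))
    return score
```

The unknown file is then assigned to whichever language's model gives it the highest log likelihood. Conditioning on the previous two characters works the same way but with far more entries, hence the data-structure size problem mentioned above.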
The field of language understanding, probably the best basis for telling between different languages, is growing rapidly. You can even buy software designed to help with translation by almost doing it for you. For those interested, many research establishments have jobs in this area.
The techniques of classification can also be applied in speech recognition applications, computer vision, and elsewhere. More research jobs!