Changing “his” to “her”: Teaching Computers English
If you were to convert the sentence “She did a great job” to a male version, you’d have a pretty easy time coming up with “He did a great job”. Converting “Her vocabulary is very good!” would probably be easy too, just like “I gave her five stars as a reward” would come naturally. It’s remarkable that a computer would excel at the first sentence while failing horribly at the latter two. In this post, I will explain why language processing is surprisingly complicated for computers and how I solved this problem for the FeedbackPanda platform.
At FeedbackPanda, we offer our users who are teaching English as a second language a system to quickly create student feedback for the lessons they’re teaching. The core feature is the ability to create templates to generate lesson-specific feedback. These templates come in two versions, one for male students and one for female students. To make it faster for our teachers to create such templates, I wanted to build a system to automatically “translate” between the male and the female version. So I set out to “just build it”.
The naive approach
Initially, I thought about how a human would do it. Between “he” and “she”, it’s a simple substitution. That was very easily programmed and worked well. Encouraged by this, I built a list of text substitutions. When I reached “him” and “his”, I noticed that both words would translate into “her”. My enthusiasm crumbled at that point: while this would work for male-to-female translations, it would fail for female-to-male. How would I distinguish the possessive “her” as in “she read her book” from the object pronoun “her” as in “I gave her a book”? The program certainly couldn’t.
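That naive lookup can be sketched in a few lines of Python (a hypothetical reconstruction for illustration, not the actual FeedbackPanda code):

```python
# Naive word-for-word substitution table. Note the collision:
# both "him" and "his" map to "her", so the reverse direction
# (female to male) has no entry for "her" at all.
SUBSTITUTIONS = {"she": "he", "he": "she", "his": "her", "him": "her"}

def naive_translate(sentence):
    return " ".join(SUBSTITUTIONS.get(word, word)
                    for word in sentence.split())

print(naive_translate("she did a great job"))  # works: "he did a great job"
print(naive_translate("he read his book"))     # works: "she read her book"
# But going female to male, "her" is ambiguous and stays untranslated:
print(naive_translate("she read her book"))    # "he read her book"
```

The last line shows exactly where the table-based approach collapses: no lookup table can tell the possessive "her" from the object pronoun "her".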
Why machines don’t understand
The computer only sees letters in a text, not meaning. We humans grasp meaning in everyday language so easily, but for a machine, this is one of the most complicated endeavours. A program can manipulate millions of letters in a text within milliseconds, but it has no concept of context. This is why automated translation systems still struggle, even though computing power and storage space have gone through the roof over the last decades. But something happened that allowed for progress.
Enter Machine Learning: Teaching the Computer English
Have you ever tried translating a piece of Chinese text into English using a service like Google Translate? It usually produces incomprehensible gibberish or something barely understandable. But if you try translating French to English, it’s usually pretty good. Why is that? One would assume it’s because French and English are very similar, while Chinese is conceptually and grammatically very different from English. While that may be true, for computers it barely matters.
What matters is called “the corpus”. In a nutshell, it’s the essential ingredient in linguistic machine learning, usually called Natural Language Processing. It works like this:
1. You collect as many texts as possible with identical content in both languages.
2. You feed them into a machine learning system.
3. The machine spends hours, sometimes days, on statistical calculations and arrives at a “model”, which it uses to translate.
Now you can feed it text in one language and it will provide a translation!
But what influences translation quality? There is a lot of math behind machine learning, and I won’t go there. The most important factor in producing a good model is the amount of text that is fed into the system. And that text collection is the so-called “corpus”.
The reason why English-to-French translation works is that there is a lot of corpus. The European Union publishes every single law, all public documents and journals in 24 different languages, openly and for free. They have to, so that each member state has the same clarity when it comes to interpretation. This means that there is a gigantic corpus of text in 24 languages, which any machine learning system can ingest to create a translation model. There is no such treasure trove of English/Chinese documents. In general, a text is more likely to be published in both English and French than in both English and Chinese. This explains why it is so easy to produce a usable model for English-to-French, but so hard to build one for Chinese.
But where to find the corpus?
After reading up on all this, I researched existing tools that could help me build something for male-to-female translation and vice versa. I found that if I thought of “male English” and “female English” as distinct languages, this would be just another language-to-language translation. The tools I found all required a corpus for the two languages, so I needed to find some data to generate these text collections from.
And I found it within our own system. Our users had already generated templates, both male and female, with only the pronouns differing. So with only a couple of hundred different texts, I built the first prototype using a Statistical Machine Translation (SMT) system called Moses. The results were very encouraging! It would even translate “her” to “him” and to “his” correctly — some of the time. To find out why, I had to look into what Statistical Machine Translation actually does.
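Preparing the training data is mostly a matter of writing the template pairs out as two sentence-aligned files, which is what Moses-style trainers expect. A rough sketch, with `template_pairs` standing in for rows pulled from our database:

```python
# Write one male file and one female file with matching line numbers:
# line N of each file must be the same template in both "languages".
template_pairs = [
    ("He did a great job.", "She did a great job."),
    ("I gave him five stars as a reward.", "I gave her five stars as a reward."),
]

with open("corpus.male", "w") as male_file, \
     open("corpus.female", "w") as female_file:
    for male_text, female_text in template_pairs:
        male_file.write(male_text.lower() + "\n")     # lowercasing is a
        female_file.write(female_text.lower() + "\n") # common normalization step
```

The alignment is the whole point: the trainer learns by comparing line N of one file against line N of the other.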
Statistical Machine Translation — the Basics
An SMT system looks at how often a certain source-language term (which can be a word or a group of words) gets translated into a certain target-language term. After looking at all these replacements, it calculates the probability of one term being translated into another. For single words, this still produces a lot of errors, since the system will just choose the most likely translation — every time. So every “her” would be translated into “his”, if that is the most common translation in the corpus.
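As a toy illustration of that statistic (the counts here are invented, not real corpus data), estimating the probability of a target word given a source word might look like this:

```python
from collections import Counter, defaultdict

# Count how often each source word is replaced by each target word
# in the aligned corpus, then turn the counts into probabilities.
aligned_words = [("her", "his"), ("her", "his"), ("her", "him"), ("she", "he")]

counts = defaultdict(Counter)
for source, target in aligned_words:
    counts[source][target] += 1

def translation_probability(source, target):
    return counts[source][target] / sum(counts[source].values())

# "his" wins with probability 2/3, so a word-level system would
# pick it every single time and never produce "him".
print(translation_probability("her", "his"))
```

A word-level model only keeps the winner of this count, which is exactly why every “her” comes out as “his”.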
So I looked into giving the system more context. A feature of machine translation systems is that they are able to look at things called n-grams. What’s an n-gram? It’s simply a group of words, with n being the number of words per group. This is easiest shown by example:
Sentence: “She took her lunch to school” → “He took his lunch to school”
n-grams of 3 (also called trigrams):
— “She” “took” “her” → “He” “took” “his”
— “took” “her” “lunch” → “took” “his” “lunch”
— “her” “lunch” “to” → “his” “lunch” “to”
— “lunch” “to” “school” → “lunch” “to” “school”
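Extracting those n-grams is straightforward; a small helper like this (my own sketch, not part of Moses) produces exactly the four trigrams above:

```python
def ngrams(sentence, n=3):
    """Return every group of n consecutive words in the sentence."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

for gram in ngrams("She took her lunch to school"):
    print(gram)
```

A six-word sentence yields four trigrams, so even short texts multiply into plenty of training material.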
Woah, that is much more data to work with! And the SMT system has a lot more context: it does not only see what “her” gets translated into, it also sees the surrounding words, which are likely to appear in other sentences that the system will need to translate later. Thus, the system assigns a high probability to “her” → “his” translations when “her” is preceded by “took” or followed by “lunch”.
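To make the effect concrete, here is a deliberately tiny phrase table keyed on trigrams (the entries are invented for illustration): the very same word “her” gets a different translation depending on its neighbours, which a single-word table could never achieve.

```python
# Trigram-level translations: the surrounding words decide
# between the possessive "his" and the object pronoun "him".
phrase_table = {
    ("took", "her", "lunch"): ("took", "his", "lunch"),
    ("gave", "her", "a"): ("gave", "him", "a"),
}

print(phrase_table[("took", "her", "lunch")])  # possessive reading
print(phrase_table[("gave", "her", "a")])      # object-pronoun reading
```

A real SMT system stores probabilities over many overlapping phrases rather than a single deterministic mapping, but the disambiguation principle is the same.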
And this is quite close to how humans do it. We contextualize and have rules about the semantic structure of our sentences and word groups, from which we infer the correct substitutions. Our brains have built such a large and quickly accessible ruleset that we can easily translate this without even thinking. The computer is not so lucky: its memory is significantly smaller than a human’s, and it needs to check all of its rules for every sentence it translates. What we learn over many years of speaking a language, a computer has to learn in a much shorter time. With all the computations needed to calculate the probabilities, it can easily take multiple hours to generate a model for this kind of translation.
I set out to build a system translating words like “her”, “him”, “she” or “herself”. What happened was that I got a system that would learn more than I expected, in a good way.
It learned to translate “lady” to “gentleman”, since those words appeared in a few template pairs. So gender-specific substitutions are also learned, not just pronouns.
It learned not to translate words in quotation marks. A sentence like “He learned the words ‘he’, ‘his’ and ‘him’ today!” would surprisingly translate into “She learned the words ‘he’, ‘his’ and ‘him’ today!”, leaving the quoted words intact. This is where the probability approach becomes visible, since an unquoted “his” has a much higher probability of being translated into “her” than the quoted “‘his’”. Amazing!
At this point, I regularly update the system with all of FeedbackPanda’s user templates to increase the corpus size. With over 20,000 templates in the database, the system is getting better and better at translating from male to female and the other way around. It’s growing quite large, though. It takes about 8 hours to train the models for both translation directions (so 4 hours each way), and the resulting system needs over 5 gigabytes of memory just to load the model. That would equal 2.5 million pages of printed translation rules.
And it’s still not perfect. Just like with Google Translate, the FeedbackPanda system will produce wrong translations and translate things in a way that you wouldn’t necessarily translate them yourself. But it will improve with every template added to the FeedbackPanda platform.
I’ll be tweaking the system in the future, and I need your input — provided you’re a user of the FeedbackPanda platform. Please reach out to me at firstname.lastname@example.org if you get a wrong translation. Please provide both your original version and the translation from FeedbackPanda. That will help me quite a bit!
If you’re just interested in an SMT system that translates male texts to female and back, feel free to reach out to me as well.