Notes on Bi-Phone: Modeling Inter Language Phonetic Influences in Text
This is a summary of an important research paper, offering roughly a 14:1 time savings. It was crafted by humans working with several AIs. The goal is to save time and curate good ideas.

Link to paper: https://arxiv.org/abs/2307.03322
Paper published on: 2023-07-06
Paper's authors: Abhirut Gupta, Ananya B. Sai, Richard Sproat, Yuri Vasilevski, James S. Ren, Ambarish Jash, Sukhdeep S. Sodhi, Aravindan Raghuveer
GPT3 API Cost: $0.03
GPT4 API Cost: $0.10
Total Cost To Write This: $0.13
Time Savings: 14:1
The TLDR:
- People who use the internet in a language they are not fluent in often make spelling mistakes driven by how words sound.
- The researchers built a model called Bi-Phone that generates plausible synthetic spelling mistakes based on the sounds of words.
- They also built a benchmark called FunGLUE to test how well language models understand these synthetic spelling mistakes.
- The researchers found that phonetic spelling mistakes are common on the web and that their model can generate realistic ones.
- They tested two language models (mT5 and ByT5) and found that both struggled with the synthetic spelling mistakes.
- The researchers propose a new way to pre-train language models so they handle these spelling mistakes better.
- In the future, they want to improve the model and extend it to more languages.
The Deeper Dive:
Understanding Phonetic Corruptions in Web Language Use: A Dive into Bi-Phone and FunGLUE
This research paper explores the phonetic corruptions that arise when individuals use the web in a language they have low literacy in, a common situation created by technology asymmetries. The authors introduce a generative model, Bi-Phone, that creates synthetic spelling corruptions based on phoneme confusions mined between a user's native language (L1) and second language (L2). They also present a new benchmark, FunGLUE, to further research into phonetically robust language models.
The Problem of Phonetic Corruptions
The authors focus on two main problems: estimating the likelihood of phoneme shifts and creating a model for sampling phonetic misspellings. To understand the likelihood of phoneme shifts, they analyze the frequency of each possible sound-shift in a corpus. This is a significant step as it provides a data-backed understanding of how phonetic shifts occur between different languages.
The second problem, creating a model for sampling phonetic misspellings, is addressed by representing it as a probability distribution. This model uses a combination of a phoneme-phoneme error model and a phoneme-grapheme density model to generate misspellings.
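The two-part factorization above can be sketched in a few lines. This is a toy illustration with made-up probabilities and variable names, not the authors' implementation: a misspelling is scored as the product of phoneme-phoneme confusion probabilities and phoneme-grapheme emission probabilities at each position.

```python
# Toy sketch of the two-part scoring: a phoneme-phoneme error model times a
# phoneme-grapheme density model. All probabilities below are illustrative,
# not values mined by the authors.

# P(corrupt phoneme | clean phoneme), conditioned on the user's L1
PHONEME_ERROR = {
    ("IH", "IY"): 0.3,  # e.g. "ship" shifting toward "sheep"
    ("IH", "IH"): 0.7,  # identity: no shift
    ("P", "P"): 1.0,
}

# P(grapheme | phoneme): how a (possibly shifted) sound gets spelled
PHONEME_GRAPHEME = {
    ("IY", "ee"): 0.6,
    ("IY", "ea"): 0.3,
    ("IH", "i"): 0.9,
    ("P", "p"): 1.0,
}

def misspelling_prob(clean, corrupt, graphemes):
    """Score one misspelling as the product of per-position corruption and
    emission probabilities, assuming independent per-position corruptions."""
    prob = 1.0
    for p, q, g in zip(clean, corrupt, graphemes):
        prob *= PHONEME_ERROR.get((p, q), 0.0)
        prob *= PHONEME_GRAPHEME.get((q, g), 0.0)
    return prob

# "pip" /P IH P/ shifted to /P IY P/ and spelled "peep": 0.3 * 0.6 = 0.18
print(misspelling_prob(["P", "IH", "P"], ["P", "IY", "P"], ["p", "ee", "p"]))
```

Note the independence assumption baked into the loop: each position is corrupted on its own, which is exactly the simplification the authors flag for relaxation in future work.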
The Bi-Phone Model
The Bi-Phone model is the centerpiece of this research. It uses a phoneme confusion matrix to estimate the likelihood of generating a corrupted phoneme sequence. This is the phoneme-phoneme error model. It then uses a pronunciation dictionary to convert phoneme sequences to graphemes, which are the smallest units of written language. This is the phoneme-grapheme density model.
The model samples misspellings by greedily picking the top candidates at each position. This means it chooses the most likely misspelling at each step. The generated misspellings are then evaluated for plausibility by native speakers of different languages.
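The greedy sampling step can be sketched as follows. Function names, the confusion table, and the beam width are all illustrative assumptions; the idea is simply to keep only the highest-probability corrupted sequences at each phoneme position.

```python
import heapq

# Toy L1-conditioned phoneme confusion probabilities: P(corrupt | clean).
# Illustrative values only, not mined from the paper's corpus.
ERROR_MODEL = {
    ("IH", "IY"): 0.3,  # e.g. "ship" pronounced like "sheep"
    ("IH", "IH"): 0.7,  # identity: phoneme left unchanged
    ("P", "P"): 1.0,
}

def greedy_sample(phonemes, error_model, beam=3):
    """At each position, extend every candidate with all plausible phoneme
    substitutions and greedily keep only the `beam` highest-probability
    sequences (a sketch of picking the top candidates at each step)."""
    candidates = [(1.0, [])]
    for p in phonemes:
        extended = []
        for prob, seq in candidates:
            for (src, dst), e in error_model.items():
                if src == p:
                    extended.append((prob * e, seq + [dst]))
        candidates = heapq.nlargest(beam, extended, key=lambda c: c[0])
    return candidates

# Corrupt "pip" /P IH P/: the identity sequence stays most likely,
# followed by the shifted /P IY P/ variant.
for prob, seq in greedy_sample(["P", "IH", "P"], ERROR_MODEL, beam=2):
    print(round(prob, 2), seq)
```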
The Prevalence of Phonetic Misspellings
The authors use a large web corpus to analyze the prevalence of phonetic misspellings. They evaluate the coverage and precision of the model's misspellings at different confidence thresholds. The analysis shows that phonetic misspellings are prevalent in web data and can be accurately generated by the model.
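The coverage/precision trade-off at a confidence threshold can be sketched like this. The metric definitions and all data below are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
def coverage_precision(scored_misspellings, corpus_misspellings, threshold):
    """Evaluate generated misspellings against ones observed in a web corpus.
    Coverage: fraction of corpus misspellings the model reproduces above the
    threshold. Precision: fraction of kept generations actually observed."""
    kept = {w for w, score in scored_misspellings.items() if score >= threshold}
    hits = kept & corpus_misspellings
    coverage = len(hits) / len(corpus_misspellings)
    precision = len(hits) / len(kept) if kept else 0.0
    return coverage, precision

# Raising the threshold trades coverage for precision (made-up data).
generated = {"sheep": 0.18, "shep": 0.05, "zhip": 0.01}
observed = {"sheep", "shap"}
print(coverage_precision(generated, observed, threshold=0.04))  # (0.5, 0.5)
print(coverage_precision(generated, observed, threshold=0.10))  # (0.5, 1.0)
```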
FunGLUE: A Benchmark for Phonetic Robustness
FunGLUE is a benchmark introduced to test the robustness of language understanding models to inter-language phonetic spelling variations. It randomly selects words from tasks in the SuperGLUE benchmark and corrupts them with Bi-Phone based misspellings. The training set in FunGLUE is left clean to mimic real-world scenarios where noisy training data is difficult to obtain.
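The benchmark construction can be sketched as randomly swapping selected words for their phonetic misspellings. The misspelling table, corruption rate, and seeding below are hypothetical; the paper samples misspellings from Bi-Phone over SuperGLUE examples.

```python
import random

def corrupt_example(text, misspelling_table, rate=0.3, seed=0):
    """Replace a random subset of words with a phonetic misspelling, mimicking
    how FunGLUE corrupts evaluation examples. Only validation/test data is
    corrupted; the training set stays clean."""
    rng = random.Random(seed)  # seeded for reproducible corruption
    out = []
    for word in text.split():
        candidates = misspelling_table.get(word.lower())
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

# Hypothetical Bi-Phone outputs for two words
table = {"ship": ["sheep"], "this": ["dis"]}
print(corrupt_example("this ship sails", table, rate=1.0))  # "dis sheep sails"
```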
The authors test state-of-the-art models, mT5 and ByT5, on FunGLUE and find that they show a drop in performance compared to SuperGLUE. This indicates that the phonetic misspellings introduced in FunGLUE pose a significant challenge for current models and training schemes.
Pre-training for Phonetic Robustness
The researchers propose a novel pre-training task of phoneme prediction to handle phonetic noise in inputs. The task of predicting phoneme sequences teaches the model "phonetic information" and helps embed similar sounds and words close together.
The models are trained on this mixture task and then fine-tuned on the standard clean SuperGLUE training set. The phoneme prediction data is created using the Common Crawl English data and an off-the-shelf Grapheme to Phoneme model.
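Building the phoneme-prediction pairs can be sketched as follows. The paper runs an off-the-shelf Grapheme-to-Phoneme model over Common Crawl English text; the tiny hand-written ARPAbet table here stands in for both, so treat every entry as an illustrative assumption.

```python
# Hypothetical stand-in for an off-the-shelf G2P model (ARPAbet-style symbols)
G2P = {
    "the": "DH AH",
    "ship": "SH IH P",
    "sails": "S EY L Z",
}

def make_phoneme_pair(sentence):
    """Return an (input, target) pair: raw text and its phoneme sequence.
    Training mixes this task with the standard pre-training objective so the
    model embeds similar-sounding words and spellings close together."""
    target = " ".join(G2P.get(w, w) for w in sentence.lower().split())
    return sentence, target

print(make_phoneme_pair("the ship sails"))
# → ('the ship sails', 'DH AH SH IH P S EY L Z')
```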
The phonetically pre-trained ByT5 model shows improved performance on the FunGLUE benchmark compared to the vanilla pre-trained model. The mT5 model does not show the same gains, possibly because its sub-word tokenizer fragments misspelled words differently from their clean forms.
Future Directions
The authors highlight several areas for future exploration. The current approach assumes independent phoneme/grapheme corruptions, but this can be relaxed to model contextual phonetic shifts. The coverage analysis is conservative and does not cover user-generated data from social media. The work is also difficult to extend to low-resource languages without appropriate datasets and models for transliteration.
In conclusion, this research provides a valuable framework for understanding and modeling phonetic corruptions in web language use. By introducing the Bi-Phone model and the FunGLUE benchmark, it opens up new avenues for making natural language understanding models more robust to L1-L2 phonetic shifts.
