Ted Gibson, MIT Dept of Brain & Cognitive Sciences
Information processing and cross-linguistic universals
Finding explanations for the observed variation in human languages is the primary goal of linguistics, and promises to shed light on the nature of human cognition. One particularly attractive set of explanations is functional in nature, holding that language universals are grounded in the known properties of human information processing. The idea is that grammars of languages have evolved so that language users can communicate using sentences that are relatively easy to produce and comprehend. In this talk, I summarize results from explorations in two linguistic domains, from an information-processing point of view.
First, I consider communication-based origins of lexicons of human languages. Chomsky has famously argued that this is a flawed hypothesis, because of the existence of such phenomena as ambiguity. Contrary to Chomsky, we show that ambiguity out of context is not a problem for an information-theoretic approach to language; it is a feature. Furthermore, word lengths are optimized on average according to predictability in context, as would be expected under an information-theoretic analysis. We then apply this simple information-theoretic idea to a well-studied semantic domain: words for colors. Second, I show that all the world’s languages that we can currently analyze minimize syntactic dependency lengths to some degree, as would be expected under information-processing considerations.
Processing non-canonical data: challenges & opportunities
Successful Natural Language Processing (NLP) depends on large amounts
of annotated training data that is abundant, completely labeled and
preferably canonical. However, such data is only available to a
limited degree. For example, for parsing, annotated treebank data is
only available for a limited set of languages and domains.
In this talk, I review the notion of canonicity, and how it shapes our
community’s approach to language. I argue for leveraging what I call
fortuitous data, i.e., non-obvious data that has hitherto been neglected,
hidden in plain sight, or raw data that needs to be refined, and
present ongoing work on using fortuitous data. If we embrace the
variety of this heterogeneous data by combining it with proper
algorithms, we will not only produce more robust models, but will also
enable adaptive language technology capable of addressing natural