Interpretability

In FoTran, we are interested in analyzing neural language and translation models to understand their inner workings: how they pick up information and how that information is stored in the network. The distributed nature of neural models makes it difficult to disentangle the information flow, and the interpretability and explainability of predictions and generated output remain challenging tasks. In general, we can distinguish between intrinsic and extrinsic studies that may be hypothesis-driven or data-driven/explorative in nature. Sub-project 2 focuses on intrinsic analyses of network parameters, whereas sub-project 3 looks at analyses by means of downstream and probing tasks. Here are some highlights of our research.

Language embeddings

Fully-shared translation models with language labels are interesting as they learn language embeddings in some high-dimensional space from raw data. Similar to the initial character-level language model, the language spaces that emerge from training encoder-decoder models for translation show reasonable clusters, as we can see in the following t-SNE plot from a model trained on Bible translations covering over 900 languages.
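
As an illustration, this kind of plot can be produced with a few lines of code. The sketch below assumes the model's language-label embedding table has been exported to a NumPy array; the file names are hypothetical.

```python
# Minimal sketch: visualize learned language embeddings with t-SNE.
# Assumes the language-label embedding table has been exported to
# language_embeddings.npy (n_languages x d) with matching ISO codes in
# language_codes.txt -- both file names are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

lang_embeddings = np.load("language_embeddings.npy")
lang_codes = open("language_codes.txt").read().split()

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(lang_embeddings)

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), code in zip(coords, lang_codes):
    plt.annotate(code, (x, y), fontsize=6)
plt.title("t-SNE of language embeddings")
plt.savefig("language_tsne.png", dpi=200)
```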

Bjerva et al. (2019) and Östling & Kurfali (2023) look in more detail at what such language embeddings really represent.

Analyzing transformers

We were the first to analyze transformer parameters, looking at the contextualized embeddings learnt from translation data and the attention patterns that emerge.

We observed that attention is often very sparse, producing highly regular patterns such as the ones shown above. These insights led to the possibility of replacing trainable attention heads with fixed patterns, effectively reducing model size and supporting low-resource scenarios with a valuable prior.
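
For illustration, a fixed attention head can be as simple as a hard-coded distribution over positions. The sketch below builds one-hot patterns that attend to the current, previous and next token; this is an assumption for illustration, not the exact configuration used in our experiments.

```python
# Illustrative sketch: fixed, non-trainable attention patterns that can stand
# in for softmax(QK^T / sqrt(d)) in selected heads.
import torch

def fixed_pattern(seq_len: int, offset: int) -> torch.Tensor:
    """One-hot attention matrix where query position q attends to q + offset."""
    attn = torch.zeros(seq_len, seq_len)
    for q in range(seq_len):
        k = min(max(q + offset, 0), seq_len - 1)  # clamp at sentence borders
        attn[q, k] = 1.0
    return attn

patterns = {name: fixed_pattern(6, off)
            for name, off in [("current", 0), ("previous", -1), ("next", 1)]}
```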

Comparing language and translation models

Pre-trained neural language models appeared shortly after the start of the FoTran project and changed the NLP landscape. Turning our attention to a comparison between language and translation models was a natural adjustment. Masked language models and translation model encoders show surprising differences in the representations they learn, and we looked in particular at the contextualization of the embeddings, using measures of self-similarity (SelfSim) and intra-sentential similarity (IntraSim); see Ethayarajh (2019) for more details.
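
Both measures are straightforward to compute. The sketch below follows the definitions in Ethayarajh (2019), with input tensors as described in the comments.

```python
# Sketch of the two contextualization measures (definitions after Ethayarajh, 2019).
import torch
import torch.nn.functional as F

def self_sim(word_reps: torch.Tensor) -> float:
    """SelfSim: mean pairwise cosine similarity of one word type's vectors
    collected from n different sentence contexts; word_reps has shape (n, d)."""
    normed = F.normalize(word_reps, dim=-1)
    sims = normed @ normed.T
    n = word_reps.size(0)
    off_diag = sims.sum() - sims.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()

def intra_sim(sentence_reps: torch.Tensor) -> float:
    """IntraSim: mean cosine similarity between each token vector and the
    average sentence vector; sentence_reps has shape (tokens, d)."""
    mean_vec = sentence_reps.mean(dim=0, keepdim=True)
    return F.cosine_similarity(sentence_reps, mean_vec, dim=-1).mean().item()
```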

However, turning a pre-trained masked language model into a translation encoder is possible through fine-tuning and alignment. Representational Similarity Analysis (RSA) and Projection-Weighted Canonical Correlation Analysis (PWCCA) show the effect of such a transformation.
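
As a rough illustration, RSA boils down to correlating the pairwise similarity structure of two representation spaces over the same sentences. A minimal sketch is given below; PWCCA requires more machinery and is omitted here.

```python
# Minimal RSA sketch: correlate the pairwise dissimilarity structure of two
# row-aligned representation matrices (same sentences, different models).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def rsa(reps_a: np.ndarray, reps_b: np.ndarray) -> float:
    dissim_a = pdist(reps_a, metric="cosine")  # condensed upper triangle
    dissim_b = pdist(reps_b, metric="cosine")
    return pearsonr(dissim_a, dissim_b)[0]
```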

Training dynamics

Another interesting observation concerns the differences in training dynamics between translation models and language models of different kinds. We measured loss change allocation (LCA) in encoder-decoder models and compared them with masked and causal language models.
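
To a first-order approximation, LCA decomposes the loss change at each optimizer step into per-parameter contributions (gradient times parameter update), which can then be summed per layer or block. The helper below is a hypothetical sketch of that idea only; the original method uses a more accurate path-integral approximation.

```python
# First-order sketch of loss change allocation (LCA): attribute the loss change
# of one update step to parameter groups via grad * delta_theta.
# model, loss_fn and batch are hypothetical placeholders.
import torch
from collections import defaultdict

def lca_step(model, loss_fn, batch, optimizer):
    params = {n: p for n, p in model.named_parameters() if p.requires_grad}
    before = {n: p.detach().clone() for n, p in params.items()}

    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    grads = {n: p.grad.detach().clone()
             for n, p in params.items() if p.grad is not None}
    optimizer.step()

    allocation = defaultdict(float)
    for name, grad in grads.items():
        delta = params[name].detach() - before[name]
        # a negative grad * delta means this parameter group helped lower the loss
        allocation[name.split(".")[0]] += (grad * delta).sum().item()
    return allocation
```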

Interesting differences appear depending on the training objective applied. Middle layers in masked language models contribute less to the optimization of the loss function, whereas translation models are more balanced. Causal language models are somewhere in between and reflect quite well the behavior of the translation model decoder, which is natural as both represent generative components. The overall contribution of the feed-forward block is also remarkable, demonstrating that a lot of adjustment needs to happen in that component, especially in the upper layers.

Looking at the loss change allocation over time, we can also see the effect of strong activity in the upper layers. Naturally, the biggest changes happen at the beginning, after some warmup period. Masked language models seem to oscillate for a slightly longer time, but this may also be a scaling effect.

Aggregating LCA over individual attention heads is also interesting. In general, change allocations are quite evenly distributed, but it also looks like certain heads do the heavy lifting in the active layers.
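
Summing the per-parameter allocations over heads only requires knowing how the projection matrices are split. A small sketch, assuming a standard multi-head layout in which the output dimension of the query/key/value projections is divided evenly across heads:

```python
# Sketch: aggregate per-parameter LCA values of a (d_model, d_model) projection
# weight into one number per attention head (assumes heads split the output
# dimension evenly, as in standard multi-head attention).
import torch

def lca_per_head(lca_weight: torch.Tensor, num_heads: int) -> torch.Tensor:
    head_dim = lca_weight.size(0) // num_heads
    return lca_weight.view(num_heads, head_dim, -1).sum(dim=(1, 2))
```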

These insights suggest that it can be useful to rethink model architectures and training schedules. LCA provides valuable information that could lead to more efficient model designs and training procedures.

Disentangling representations

One of the most interesting questions in interpretability studies is to find out what kind of information is encoded in neural language models, and where. Can we identify the components that represent specific attributes or types of linguistic knowledge? We applied a de-biasing technique based on iterative nullspace projection, which tries to find a projection that removes from pre-trained embeddings the information a linear classifier could use to detect a given property (see Ravfogel et al., 2020 for more information).
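
In the spirit of Ravfogel et al. (2020), the procedure can be sketched as repeatedly fitting a linear probe for the property and projecting the embeddings onto the nullspace of its weights. The code below is a simplified sketch, not the exact published implementation.

```python
# Simplified sketch of iterative nullspace projection (INLP).
import numpy as np
from sklearn.linear_model import LogisticRegression

def inlp(X: np.ndarray, y: np.ndarray, n_iterations: int = 10) -> np.ndarray:
    """X: (samples, d) embeddings, y: property labels. Returns a (d, d)
    projection that removes linearly decodable information about y."""
    d = X.shape[1]
    P = np.eye(d)
    X_proj = X.copy()
    for _ in range(n_iterations):
        clf = LogisticRegression(max_iter=1000).fit(X_proj, y)
        W = clf.coef_                               # (n_classes or 1, d)
        P_null = np.eye(d) - np.linalg.pinv(W) @ W  # project out W's rowspace
        P = P_null @ P
        X_proj = X @ P.T
    return P  # apply as X @ P.T
```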

We conducted an experiment to study the imprint of passivization and negation on contextualized embeddings (in English), comparing masked language models and translation encoders. Looking at the plots above, we can see that translation encoders clearly mark passivization information at verbs, subjects and objects, whereas BERT seems to put it on verbs only. This may indicate that the voice feature is more important across positions in translation tasks than it would be in masked word prediction. Changing or adding target languages may affect the situation, but our limited experiment did not clearly show any patterns. Negation, however, seems to be mainly attached to verbs in both cases, masked language modeling and translation.

The nullspace projection procedure seems to be very effective in removing those specific properties, as we can see in the plots above. Across all layers, the distinction between the contrastive alternatives is essentially gone, demonstrating the possibility of disentangling the contextual representations with a simple projection technique. However, the projection matrix does not port easily across datasets, which shows the remaining difficulties in finding principled ways of extracting specific linguistic information from the complex knowledge structures incorporated in deep neural networks.

References