Moscow NLP Team Projects
AI Dubbing
Voice Preserving Speech-to-Speech Translation with Lip-Sync

Prosody transfer is well studied in the context of expressive speech synthesis. Cross-lingual prosody transfer, however, is challenging and has been under-explored to date. This task naturally arises when one needs to translate an audio or video recording with expressive speech while preserving the original voice and emotions. We study and propose novel solutions for learning prosody representations that are transferable across languages and speakers for machine dubbing of expressive multimedia content. Such content often includes field recordings and thus requires prosody transfer from noisy audio. Our goal is a system that generates speech with context-matching prosody at the same quality as a human voiceover artist.

Code search and clone detection
Align code snippets in different programming languages

We consider the problem of finding code snippets that operate identically but are written in different programming languages. We present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages. CCT aligns code snippets written in different programming languages; the resulting CCT-LM model sets new state-of-the-art results on POJ-104 and on XCD, our new dataset.
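To give a flavor of what cross-lingual alignment of code representations looks like, the sketch below trains a shared encoder with a symmetric contrastive (InfoNCE-style) objective over parallel snippet pairs. This is a minimal illustration, not the actual CCT objective; the checkpoint name, temperature, and function names are placeholders.

```python
# Minimal sketch of contrastive alignment over parallel code snippets.
# NOT the actual CCT procedure; checkpoint and hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(snippets):
    batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch).last_hidden_state[:, 0]      # [CLS] embeddings
    return F.normalize(out, dim=-1)

def alignment_loss(python_snippets, cpp_snippets, temperature=0.05):
    """Symmetric InfoNCE: snippet i in one language matches snippet i in the other."""
    a, b = embed(python_snippets), embed(cpp_snippets)
    logits = a @ b.T / temperature
    labels = torch.arange(len(python_snippets))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

At retrieval time the same encoder embeds a query snippet and candidate snippets in another language, and nearest neighbors in the shared embedding space are returned as clones.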

Grammatical error correction
An open-vocabulary iterative non-autoregressive Grammatical error correction (GEC) model

Grammatical error correction (GEC) is an important NLP task that is currently usually solved with autoregressive sequence-to-sequence models. However, approaches of this class are inherently slow due to one-by-one token generation, so non-autoregressive alternatives are needed. In this work, we propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network, which outputs a self-attention weight matrix that can be used in beam search to find the best permutation of input tokens (with auxiliary ⟨ins⟩ tokens), and a decoder network based on a step-unrolled denoising autoencoder that fills in specific tokens. This allows us to find the token permutation after only one forward pass of the permutation network, avoiding autoregressive constructions. We show that the resulting network improves over previously known non-autoregressive methods for GEC and reaches the level of autoregressive methods that do not use language-specific synthetic data generation. Our results are supported by comprehensive experimental validation on the CoNLL-2014 and Write&Improve+LOCNESS datasets and by an extensive ablation study that supports our architectural and algorithmic choices.
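The sketch below illustrates only the permutation-search step, assuming the permutation network has already produced a pairwise score matrix where entry (i, j) rates placing token j immediately after token i. The scoring scheme, beam size, and function names are hypothetical, and the step-unrolled decoder that fills in tokens is omitted.

```python
# Hedged sketch: beam search over token orderings given a pairwise score matrix
# (assumed to come from a single forward pass of the permutation network).
import numpy as np

def beam_search_permutation(scores: np.ndarray, beam_size: int = 4):
    """Find a high-scoring ordering of all token positions with standard beam search."""
    n = scores.shape[0]
    # Each hypothesis: (total score, ordering so far, set of used indices).
    beams = [(0.0, [i], {i}) for i in range(n)]
    beams = sorted(beams, key=lambda c: c[0], reverse=True)[:beam_size]
    for _ in range(n - 1):
        candidates = []
        for score, order, used in beams:
            last = order[-1]
            for j in range(n):
                if j not in used:
                    candidates.append((score + scores[last, j], order + [j], used | {j}))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]  # best-scoring permutation of token indices

# Toy usage with random scores standing in for the permutation network's output.
print(beam_search_permutation(np.random.rand(5, 5)))
```

Because the score matrix is produced in one forward pass, the only sequential work is this lightweight search over orderings rather than token-by-token generation.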

Multimodal ranking and retrieval
A novel method for postprocessing ranking results

A recent trend in multimodal retrieval is related to postprocessing test set results via the dual-softmax loss (DSL). While this approach can bring significant improvements, it usually presumes that an entire matrix of test samples is available as DSL input. This work introduces a new postprocessing approach based on Sinkhorn transformations that outperforms DSL. Further, we propose a new postprocessing setting that does not require access to multiple test queries. We show that our approach can significantly improve the results of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus achieving a new state of the art on several standard text-video retrieval datasets both with access to the entire test set and in the single-query setting.
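As a rough illustration of the full-matrix setting, the sketch below applies a Sinkhorn-style transformation, alternating row and column normalization of the exponentiated text-video similarity matrix, before ranking. The temperature and iteration count are illustrative placeholders, not the values used in our experiments, and the single-query variant is not shown.

```python
# Hedged sketch: Sinkhorn-style postprocessing of a text-video similarity matrix.
import numpy as np

def sinkhorn_postprocess(sim: np.ndarray, tau: float = 0.02, n_iter: int = 3):
    """sim: [num_queries, num_videos] raw similarities (e.g. from CLIP4Clip)."""
    m = np.exp(sim / tau)
    for _ in range(n_iter):
        m = m / m.sum(axis=1, keepdims=True)   # normalize over videos for each query
        m = m / m.sum(axis=0, keepdims=True)   # normalize over queries for each video
    return m

# Ranking: for each text query, sort videos by the postprocessed scores.
sim = np.random.randn(5, 100)
ranks = np.argsort(-sinkhorn_postprocess(sim), axis=1)
```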

Topological Data Analysis
We have applied TDA to speech processing, analyzing the HuBERT Transformer and extracting highly informative features from its attention maps

TDA derives highly informative features from topological invariants. We apply topological data analysis (TDA) to speech classification problems and to the introspection of a pretrained speech model, HuBERT. To this end, we introduce a number of topological and algebraic features derived from Transformer attention maps and embeddings. We show that a simple linear classifier built on top of such features outperforms a fine-tuned classification head. In particular, we achieve improvements in accuracy and EER on four common datasets, and on CREMA-D the proposed feature set reaches a new state-of-the-art accuracy. We also show that topological features can reveal functional roles of speech Transformer heads; e.g., we find heads capable of distinguishing between pairs of sample sources (natural/synthetic) or voices without any downstream fine-tuning. Our results demonstrate that TDA is a promising new approach for speech analysis, especially for tasks that require structural prediction.
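The sketch below shows one simple family of such features: threshold a single head's attention map into an undirected graph and record basic topological invariants (connected components, edges, independent cycles) at several thresholds. The thresholds and the exact feature set are illustrative; the full method uses a richer collection of topological and algebraic features, with attention maps obtained, for example, from transformers' HubertModel with output_attentions=True. Features from all heads and layers are then concatenated and fed to a linear classifier such as logistic regression.

```python
# Hedged sketch: graph-based topological features from one attention head.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def attention_topology_features(attn: np.ndarray, thresholds=(0.01, 0.05, 0.1)):
    """attn: [seq_len, seq_len] attention weights of a single head."""
    sym = np.maximum(attn, attn.T)              # symmetrize into an undirected graph
    feats = []
    for t in thresholds:
        adj = (sym >= t).astype(np.int8)
        np.fill_diagonal(adj, 0)
        n_vertices = adj.shape[0]
        n_edges = int(adj.sum()) // 2
        n_components, _ = connected_components(csr_matrix(adj), directed=False)
        n_cycles = n_edges - n_vertices + n_components   # first Betti number of the graph
        feats += [n_components, n_edges, n_cycles]
    return np.array(feats, dtype=np.float32)
```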