RusNLP

RusNLP is a bilingual semantic search engine for papers presented at Russian NLP conferences such as Dialogue, AIST and AINL.

We crawled all papers from these venues, starting from 2001, and carefully extracted author names and affiliations. You can search by query words or by a target paper and get a list of relevant papers, regardless of the publication language. This makes it possible to include in the search a large number of papers from the entire history of Dialogue, which was mainly a Russian-language conference for a long time, as well as some papers from the early years of AIST, which accepted publications in Russian until 2014.

Usage examples

You can use it in the same way as Google Scholar or ArXiv Sanity Preserver to:

  • discover academic knowledge you were not aware of;

  • identify "gaps" in Russian NLP, where we still lack knowledge;

  • analyze academic communities and publishing patterns.

Questions that RusNLP can answer:

  • What papers by members of the Russian NLP community are there on syntactic parsing?

  • I know this paper; what similar papers are there in Russian NLP?

  • What was published in 2008 by NLP scholars from Moscow State University?

  • Were there any papers about paraphrase detection at the AINL conference in 2015?

  • What were Russian publications about in 2019?

Current dataset stats:

  • Total papers: 2065

  • Unique authors: 1683

  • Unique affiliations: 393

Download the dataset as an SQLite database
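The layout of the downloadable database is not described on this page, so the sketch below does not assume any table or column names. It only shows one way to inspect an SQLite file with Python's standard `sqlite3` module; the helper names `list_tables` and `count_rows` and the filename `rusnlp.db` are hypothetical.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def count_rows(conn, table):
    """Count the rows of a table whose name came from list_tables()."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]


# Usage (the filename is a placeholder for wherever you saved the download):
# conn = sqlite3.connect("rusnlp.db")
# for table in list_tables(conn):
#     print(table, count_rows(conn, table))
```

Listing the tables first is a reasonable starting point when a schema is undocumented; the real table names then tell you what to query.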

More details about how the search works

Our search engine is based on distributional semantics, the idea that words with similar meanings occur in similar contexts. First, we averaged the vectors of all word forms in each text to obtain paper vectors. Because we used cross-lingual embeddings, in which Russian and English word forms are located in the same vector space*, the nearest neighbors of a paper (the papers with the most similar meaning) can include texts written in the other language. The Similarity column shows the cosine similarity between the vector of the query word or paper and the vector of each paper found. In addition, we compiled a list of 24 main NLP tasks, each with keywords indicating that a paper belongs to the task. The Tasks column in the search results table displays the estimated tasks for each paper, which are also selected using cosine similarity.

*As cross-lingual embeddings we used MUSE: fastText models pre-trained on Wikipedia by Facebook.
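The steps described above (averaging word vectors into a text vector, ranking papers by cosine similarity, and picking a task by cosine similarity to its keywords) can be illustrated with a minimal sketch. It is not the project's actual code: the 3-dimensional vectors are made-up toy values rather than real MUSE embeddings, and the paper titles, task keywords and function names are hypothetical.

```python
import math

# Toy cross-lingual embeddings (invented values, NOT real MUSE vectors).
# Russian and English word forms live in the same vector space.
EMBEDDINGS = {
    "parsing": [0.9, 0.1, 0.0],
    "парсинг": [0.8, 0.2, 0.0],
    "syntax": [0.7, 0.3, 0.1],
    "синтаксис": [0.8, 0.1, 0.1],
    "sentiment": [0.0, 0.9, 0.2],
    "тональность": [0.1, 0.8, 0.3],
}

# Hypothetical papers (as token lists) and task keyword lists.
PAPERS = {
    "ru_parsing_paper": ["синтаксис", "парсинг"],
    "en_sentiment_paper": ["sentiment"],
}
TASKS = {
    "Syntactic parsing": ["parsing", "syntax"],
    "Sentiment analysis": ["sentiment", "тональность"],
}


def text_vector(tokens):
    """Average the vectors of all known word forms in a text."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        raise ValueError("no known word forms in text")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank(query_tokens, papers):
    """Return (title, similarity) pairs, most similar paper first."""
    q = text_vector(query_tokens)
    scored = [(title, cosine(q, text_vector(toks))) for title, toks in papers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def best_task(tokens, tasks):
    """Pick the task whose averaged keyword vector is closest to the text."""
    vec = text_vector(tokens)
    return max(tasks, key=lambda name: cosine(vec, text_vector(tasks[name])))


# An English query retrieves the Russian-language paper first.
results = rank(["parsing"], PAPERS)
```

In this toy setup the English query "parsing" ranks the Russian-language parsing paper above the English sentiment paper, which is the cross-lingual effect described above. Selecting only the single closest task is a simplification; the real service may assign several tasks per paper.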

Publications about the project

RusNLP is part of a larger project called Analysis of publication activity in Russian computational linguistics.

  1. Amir Bakarov, Andrey Kutuzov and Irina Nikishina. Russian computational linguistics: topical structure in 2007-2017 conference papers // Dialogue-2018

  2. Irina Nikishina, Amir Bakarov and Andrey Kutuzov. RusNLP: Semantic search engine for Russian NLP conference papers // AIST-2018 (Slides)

  3. Irina Nikishina and Andrey Kutuzov. Double-Blind Peer-Reviewing And Inclusiveness In Russian NLP Conferences // AIST-2019 (Slides)

  4. Anna Safaryan, Petr Filchenkov, Weijia Yan, Andrey Kutuzov and Irina Nikishina. Semantic Recommendation System for Bilingual Corpus of Academic Papers // AIST-2020 (Slides)

  5. ...

Project team

Core team (in alphabetical order):

The multilingual search was implemented by students of the Computational Linguistics master's program at the Higher School of Economics, Moscow, as a student project in 2019-2021. In alphabetical order:

  • Anna Safaryan

  • Dmitry Kutsev

  • Petr Filchenkov

  • Weijia Yan

Web service source code


Creative Commons License
RusNLP by https://nlp.rusvectores.org is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License.