About the Nordic Tweet Stream

The Nordic Tweet Stream - what it is, and what it isn't?

The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region. It offers access to very large-scale data covering about a decade, from January 2013 to May 2023, when the social networking platform was still known by its former name, Twitter.

The corpus contains approximately 900 million tokens from over 900,000 user accounts. Altogether, there is material in 73 different languages (22 languages have been tokenized, POS-tagged, and lemmatized), with the largest languages being Swedish (approximately 31%), English (approximately 26%), and Finnish (approximately 13%). Detailed information on the material can be found in the Statistics section.

The NTS builds on the idea that access to very large social media data for research cannot be left solely to social media companies. We need parallel storage of these valuable cultural heritage data so that they are available for researchers. We operate according to the FAIR Data Principles (see Fair). The guiding principles of FAIR aim at making data findable, accessible, interoperable, and reusable (Wilkinson et al., 2016). Currently, the Digital Single Market directive (2019/790) of the European Union makes text and data mining for scientific research purposes possible. It ensures that copies of primary material “may be retained for the purposes of scientific research, including for the verification of research results” (Title II, Article 3.2).

These data are often called born-digital data, meaning they have been user-generated digitally from the beginning. They have been collected for basic research purposes through the (now-discontinued) academic Twitter API, and it is no longer possible to collect a similar dataset for free.

While all the data accessible here come from an open source, using them as primary material in the humanities can be challenging. Researchers may face technical challenges related to data access, processing, enrichment, and use. That is why we have created an easy-to-use graphical interface that gives users full control of the material for basic research.

The objective of the interface is to enable easy access to and distribution of born-digital data for research.

Kindly note that the NTS does not offer access to all material that has been user-generated in the region; it contains only material in which the geolocation properties have been activated. Previous studies have estimated that the share of such material, in general, is low (see Graham et al., 2013).

All digital text corpora have limitations. Here, users are advised to refer to a classic article by the late Matti Rissanen (ICAME Journal 13, 16-19, 1989), in which he proposes three universal problems associated with the use of (historical) corpora. While all three are relevant here, the most important perhaps is the “God's truth fallacy.” Rissanen writes that an “authoritative corpus may easily create the erroneous impression that it gives an accurate reflection of the entire reality of the language it is intended to represent” (1989: 17). Likewise, it is important to keep in mind that the material here is a snapshot of languages in use in the Nordic region.

NTS - For Whom?

The NTS material is multilingual. We envisage that researchers from various fields, such as social sciences and cultural studies as well as sociolinguistics, dialectology, and so on could make use of this material, either as the sole primary data or additional material accompanying structured corpus data.

Please note that this interface is designed for users who want quick first access to the data. More advanced users might want to use the download function to get data for further processing elsewhere.

NTS - What Can You Do With It?

The interface consists of four pages. The first is the Search page, which enables you to search for character strings and phrases in various languages within the material. You can refine your search using several metadata parameters (e.g., limit by date range, location, or network size; multiple parameters can be selected).

After searching, you'll see the outcome with basic frequency information. It includes a visualization tool that maps your search results. The textual results are displayed in a KWIC window.
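For readers unfamiliar with the KWIC (keyword in context) format, the following is a minimal sketch of the idea in Python. It is not the interface's own implementation: the function name kwic, the whitespace tokenization, and the example sentence are all assumptions made for illustration.

```python
def kwic(text, keyword, width=3):
    """Return (left, match, right) rows for each occurrence of keyword."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    rows = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if tok.strip(".,!?;:").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Hypothetical example sentence, not taken from the corpus.
sentence = "The snow fell all night and the snow stayed until May."
for left, match, right in kwic(sentence, "snow"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {match} | {right}")
```

Each row places the search hit in the middle with a few words of context on either side, which is what makes patterns of use easy to scan.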

You can also download the raw results as a CSV or XLS file for further processing. In the download function, you can select the metadata parameters you want to include. For example, if you're only interested in the text, the date the message was sent, and the location from where the messages were sent, you can select Text, Date, and City. This download function is especially useful for expert users who want to use the NTS as a quick entry point to data.
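As a rough illustration of such further processing, the sketch below counts messages per city in a downloaded CSV file using Python's standard library. The column headers Text, Date, and City are assumed to match the parameter names selected in the download function, and the sample rows are invented for the example; the actual file layout may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded file; the header names
# Text, Date, and City are assumptions based on the parameter names.
SAMPLE = """Text,Date,City
"Hej, vad händer?",2019-05-01,Stockholm
Good morning!,2019-05-01,Helsinki
"Moi, mitä kuuluu?",2019-05-02,Helsinki
"""


def count_by_city(csv_file):
    """Count messages per value of the City column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["City"] for row in reader)


counts = count_by_city(io.StringIO(SAMPLE))
for city, n in counts.most_common():
    print(city, n)
```

With a real download, the same function could be given a file opened with open(path, newline="", encoding="utf-8"), where path points to the saved CSV file.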

We have used spaCy (v3.8.0) for tokenization, POS tagging, and lemmatization in the following languages (see spaCy):

English: en_core_web_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 99%
Lemmatization accuracy: 98%

Finnish: fi_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 97%
Lemmatization accuracy: 86%

Swedish: sv_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 96%
Lemmatization accuracy: 96%

Norwegian: nb_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 97%
Lemmatization accuracy: 97%

Danish: da_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 96%
Lemmatization accuracy: 95%

Russian: ru_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 99%
Lemmatization accuracy: No information about the accuracy

Spanish: es_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 99%
Lemmatization accuracy: 97%

Catalan: ca_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 99%
Lemmatization accuracy: 98%

Chinese: zh_core_web_lg

Tokenization accuracy: 96%
Part-of-speech tags accuracy: No information about the accuracy
Lemmatization accuracy: No information about the accuracy

Dutch: nl_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 97%
Lemmatization accuracy: 95%

French: fr_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 98%
Lemmatization accuracy: 91%

German: de_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 97%
Lemmatization accuracy: 95%

Greek: el_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 96%
Lemmatization accuracy: 90%

Italian: it_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 97%
Lemmatization accuracy: 98%

Japanese: ja_core_news_lg

Tokenization accuracy: 99%
Part-of-speech tags accuracy: 97%
Lemmatization accuracy: 97%

Korean: ko_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 95%
Lemmatization accuracy: 90%

Lithuanian: lt_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 95%
Lemmatization accuracy: 86%

Polish: pl_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 98%
Lemmatization accuracy: 94%

Portuguese: pt_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 97%
Lemmatization accuracy: 97%

Romanian: ro_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 94%
Lemmatization accuracy: 96%

Slovenian: sl_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 98%
Lemmatization accuracy: 96%

Ukrainian: uk_core_news_lg

Tokenization accuracy: 100%
Part-of-speech tags accuracy: 98%
Lemmatization accuracy: No information about the accuracy

Undetermined (where the language of the tweet could not be determined by X/Twitter)

Tokenization accuracy: 99%

Other tokenization, POS tagging, and lemmatization tools are used for the following languages:

Icelandic
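For the spaCy-annotated languages listed above, the annotation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used to build the corpus: the names MODELS, model_for, and annotate are hypothetical, only four of the listed models are included, and the language codes used as keys are assumed. Running annotate requires spaCy and the relevant model to be installed.

```python
# Subset of the language-to-model pairs listed above; the two-letter keys
# are an assumption made for this sketch.
MODELS = {
    "en": "en_core_web_lg",
    "sv": "sv_core_news_lg",
    "nb": "nb_core_news_lg",
    "da": "da_core_news_lg",
}


def model_for(lang_code):
    """Return the spaCy model name for a language code, or None if unlisted."""
    return MODELS.get(lang_code)


def annotate(text, lang_code):
    """Return (token, POS tag, lemma) triples for text.

    spaCy is imported here so that the rest of the sketch runs without it.
    """
    import spacy

    model_name = model_for(lang_code)
    if model_name is None:
        raise ValueError(f"no model listed for language {lang_code!r}")
    nlp = spacy.load(model_name)
    return [(tok.text, tok.pos_, tok.lemma_) for tok in nlp(text)]


print(model_for("sv"))
```

Given an installed model, a call such as annotate("Jag dricker kaffe", "sv") would return one triple per token. In practice the loaded pipeline would be reused across many texts rather than reloaded on every call.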

The Search function of the NTS interface supports POSIX regular expressions. For a full guide with patterns, anchors, character classes, and practical corpus-linguistic examples, see the Help page.
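To give a feel for what such patterns look like, here is a small sketch in Python. Note the assumptions: Python's re module is not a POSIX engine, so the example uses only constructs that behave the same way in both (anchors, bracket expressions, optional characters, and counted repetition). The messages and pattern labels are invented, and the Help page remains the authority on what the interface accepts.

```python
import re

# Hypothetical example messages, not taken from the corpus.
messages = [
    "Hej alla, fin dag idag",
    "hei kaikki!",
    "My favourite colour is blue",
    "Flight at 0930 tomorrow",
]

# Patterns restricted to syntax shared by POSIX extended regular
# expressions and Python's re module.
patterns = {
    "starts with hej/hei": r"^[Hh]e[ji]",
    "color or colour": r"colou?r",
    "four-digit number": r"[0-9]{4}",
}


def matching(pattern, texts):
    """Return the texts in which the pattern matches somewhere."""
    return [t for t in texts if re.search(pattern, t)]


for label, pattern in patterns.items():
    print(label, "->", matching(pattern, messages))
```

The first pattern uses the ^ anchor to match only at the start of a message, the second makes the letter u optional, and the third requires four consecutive digits.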

NTS - Basic info and how to cite the material and the interface?

This corpus and the interface are the result of interdisciplinary research between sociolinguists and computer scientists. The work has been funded by the Center for Data Intensive Sciences and Applications (DISA) at Linnaeus University in Sweden, by the Research Council of Finland and their FIRI funding for FIN-CLARIAH, and by the University of Eastern Finland.

If you use the NTS interface and report the findings in your publications, please cite our paper, which is available online (NB: a newer version is in preparation and will be published soon):

Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.

Contact:

General comments: Mikko [dot] Laitinen [at] uef [dot] fi
Technical comments: Mehrdad [dot] Salimi [at] uef [dot] fi
Technical comments: Masoud [dot] Fatemi [at] uef [dot] fi

References:

Graham, M., S.A. Hale & D. Gaffney. 2013. Where in the world are you? Geolocation and language identification in Twitter. The Professional Geographer 66, 568-578. doi:10.1080/00330124.2014.907699
Wilkinson, M. D. et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. doi:10.1038/sdata.2016.18
PostgreSQL 16 Pattern Matching