About the Nordic Tweet Stream

The Nordic Tweet Stream - what it is, and what it isn't?

The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region. It offers access to very large-scale data covering about a decade, from January 2013 to May 2023, a period when this social networking platform was still known by its old name.

The corpus contains approximately 799 million words from over 888,000 user accounts. Altogether, there is material in 73 different languages, with the largest languages being Swedish (approximately 31%), English (approximately 26%), and Finnish (approximately 13%). Detailed information on the material can be found in the Statistics section.

The NTS builds on the idea that access to very large social media data for research cannot be left solely to social media companies. We need parallel storage of these valuable cultural heritage data so that they are available for researchers. We operate according to the FAIR Data Principles (see Fair). The guiding principles of FAIR aim at making data findable, accessible, interoperable, and reusable (Wilkinson et al., 2016). Currently, the Digital Single Market directive (2019/790) of the European Union makes text and data mining for scientific research purposes possible. It ensures that copies of primary material “may be retained for the purposes of scientific research, including for the verification of research results” (Title II, Article 3.2).

These data are often called born-digital data, meaning they have been user-generated digitally from the beginning. They have been collected for basic research purposes through the (now-discontinued) academic Twitter API, and it is no longer possible to collect a similar dataset for free.

While all the data accessible here come from an open source, using them as primary material in the humanities can be challenging. Researchers may face technical challenges related to data access, processing, enrichment, and use. That is why we have created an easy-to-use graphical interface that gives users full control of the material for basic research.

The objective of the interface is to enable easy access to and distribution of born-digital data for research.

Kindly note that the NTS does not offer access to all material that has been user-generated in the region. It contains only material for which the geolocation properties have been activated. Previous studies have estimated that the share of such material, in general, is low (see Graham et al., 2013).

All digital text corpora have limitations. Here, users are advised to refer to a classic article by the late Matti Rissanen (ICAME Journal 13, 16-19, 1989), in which he proposes three universal problems associated with the use of (historical) corpora. While all three are relevant here, the most important perhaps is the “God's truth fallacy.” Rissanen writes that an “authoritative corpus may easily create the erroneous impression that it gives an accurate reflection of the entire reality of the language it is intended to represent” (1989: 17). Likewise, it is important to keep in mind that the material here is a snapshot of languages in use in the Nordic region.

NTS - For Whom?

The NTS material is multilingual. We envisage that researchers from various fields, such as social sciences and cultural studies as well as sociolinguistics, dialectology, and so on could make use of this material, either as the sole primary data or additional material accompanying structured corpus data.

Please note that this interface is designed for users who want quick, first-pass access to the data. More advanced users might want to use the download function to get data for further processing elsewhere.

NTS - What Can You Do With It?

The interface consists of two pages. The first is the Search page, which enables you to search for character strings and phrases in various languages within the material. You can refine your search using several metadata parameters (e.g., limit by date range, location, language, or network size; multiple parameters can be selected).

The second page is the Results page, where you'll see the outcome with basic frequency information. It includes a visualization tool that maps your search results. The textual results are displayed in a KWIC window.

You can also download the raw results as a CSV or XLS file for further processing. In the download function, you can select the metadata parameters you want to include. For example, if you're only interested in the text, the date the message was sent, and the location from where the messages were sent, you can select Text, Date, and City. This download function is especially useful for expert users who want to use the NTS as a quick entry point to data.
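As a rough illustration of such further processing, the sketch below counts messages per city in a downloaded CSV file using Python's standard library. The column names (Text, Date, City) follow the example selection above, but the exact headers, date format, and encoding of a real NTS download are assumptions, and the sample rows are made up for the illustration.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for a downloaded file. The header names and the date
# format are assumptions; check them against an actual NTS download.
SAMPLE_CSV = """Text,Date,City
"God morgon, Stockholm!",2015-03-02,Stockholm
Hyvää huomenta,2015-03-02,Helsinki
Coming up next week,2016-07-11,Stockholm
"""


def count_by_city(csv_file):
    """Count rows per value of the (assumed) City column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["City"] for row in reader)


counts = count_by_city(io.StringIO(SAMPLE_CSV))
for city, n in counts.most_common():
    print(city, n)
```

With a real download, the same function could be called on a file opened with open(path, newline="", encoding="utf-8"), where path is wherever you saved the file.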

We have several ideas for functionalities to be included on the Results page, and we constantly strive to improve it. If there are functionalities you'd like to see, please let us know, and we'll be happy to try to include them.

The Search function of the NTS interface supports the following regular expressions (NB. These can be modified, so please use the feedback form to let us know what functions you'd like to have). The most important regular expressions are explained here:

Note: You can only use Regular Expressions in the Refined search tab.

Character  Description                                               Example         Finds
.          Matches any character.                                    r.n             run, ran
*          Repeat the preceding character zero or more times.        .*able          able, table, capable, etc.
+          Repeat the preceding character one or more times.         .+able          table, capable, adorable, but NOT able
?          Repeat the preceding character zero or one times.         runs?           run, runs
           Often used to make the preceding character optional.
{}         Minimum and maximum number of times the preceding         some.{5}s       somethings
           character can repeat.                                     .{1,4}able      capable
( … )      Forms a group. You can use a group to treat part of the   abc(def)?       abc, abcdef, but not abcd
           expression as a single character.
[ … ]      Match one of the characters in the brackets. Inside the   [abc]           a, b, c
           brackets, - indicates a range unless - is the first
           character or escaped.
|          OR operator. The match will succeed if the longest        was|were        both was and were
           pattern on either the left side OR the right side         I|me|my|mine    all of I, me, my, mine
           matches.
\w\s\w     \w matches a word character; \s matches a whitespace      .*ing\sup       coming up, going up
           character, so a pattern can span two words.
Some examples

These regular expressions can be combined with each other. Note, however, that only English tweets have been POS-tagged so far, so combining regular expressions with POS tags works only when searching the English material. For more signs, please check here. If you have a more complicated expression and need help, just send us an email and we will try to help.

Regular expressions are limited to 1,000 characters. For more information about regular expressions, visit here.
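If you want to try a pattern before running a search, the sketch below checks the table's examples with Python's re module. It is only an approximation: the NTS search engine may interpret some expressions differently from Python. The table's "abc(def)? but not abcd" example suggests that a pattern has to match the whole item, so the sketch uses re.fullmatch; that reading is an assumption, as is the helper name matches_whole_item.

```python
import re

MAX_PATTERN_LENGTH = 1000  # limit stated in the text above


def matches_whole_item(pattern, item):
    """Return True if the pattern matches the entire item (assumed NTS behaviour)."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError("pattern exceeds 1,000 characters")
    return re.fullmatch(pattern, item) is not None


# (pattern, item, expected result), taken from the table of expressions.
examples = [
    ("r.n", "run", True),
    (".*able", "able", True),
    (".+able", "able", False),
    (".+able", "table", True),
    ("runs?", "runs", True),
    ("some.{5}s", "somethings", True),
    (".{1,4}able", "capable", True),
    ("abc(def)?", "abcd", False),
    ("was|were", "were", True),
    (r".*ing\sup", "coming up", True),
]

for pattern, item, expected in examples:
    result = matches_whole_item(pattern, item)
    print(f"{pattern!r} vs {item!r}: {result}")
    assert result == expected
```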

NTS - Basic info and how to cite the material and the interface?

This corpus and the interface are the result of interdisciplinary research between sociolinguists and computer scientists. The work has been funded by the Center for Data Intensive Sciences and Applications (DISA) at Linnaeus University in Sweden, by the Research Council of Finland and their FIRI funding for FIN-CLARIAH, and by the University of Eastern Finland.

If you use the NTS interface and report the findings in your publications, please cite our paper, which is available online (NB: a newer version is in preparation and will be published soon):

Contact:

References: