The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region. It builds on the idea that access to very large social media data for research cannot be left solely at the mercy of social media companies. We need parallel storage of these valuable cultural heritage data so that they remain available to researchers. The data were collected through the (now-discontinued) academic Twitter API.
These data are what is often called born-digital data, meaning that they have been user-generated digitally from the beginning. They have been collected for basic research purposes. While all the data accessible here are from an open source, using them as primary material in the humanities can be challenging, since researchers may face technical challenges related to data access, processing, enrichment and use. The objective of this digital interface is to enable easy access to and distribution of born-digital data for basic research. We operate according to the FAIR Data Principles (see Fair). The guiding principles of FAIR aim at making data findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).
Currently, the Digital Single Market directive (2019/790) of the European Union makes text and data mining for the purposes of scientific research possible. It ensures that copies of primary material “may be retained for the purposes of scientific research, including for the verification of research results” (Title II, Article 3.2).
The NTS offers data access from January 2013 to May 2023. We'll gradually make all the data (c. 799 million words from over 888,000 user accounts) available once we sort out a few technical details. Altogether, there is material in 73 different languages in the dataset. The largest languages are Swedish (c. 31 %), English (c. 26 %) and Finnish (c. 13 %). Detailed information about the material is found in Statistics.
The NTS does not offer access to all material that has been user-generated in the region; it contains only material for which the geolocation properties have been activated. Previous studies have estimated that the share of such material is generally low (see Graham et al. 2013).
All digital text corpora have limitations. Here, users are referred to the by now classic article by the late Matti Rissanen (ICAME Journal 13, 16-19, 1989), in which he identifies three universal problems associated with the use of (historical) corpora. While all three are relevant here, the most important is perhaps the “God's truth fallacy”. Rissanen writes that an “authoritative corpus may easily create the erroneous impression that it gives an accurate reflection of the entire reality of the language it is intended to represent” (1989: 17). Likewise, it is important to keep in mind that the material here is only a snapshot of the languages in use in the Nordic region.
The NTS material is multilingual. We envisage that researchers from various fields, such as sociolinguistics, dialectology, the social sciences, and cultural studies, could make use of this material, either as their sole primary data or as additional material accompanying structured corpus data.
Please note that this interface is designed for users who want a first, quick access to the data. More advanced users might want to use the download function and process the data further elsewhere.
The interface consists of two pages. The first is the Search page that enables you to – well – search for character strings and phrases in all the different languages in the material. You can also restrict your searches using a few metadata parameters in the search page (e.g., limit by date range, limit by location, limit by language, or limit by network size; it is possible to select several restricting parameters).
The second page is the Results page, which shows the outcome of your search with some basic frequency information. It also includes a tool that visualizes your search on a map. The textual results are shown in a KWIC (keyword-in-context) window.
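For readers unfamiliar with the KWIC format, the following minimal Python sketch shows the general idea: each hit is displayed with a fixed amount of text to its left and right. It is not the interface's own implementation; the function name `kwic`, the `width` parameter and the sample messages are all invented for illustration.

```python
import re

def kwic(messages, pattern, width=30):
    """Return (left, match, right) context triples for every regex hit."""
    rx = re.compile(pattern, re.IGNORECASE)
    rows = []
    for text in messages:
        for m in rx.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            rows.append((left, m.group(0), right))
    return rows

# Invented sample messages, not taken from the NTS.
sample = ["My purse is missing", "Someone stole my purse."]

# Right-align the left context so that the hits line up in one column.
for left, hit, right in kwic(sample, r"purse"):
    print(f"{left:>30} [{hit}] {right}")
```

Aligning the hits in a single column is what makes recurring patterns around the search term easy to spot.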
You can also download the raw results as a csv-file or an xls-file for further processing. In the download function, you can select the metadata parameters that you want to have included in the file. So, if you're only interested in the text, the date on which the message was sent, and the location from which it was sent, just select Text, Date and City. This download function could be especially useful for more expert users who want to use the NTS as a first, quick point of entry to the data.
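As an illustration of such further processing, here is a minimal Python sketch that reads a downloaded csv-file and counts messages per city and per year. It rests on several assumptions: that the column headers in the file are literally `Text`, `Date` and `City`, that dates begin with a four-digit year, and that the file is comma-separated. The sample rows are invented, so please check an actual download before relying on any of this.

```python
import csv
import io
from collections import Counter

# Invented stand-in for a downloaded file; the header names are assumed
# to match the Text, Date and City parameters chosen in the download function.
downloaded = io.StringIO(
    "Text,Date,City\n"
    "My purse is missing,2020-01-05,Joensuu\n"
    "Someone stole my purse.,2020-01-06,Växjö\n"
    "Coming up next,2021-03-10,Joensuu\n"
)

rows = list(csv.DictReader(downloaded))

# Count messages per city and per year (first four characters of Date).
per_city = Counter(row["City"] for row in rows)
per_year = Counter(row["Date"][:4] for row in rows)

print(per_city.most_common())
print(sorted(per_year.items()))
```

To work with a real download, the `io.StringIO` object could be replaced with `open("results.csv", newline="", encoding="utf-8")`, where `results.csv` is a hypothetical file name.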
We have a number of ideas for functionalities to be included in the Results page, and we constantly try to improve it. If there are functionalities that you'd like to have included, please let us know, and we'll be happy to try to include them.
The Search function of the NTS interface supports the following regular expressions (NB: these can be modified, so please use the feedback form to let us know what kind of functions you'd like to have). The most important signs are explained here:
Character | Description | Example | Finds |
---|---|---|---|
Character-level query syntax | |||
. | Any character (except newline character) | r.n | run, ran |
* | Zero or more occurrences | .*able | able, table, capable, etc. |
+ | One or more occurrences | \S+able | table, capable, adorable, NOT able |
? | Zero or one occurrence of the preceding character | runs? | run, runs |
[[:<:]] | Returns a match where the specified characters are at the beginning of a word | [[:<:]]tr | train, training |
[[:>:]] | Returns a match where the specified characters are at the end of a word | ain[[:>:]] | train, NOT training |
{} | Exactly the specified number of occurrences | some.{5}s | somethings |
{} | Exactly the specified number of occurrences | .{3}able | capable |
\| | Either or | was\|were | both was and were |
\| | Either or | I\|me\|my\|mine | finds all: I, me, my, mine |
\w\s\w | Word characters (\w) on either side of a whitespace character (\s); use \s to match the space between two words | .*ing\sup | coming up, going up |
Message-level query syntax | |||
^ | Messages that start with | ^some | “Someone stole my purse.” |
$ | Messages that end with | ing$ | “My purse is missing” |
These regular expressions can be combined with each other. Note, however, that only English tweets have so far been tagged for part of speech (POS). Combining regular expressions and POS tags therefore only works when searching the English material. For more signs, please check here. If you have a more complicated expression and need help, just send us an email and we will try to help.
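If you would like to try out an expression before searching, the following Python sketch checks several patterns from the table against the words they are said to find. It is only an approximation of the interface's behaviour: Python's `re` module does not accept `[[:<:]]` or `[[:>:]]`, so the sketch substitutes `\b`, and it assumes case-insensitive matching anywhere in the text (suggested by the `^some` example, but not confirmed). The helper name `finds` is invented for illustration.

```python
import re

# Patterns from the table, with the word-boundary markers rewritten as \b
# because Python's re module does not accept [[:<:]] or [[:>:]].
examples = {
    r"r.n": ["run", "ran"],
    r".*able": ["able", "table", "capable"],
    r"\S+able": ["table", "capable", "adorable"],
    r"some.{5}s": ["somethings"],
    r".{3}able": ["capable"],
    r"was|were": ["was", "were"],
    r".*ing\sup": ["coming up", "going up"],
    r"\btr": ["train", "training"],
    r"ain\b": ["train"],
}

def finds(pattern, text):
    """True if the pattern matches anywhere in text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

for pattern, words in examples.items():
    for word in words:
        print(pattern, word, finds(pattern, word))

# Negative cases listed in the table.
print(finds(r"\S+able", "able"))     # no character before "able"
print(finds(r"ain\b", "training"))   # "ain" is not at the end of the word

# Message-level anchors, and a combination of several signs.
print(finds(r"^some", "Someone stole my purse."))
print(finds(r"ing$", "My purse is missing"))
print(finds(r"^my\s.*ing$", "My purse is missing"))
```

The last line combines `^`, `\s`, `.*` and `$` in a single expression, matching messages that start with "my" followed by a space and end in "ing".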
This corpus and the interface are the result of interdisciplinary research between sociolinguists and computer scientists. The work has been funded by the Center for Data Intensive Sciences and Applications (DISA) at Linnaeus University in Sweden, by the Research Council of Finland and their FIRI funding for FIN-CLARIAH, and by the University of Eastern Finland.
If you use the NTS interface and report the findings in your publications, please cite our paper, which is available online (NB: a newer version is in preparation and will be published soon):
Please contact Prof. Mikko Laitinen (general comments) and MSc. Mehrdad Salimi and MSc. Masoud Fatemi (technical matters) with any comments or questions on the corpus and the interface.
First name [dot] last name [at] uef.fi