How to use the NTS interface
The Search function of the NTS interface supports POSIX regular expressions. The most useful patterns for corpus research are explained below.
Note: You can choose regular expression search in the SEARCH TYPE box.
"A regular expression is allowed to match anywhere within a string, unless the regular expression is explicitly anchored to the beginning or end of the string." (source)
Basic Patterns
| Pattern | Description | Example | Matches |
|---|---|---|---|
| `.` | Any single character | `r.n` | run, ran, r1n |
| `*` | Zero or more of the preceding element | `.*able` | able, table, capable |
| `+` | One or more of the preceding element | `.+able` | table, capable (but not able) |
| `?` | Zero or one of the preceding element (makes it optional) | `colou?r` | color, colour |
| `{m,n}` | Between m and n repetitions of the preceding element | `.{1,4}able` | cable, table, capable |
| `\|` | Alternation (OR) | `was\|were` | was, were |
| `( … )` | Group sub-expression (treated as a single unit) | `(pre\|re)view` | preview, review |
| `[ … ]` | Match any one character in the set; use `-` for ranges | `[aeiou]` | any single vowel |
| `[^ … ]` | Match any one character not in the set | `[^aeiou]` | any single non-vowel character |
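The quantifiers and grouping operators above can be tried out offline before running a search. The following is a minimal sketch in Python, whose `re` module is not the engine behind NTS but treats these basic operators the same way; the token list and the `matching` helper are invented for illustration.

```python
import re

# Invented sample tokens, not taken from any corpus.
tokens = ["able", "table", "capable", "color", "colour",
          "preview", "review", "view"]

def matching(pattern, words):
    """Return the words in which the pattern matches anywhere."""
    return [w for w in words if re.search(pattern, w)]

star_hits = matching(r".*able", tokens)         # zero or more characters before "able"
plus_hits = matching(r".+able", tokens)         # at least one character before "able"
optional_hits = matching(r"colou?r", tokens)    # the "u" may be absent
group_hits = matching(r"(pre|re)view", tokens)  # either prefix, then "view"

print(star_hits)      # ['able', 'table', 'capable']
print(plus_hits)      # ['table', 'capable']
print(optional_hits)  # ['color', 'colour']
print(group_hits)     # ['preview', 'review']
```

The only difference between the first two results is the bare token "able", which `+` excludes because it requires at least one preceding character.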
Anchors & Boundaries
| Pattern | Description | Example | Matches |
|---|---|---|---|
| `^` | Beginning of string | `^the` | "the" only at the start of a token |
| `$` | End of string | `ing$` | tokens ending in "ing" |
| `^word$` | Exact match (anchored at both ends) | `^the$` | "the" and nothing else |
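Because an unanchored pattern may match anywhere inside a token, the choice of anchors changes the result set considerably. The sketch below illustrates this with Python's `re.search`, which for these anchors is assumed to behave like the NTS search; the tokens are made up.

```python
import re

# Invented sample tokens.
tokens = ["the", "then", "other", "bathe", "sing", "singer"]

unanchored = [t for t in tokens if re.search(r"the", t)]    # "the" anywhere
starts_with = [t for t in tokens if re.search(r"^the", t)]  # "the" at the start
exact = [t for t in tokens if re.search(r"^the$", t)]       # "the" and nothing else
ends_ing = [t for t in tokens if re.search(r"ing$", t)]     # "ing" at the end

print(unanchored)   # ['the', 'then', 'other', 'bathe']
print(starts_with)  # ['the', 'then']
print(exact)        # ['the']
print(ends_ing)     # ['sing']
```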
Character Class Shorthands
| Shorthand | Description | Equivalent |
|---|---|---|
| `\d` | Any digit | `[[:digit:]]`, i.e. `[0-9]` |
| `\D` | Any non-digit | `[^[:digit:]]` |
| `\w` | Any word character (letter, digit, underscore) | `[[:word:]]`, i.e. `[A-Za-z0-9_]` |
| `\W` | Any non-word character | `[^[:word:]]` |
| `\s` | Any whitespace character | `[[:space:]]` |
| `\S` | Any non-whitespace character | `[^[:space:]]` |
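The shorthands can be checked the same way. This is a rough sketch in Python; the `re.ASCII` flag is used here so that `\d` and `\w` correspond to the ASCII equivalents listed in the table, which is an assumption of the example and may differ from how the NTS database treats non-ASCII text. The tokens are invented.

```python
import re

# Invented sample tokens.
tokens = ["2024", "covid19", "hello", "e-mail", "snake_case", "New York"]

def matching(pattern, words):
    """Return the words in which the pattern matches anywhere (ASCII classes)."""
    return [w for w in words if re.search(pattern, w, flags=re.ASCII)]

with_digit = matching(r"\d", tokens)          # at least one digit
only_word_chars = matching(r"^\w+$", tokens)  # letters, digits, underscore only
with_non_word = matching(r"\W", tokens)       # at least one non-word character
with_space = matching(r"\s", tokens)          # at least one whitespace character

print(with_digit)       # ['2024', 'covid19']
print(only_word_chars)  # ['2024', 'covid19', 'hello', 'snake_case']
print(with_non_word)    # ['e-mail', 'New York']
print(with_space)       # ['New York']
```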
POSIX Character Classes (use inside brackets)
| Class | Description | Example |
|---|---|---|
| `[[:alpha:]]` | Any letter | `[[:alpha:]]+`: one or more letters |
| `[[:digit:]]` | Any digit (0–9) | `[[:digit:]]{4}`: exactly four digits (e.g. a year) |
| `[[:lower:]]` | Any lower-case letter | `^[[:lower:]]+$`: all-lowercase tokens |
| `[[:upper:]]` | Any upper-case letter | `^[[:upper:]]+$`: all-uppercase tokens (e.g. acronyms) |
| `[[:punct:]]` | Any punctuation character | `[[:punct:]]`: matches `!`, `?`, `,`, etc. |
| `[[:alnum:]]` | Any letter or digit | `^[[:alnum:]]+$`: tokens with no punctuation |
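Python's `re` module does not understand POSIX class names, so patterns from this table cannot be pasted into Python unchanged. If you want to try them offline, one option is to substitute ASCII ranges first. The helper below, `to_python`, is a hypothetical convenience written for this page, and its ASCII-only ranges are an assumption: the search engine itself may also count accented and other non-ASCII letters.

```python
import re

# ASCII-only stand-ins for the POSIX class names (an assumption of this sketch).
POSIX_TO_ASCII = {
    "[:alpha:]": "A-Za-z",
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
    "[:alnum:]": "A-Za-z0-9",
    "[:punct:]": r"!-/:-@\[-`{-~",
}

def to_python(pattern):
    """Hypothetical helper: swap POSIX class names for ASCII ranges.

    The class name sits inside an outer bracket expression, so
    "[[:upper:]]" becomes "[A-Z]" and "[^[:digit:]]" becomes "[^0-9]".
    """
    for name, ascii_range in POSIX_TO_ASCII.items():
        pattern = pattern.replace(name, ascii_range)
    return pattern

# Invented sample tokens.
tokens = ["NASA", "corpus", "1999", "Hello", "e-mail", "?"]

def matching(posix_pattern):
    return [t for t in tokens if re.search(to_python(posix_pattern), t)]

acronyms = matching("^[[:upper:]]{2,}$")
lowercase = matching("^[[:lower:]]+$")
four_digits = matching("[[:digit:]]{4}")
no_punct = matching("^[[:alnum:]]+$")
has_punct = matching("[[:punct:]]")

print(acronyms)     # ['NASA']
print(lowercase)    # ['corpus']
print(four_digits)  # ['1999']
print(no_punct)     # ['NASA', 'corpus', '1999', 'Hello']
print(has_punct)    # ['e-mail', '?']
```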
Practical Examples for Corpus Linguistics
| Task | Pattern | Explanation |
|---|---|---|
| Words ending in -ing | `.*ing$` | Any characters followed by "ing" at end |
| Words starting with un- | `^un.*` | Starts with "un" followed by anything |
| Spelling variation: color/colour | `colou?r` | The "u" is optional |
| Suffix alternation: -ise/-ize | `.*i[sz]e$` | Either "s" or "z" before final "e" |
| Personal pronouns (first-person singular forms) | `^(I\|me\|my\|mine)$` | Exact match of any listed form |
| Contracted forms with apostrophe | `.*n't$` | don't, won't, can't, etc. |
| Words of exactly 3 letters | `^[[:alpha:]]{3}$` | Exactly three letters, anchored |
| Tokens containing digits | `\d` | Any token with at least one digit |
| ALL CAPS tokens (e.g. acronyms) | `^[[:upper:]]{2,}$` | Two or more uppercase letters only |
| Verb forms: go/goes/going/gone/went | `^(go\|goes\|going\|gone\|went)$` | Alternation with anchors for exact match |
| Phrasal verb particle in context position | `^(up\|out\|off\|on\|in\|down\|away\|back)$` | Common particles (use in context filter with regex toggle) |
| Reduplicated forms | `^(.+)\1$` | Back reference: same sequence repeated (e.g. "mama", "byebye") |
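A few of the patterns above, including the back reference, can be sanity-checked on a small word list before they are used in a search. This is a sketch only: it uses Python's `re` module as a stand-in for the NTS engine and sticks to patterns whose syntax both are assumed to share. The tokens are invented.

```python
import re

# Invented sample tokens.
tokens = ["organise", "organize", "size", "don't", "won't", "not",
          "mama", "byebye", "banana", "going"]

def matching(pattern):
    """Return the tokens in which the pattern matches."""
    return [t for t in tokens if re.search(pattern, t)]

ise_ize = matching(r".*i[sz]e$")      # "s" or "z" before the final "e"
contracted = matching(r".*n't$")      # negative contractions
reduplicated = matching(r"^(.+)\1$")  # \1 repeats whatever group 1 captured
ing_forms = matching(r".*ing$")       # tokens ending in "ing"

print(ise_ize)       # ['organise', 'organize', 'size']
print(contracted)    # ["don't", "won't"]
print(reduplicated)  # ['mama', 'byebye']
print(ing_forms)     # ['going']
```

Note that "size" is returned by the -ise/-ize pattern even though it is not a suffixed verb, so results from suffix patterns usually need manual filtering.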
You can also toggle regex mode for individual context positions (L1–L5, R1–R5), allowing you to use patterns like ^(the|a|an)$ to match determiners in a specific slot.
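To see why the anchors matter in a context slot, consider the following sketch. The list of hits and its `L1`/`node`/`R1` keys are a hypothetical data layout invented for this example; it does not describe how NTS stores or returns results.

```python
import re

# Hypothetical concordance hits: a node word with its nearest left (L1)
# and right (R1) neighbours. The layout is an assumption for illustration.
hits = [
    {"L1": "the", "node": "corpus", "R1": "shows"},
    {"L1": "their", "node": "corpus", "R1": "is"},
    {"L1": "a", "node": "corpus", "R1": "of"},
    {"L1": "another", "node": "corpus", "R1": "was"},
]

anchored = re.compile(r"^(the|a|an)$")  # the whole L1 token must be a determiner
loose = re.compile(r"(the|a|an)")       # matches anywhere inside the L1 token

anchored_l1 = [h["L1"] for h in hits if anchored.search(h["L1"])]
loose_l1 = [h["L1"] for h in hits if loose.search(h["L1"])]

print(anchored_l1)  # ['the', 'a']
print(loose_l1)     # ['the', 'their', 'a', 'another']
```

Without `^` and `$`, "their" and "another" slip through because they contain "the" and "a" as substrings.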
For the full PostgreSQL regular expression reference, see the documentation. If you need help constructing a complex pattern, please contact us.
Contact:
- General comments: Mikko [dot] Laitinen [at] uef [dot] fi
- Technical comments: Mehrdad [dot] Salimi [at] uef [dot] fi
- Technical comments: Masoud [dot] Fatemi [at] uef [dot] fi