4 March 2025
Ethan Te Puni
[FYI request #29991 email]
Tēnā koe Ethan
Official Information Act (OIA) request: Use of tohutō in Stats NZ databases
On 7 February 2025, you contacted Stats NZ requesting, under the Official Information Act
1982 (the Act) the following information:
• I would like to know how characters that are not standard in the English alphabet
are parsed and handled in Statistics New Zealand databases. I would specifically
like to know how tohutō are parsed in the Integrated Data Infrastructure.
The character encoding system in Stats NZ’s production processing SQL Server uses the
Windows collation Latin1_General_CI_AS, which in turn uses the Windows-1252 codepage
for character encoding. Character encoding assigns a numerical value to each
character, and the specific codepage or encoding used determines which characters can
be represented. The Windows-1252 codepage allows for 256 different characters, which
include some non-English characters, but not those with special diacritics like tohutō.
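As a minimal illustration of this limit, the Python sketch below checks whether a character has a slot in Windows-1252, using Python's built-in cp1252 codec as a stand-in for the codepage. The function name is hypothetical and the sketch is not taken from Stats NZ's systems.

```python
def can_encode_cp1252(ch: str) -> bool:
    """Return True if the character has a slot in Windows-1252 (Python's cp1252 codec)."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


# Some accented Latin letters are covered by the codepage...
print(can_encode_cp1252("é"))
print(can_encode_cp1252("Ä"))

# ...but vowels with tohutō (macrons) are not.
for vowel in "āēīōūĀĒĪŌŪ":
    print(vowel, can_encode_cp1252(vowel))
```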
The Latin1_General_CI_AS collation is a setting used in SQL Server to define the rules for
sorting and comparing strings of character data. In this collation:
• "Latin1" specifies the character set used, which is based on the Latin alphabet
• "General" indicates general-purpose rules for sorting and comparing characters
• "CI" stands for case-insensitive, meaning that uppercase and lowercase letters are
treated as equivalent
• "AS" stands for accent-sensitive, meaning that characters with accents are treated as
distinct from those without.
When it comes to tohutō, the "AS" setting treats characters with macrons as different from
those without macrons. For example, 'ā' (a with tohutō) is considered different from 'a'
(without tohutō). This ensures that accented characters are accurately represented and
distinguished in data processes.
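The comparison behaviour described above can be approximated in a few lines of Python. This is only a rough sketch of the case-insensitive, accent-sensitive idea, not SQL Server's actual collation algorithm, and the function name is hypothetical.

```python
def equal_ci_as(left: str, right: str) -> bool:
    """Rough stand-in for a case-insensitive, accent-sensitive equality test.

    Case differences are folded away; accents (including tohutō) are kept,
    so they still make two strings different.
    """
    return left.casefold() == right.casefold()


print(equal_ci_as("MAORI", "maori"))   # differs only by case
print(equal_ci_as("Māori", "Maori"))   # differs by a tohutō
print(equal_ci_as("Ā", "ā"))           # same accented letter, different case
```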
[email address]
toll-free 0508 525 525
stats.govt.nz
8 Willis Street, Wellington
PO Box 2922, Wellington 6140

The majority of Stats NZ’s data suppliers do not currently provide data containing such
special characters. When files with these characters are read into SQL, they are interpreted
as characters within the Windows-1252 codepage, displaying the equivalent character for
that numerical value. The presented character depends on the original file's character
encoding, usually resulting in a non-English character.
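The kind of misreading described above can be sketched in Python. The example assumes, purely for illustration, that the supplier's file was saved as UTF-8; other source encodings would give different results. The function name is hypothetical and this is not Stats NZ's ingestion code.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8 (assumed source encoding), then decode the bytes as Windows-1252."""
    raw = text.encode("utf-8")
    # Bytes with no Windows-1252 character become U+FFFD in this sketch.
    return raw.decode("cp1252", errors="replace")


print(misread_as_cp1252("ū"))   # → Å«
print(misread_as_cp1252("ī"))   # → Ä«
print(misread_as_cp1252("a"))   # plain ASCII is unaffected
```

Each vowel with a tohutō occupies two bytes in UTF-8, so under this assumption it shows up as two unrelated Windows-1252 characters rather than one.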
In the Integrated Data Infrastructure (IDI), all non-English characters present in name fields
are replaced with English characters prior to Stats NZ IDI linking processes. For example, ‘Ä’
will be replaced with ‘A’. IDI linking is done between the ‘spine’ (a combination of births, visa,
and tax information intended to cover New Zealand’s ever-resident population), and ‘nodes’
which are all other datasets, e.g. data from other government agencies. Non-English
characters are replaced to help with linking quality: if a person’s spine record contains
non-English characters but the node record does not (or vice versa), it will affect the likelihood
that the two records will be linked. This is important given the relatively small number of
datasets we receive containing tohutō or other non-English characters.
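One common way to carry out this kind of replacement is to decompose each character and drop the combining marks. The Python sketch below shows that general technique only; it is an assumption for illustration, not a description of the method used in the IDI, and both function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks such as macrons and umlauts."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def names_match(spine_name: str, node_name: str) -> bool:
    """Compare two names after stripping diacritics and folding case."""
    return strip_diacritics(spine_name).casefold() == strip_diacritics(node_name).casefold()


print(strip_diacritics("Ä"))              # → A
print(strip_diacritics("Wīremu"))         # → Wiremu
print(names_match("Wīremu", "Wiremu"))    # → True
```

Letters that have no decomposition, such as 'ø', pass through this sketch unchanged, so a real process would need an explicit mapping for them.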
Finally, identifying information like names and addresses is removed from the clean,
de-identified data that IDI users see. Non-English characters may still be present in other text
fields, like place names. Similarly, some auxiliary geographic tables sourced from other Stats
NZ databases may contain tohutō.
Should you wish to discuss this response with us, please feel free to contact Stats NZ at:
[email address].
If you are not satisfied with this response, you have the right to seek an investigation and review
by the Ombudsman. Information about how to make a complaint is available at
www.ombudsman.parliament.nz or 0800 802 602.
It is Stats NZ’s policy to proactively release its responses to official information requests where
possible. This letter, with your personal details removed, will be published on the Stats NZ
website. Publishing responses creates greater openness and transparency of government
decision-making and helps better inform public understanding of the reasons for decisions.
Nāku noa, nā
Matt Phimmavanh
Principal Advisor – Executive & Government Relations | Office of the Chief Executive
Stats NZ Tatauranga Aotearoa
stats.govt.nz