Read an SPSS Data File
read.spss reads a file stored by the SPSS save or
export commands.
This was originally written in 2000 and has limited support for changes in SPSS formats since (which have not been many).
Usage
read.spss(file, use.value.labels = TRUE, to.data.frame = FALSE,
max.value.labels = Inf, trim.factor.names = FALSE,
trim_values = TRUE, reencode = NA, use.missings = to.data.frame,
sub = ".", add.undeclared.levels = c("sort", "append", "no"),
duplicated.value.labels = c("append", "condense"),
duplicated.value.labels.infix = "_duplicated_", ...)
Arguments
- file
character string: the name of the file or URL to read.
- use.value.labels
logical: convert variables with value labels into R factors with those levels? This is only done if there are at least as many labels as values of the variable (when values without a matching label are returned as NA).
- to.data.frame
logical: return a data frame?
- max.value.labels
numeric: only variables with value labels and at most this many unique values will be converted to factors if use.value.labels = TRUE.
- trim.factor.names
logical: trim trailing spaces from factor levels?
- trim_values
logical: should values and value labels have trailing spaces ignored when matching for use.value.labels = TRUE?
- reencode
logical: should character strings be re-encoded to the current locale? The default, NA, means to do so in UTF-8 or latin-1 locales only. Alternatively, a character string specifying an encoding to assume for the file.
- use.missings
logical: should information on user-defined missing values be used to set the corresponding values to NA?
- sub
character string: if not NA, it is used by iconv to replace any non-convertible bytes in character/factor input. Default is ".". For back compatibility with foreign versions <= 0.8-68 use sub=NA.
- add.undeclared.levels
character: specifies how to handle variables with at least one value label and further non-missing values that have no value label (like factor levels in R). For "sort" (the default) it adds undeclared factor levels to the already declared levels (and labels) and sorts them according to level, for "append" it appends undeclared factor levels to declared levels (and labels) without sorting, and for "no" it does not convert to factor in case of numeric SPSS levels (not labels), but still converts to factor if the SPSS levels are characters and to.data.frame=TRUE. For back compatibility with foreign versions <= 0.8-68 use add.undeclared.levels="no" (not recommended as this may convert some values with missing corresponding value labels to NA).
- duplicated.value.labels
character: what to do with duplicated value labels for different levels. For "append" (the default), the first original value label is kept while further duplicated labels are renamed to paste0(label, duplicated.value.labels.infix, level); for "condense", all levels with identical labels are condensed into exactly the first of these levels in R. Back compatibility with foreign versions <= 0.8-68 is not given as R versions >= 3.4.0 no longer support duplicated factor labels.
- duplicated.value.labels.infix
character: the infix used for labels of factor levels with duplicated value labels in SPSS (default "_duplicated_") if duplicated.value.labels="append".
- ...
passed to as.data.frame if to.data.frame = TRUE.
Value
A list (or optionally a data frame) with one component for each variable in the saved data set.
If what looks like a Windows codepage was recorded in the SPSS file,
it is attached (as a number) as attribute "codepage" to the
result.
There may be attributes "label.table" and
"variable.labels". Attribute "label.table" is a named
list of value labels with one element per variable, either NULL
or a named character vector. Attribute "variable.labels" is a
named character vector with names the short variable names and
elements the long names.
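As a rough sketch of how these attributes can be inspected, the following reads the electric.sav file shipped with foreign (the same file used in the Examples) and looks up one variable label and one set of value labels. It keeps the default list result, because the data frame output shown in the Examples carries only "variable.labels".

```r
library(foreign)

# Example file distributed with the foreign package
sav <- system.file("files", "electric.sav", package = "foreign")

# Default list result, which keeps the "label.table" attribute
dat <- read.spss(sav)

# Long variable names, indexed by the short variable names
var_labels <- attr(dat, "variable.labels")
var_labels[["AGE"]]

# Value labels for one variable: names are the labels, elements the SPSS values
vital_labels <- attr(dat, "label.table")$VITAL10
vital_labels

# Variables without value labels have a NULL entry
is.null(attr(dat, "label.table")$AGE)
```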
If there are user-defined missing values, there will be an attribute
"missings". This is a named list with one list element per
variable. Each element has an element type, a length-one
character vector giving the type of missingness, and may also have an
element value with the values corresponding to missingness.
This is a complex subject (where the R and C source code for
read.spss is the main documentation), but the simplest cases
are types "one", "two" and "three" with a
corresponding number of (real or string) values whose labels can be
found from the "label.table" attribute. Other possibilities are
a finite or semi-infinite range, possibly plus a single value.
See also http://www.gnu.org/software/pspp/manual/html_node/Missing-Observations.html#Missing-Observations.
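A minimal sketch of inspecting the missing-value information, assuming the electric.sav example file from foreign: it lists the variables that declare user-defined missing values and looks up the label attached to the declared value. The helper names (miss_types, day_labels) are illustrative only.

```r
library(foreign)

sav <- system.file("files", "electric.sav", package = "foreign")
dat <- read.spss(sav)  # list result, which keeps the "missings" attribute

miss <- attr(dat, "missings")

# Type of missingness declared for each variable
miss_types <- vapply(miss, function(m) m$type, character(1))
miss_types[miss_types != "none"]

# Declared missing value for DAYOFWK and the matching value label
miss_value <- miss$DAYOFWK$value
day_labels <- attr(dat, "label.table")$DAYOFWK
names(day_labels)[day_labels %in% miss_value]
```

This only covers the simple discrete types; range-based declarations would need separate handling.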
Details
This uses modified code from the PSPP project (http://www.gnu.org/software/pspp/) for reading the SPSS formats.
If the filename appears to be a URL (of schemes http:,
ftp: or https:) the URL is first downloaded to a
temporary file and then read. (https: is supported where
supported by download.file with its current default
method.)
Occasionally in SPSS, value labels will be added to some values of a
continuous variable (e.g. to distinguish different types of missing
data), and you will not want these variables converted to factors. By
setting max.value.labels you can specify that variables with a
large number of distinct values are not converted to factors even if
they have value labels.
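The effect of max.value.labels can be sketched with the electric.sav example file. The threshold of 5 is an arbitrary illustrative choice, not a recommendation.

```r
library(foreign)

sav <- system.file("files", "electric.sav", package = "foreign")

# Default: in this file, every variable with value labels becomes a factor
dat_all <- read.spss(sav, to.data.frame = TRUE)

# Illustrative threshold: skip factor conversion for variables
# with more than 5 distinct values
dat_max5 <- read.spss(sav, to.data.frame = TRUE, max.value.labels = 5)

is.factor(dat_all$DAYOFWK)   # converted by default
is.factor(dat_max5$DAYOFWK)  # not converted: more than 5 distinct days occur
is.factor(dat_max5$VITAL10)  # still a factor: only two distinct values
```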
If SPSS variable labels are present, they are returned as the
"variable.labels" attribute of the answer.
Fixed length strings (including value labels) are padded on the right
with spaces by SPSS, and so are read that way by R. The default
argument trim_values=TRUE causes trailing spaces to be ignored
when matching to value labels, as examples have been seen where the
strings and the value labels had different amounts of padding. See
the examples in the help for the R function sub for ways to remove
trailing spaces in character data.
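One way to strip the padding after import is sketched below with base R only. The padded values and the data frame are made up for illustration; with real data the same pattern would be applied to the character columns returned by read.spss.

```r
# Character values padded on the right, as SPSS stores fixed-length strings
# (made-up values for illustration)
padded <- c("Y   ", "N   ", "YES ")

# Remove one or more trailing spaces
trimmed <- sub(" +$", "", padded)
trimmed

# Applied to every character column of a data frame
df <- data.frame(id = 1:3, answer = padded, stringsAsFactors = FALSE)
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(col) sub(" +$", "", col))
df$answer
```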
URL https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
provides a list of translations from Windows codepage numbers to
encoding names that iconv is likely to know about and so
suitable values for reencode. Automatic re-encoding is
attempted for apparent codepages of 200 or more in a UTF-8 or latin-1 locale:
some other high-numbered codepages can be re-encoded on most systems,
but the encoding names are platform-dependent (see
iconvlist).
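To illustrate what re-encoding from a Windows codepage involves, here is a small sketch using iconv directly on two bytes. It assumes the encoding name "CP1252" is known to the platform's iconv, which is common but not guaranteed; iconvlist() can be consulted to check. The file name in the final comment is hypothetical.

```r
# The bytes 0xE4 and 0xF6 are "ä" and "ö" in Windows codepage 1252
cp1252_text <- rawToChar(as.raw(c(0xE4, 0xF6)))

# Convert to UTF-8, replacing any non-convertible byte with "."
# (assumes this platform's iconv knows the name "CP1252")
utf8_text <- iconv(cp1252_text, from = "CP1252", to = "UTF-8", sub = ".")
utf8ToInt(utf8_text)  # Unicode code points of the converted characters

# The same encoding name could then be supplied when reading a file, e.g.
# read.spss("survey.sav", reencode = "CP1252")  # "survey.sav" is a hypothetical path
```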
Note
If SPSS value labels are converted to factors the underlying numerical codes will not in general be the same as the SPSS numerical values, since the numerical codes in R are always \(1,2,3,\dots\).
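If the original SPSS values are needed, one possible approach is to look them up through the value labels. The sketch below uses a small made-up factor together with the VITAL10 labels as they appear in the "label.table" attribute in the Examples output; it assumes the labels are unique and match the factor levels exactly.

```r
# Value labels as shown for VITAL10 in the Examples output:
# names are the labels, elements are the SPSS values
label_table <- c(DEAD = 1, ALIVE = 0)

# A small made-up factor with the same levels as the imported variable
vital <- factor(c("ALIVE", "ALIVE", "DEAD"), levels = c("ALIVE", "DEAD"))

as.integer(vital)  # R's internal codes: 1 1 2

# Look up the SPSS value for each observation via its label
spss_values <- unname(label_table[as.character(vital)])
spss_values        # 0 0 1
```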
You may see warnings about the file encoding for SPSS save
files: it is possible such files contain non-ASCII character data
which need re-encoding. The most common occurrence is Windows codepage
1252, a superset of Latin-1. The encoding is recorded (as an integer)
in attribute "codepage" of the result if it looks like a
Windows codepage. Automatic re-encoding is done only in UTF-8 and latin-1
locales: see argument reencode.
See also
A different interface also based on the PSPP codebase is available in
package memisc: see its help for spss.system.file.
Examples
(sav <- system.file("files", "electric.sav", package = "foreign"))
#> [1] "/tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/electric.sav"
dat <- read.spss(file=sav)
str(dat) # list structure with attributes
#> List of 13
#> $ CASEID : num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#> $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#> $ AGE : num [1:240] 40 49 43 50 43 50 45 47 53 49 ...
#> $ DBP58 : num [1:240] 70 87 89 105 110 88 70 79 102 99 ...
#> $ EDUYR : num [1:240] 16 11 12 8 NA 8 NA 9 12 14 ...
#> $ CHOL58 : num [1:240] 321 246 262 275 301 261 212 372 216 251 ...
#> $ CGT58 : num [1:240] 0 60 0 15 25 30 0 30 0 10 ...
#> $ HT58 : num [1:240] 68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#> $ WT58 : num [1:240] 190 204 162 152 148 142 196 193 172 162 ...
#> $ DAYOFWK : Factor w/ 8 levels "SUNDAY","MONDAY",..: 8 5 7 4 2 1 8 1 3 5 ...
#> $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#> $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#> $ CHD : num [1:240] 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "label.table")=List of 13
#> ..$ CASEID : NULL
#> ..$ FIRSTCHD: Named num [1:5] 6 5 3 2 1
#> .. ..- attr(*, "names")= chr [1:5] "OTHER CHD" "FATAL MI" "NONFATALMI" "SUDDEN DEATH" ...
#> ..$ AGE : NULL
#> ..$ DBP58 : NULL
#> ..$ EDUYR : NULL
#> ..$ CHOL58 : NULL
#> ..$ CGT58 : NULL
#> ..$ HT58 : NULL
#> ..$ WT58 : NULL
#> ..$ DAYOFWK : Named num [1:8] 9 7 6 5 4 3 2 1
#> .. ..- attr(*, "names")= chr [1:8] "MISSING" "SATURDAY" "FRIDAY" "THURSDAY" ...
#> ..$ VITAL10 : Named num [1:2] 1 0
#> .. ..- attr(*, "names")= chr [1:2] "DEAD" "ALIVE"
#> ..$ FAMHXCVR: Named chr [1:2] "Y " "N "
#> .. ..- attr(*, "names")= chr [1:2] "YES" "NO"
#> ..$ CHD : NULL
#> - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#> ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
#> - attr(*, "missings")=List of 13
#> ..$ CASEID :List of 1
#> .. ..$ type: chr "none"
#> ..$ FIRSTCHD:List of 1
#> .. ..$ type: chr "none"
#> ..$ AGE :List of 1
#> .. ..$ type: chr "none"
#> ..$ DBP58 :List of 1
#> .. ..$ type: chr "none"
#> ..$ EDUYR :List of 1
#> .. ..$ type: chr "none"
#> ..$ CHOL58 :List of 1
#> .. ..$ type: chr "none"
#> ..$ CGT58 :List of 1
#> .. ..$ type: chr "none"
#> ..$ HT58 :List of 1
#> .. ..$ type: chr "none"
#> ..$ WT58 :List of 1
#> .. ..$ type: chr "none"
#> ..$ DAYOFWK :List of 2
#> .. ..$ type : chr "one"
#> .. ..$ value: num 9
#> ..$ VITAL10 :List of 1
#> .. ..$ type: chr "none"
#> ..$ FAMHXCVR:List of 1
#> .. ..$ type: chr "none"
#> ..$ CHD :List of 1
#> .. ..$ type: chr "none"
dat <- read.spss(file=sav, to.data.frame=TRUE)
str(dat) # now a data.frame
#> 'data.frame': 240 obs. of 13 variables:
#> $ CASEID : num 13 30 53 84 89 102 117 132 151 153 ...
#> $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#> $ AGE : num 40 49 43 50 43 50 45 47 53 49 ...
#> $ DBP58 : num 70 87 89 105 110 88 70 79 102 99 ...
#> $ EDUYR : num 16 11 12 8 NA 8 NA 9 12 14 ...
#> $ CHOL58 : num 321 246 262 275 301 261 212 372 216 251 ...
#> $ CGT58 : num 0 60 0 15 25 30 0 30 0 10 ...
#> $ HT58 : num 68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#> $ WT58 : num 190 204 162 152 148 142 196 193 172 162 ...
#> $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#> $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#> $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#> $ CHD : num 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#> ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
### Now we use an example file that is not very well structured and
### hence may need some special treatment with appropriate argument settings.
### Expect lots of warnings as value labels (corresponding to R factor labels) are incomplete,
### and an unsupported long string variable is present in the data
(sav <- system.file("files", "testdata.sav", package = "foreign"))
#> [1] "/tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/testdata.sav"
### Examples for add.undeclared.levels:
## add.undeclared.levels = "sort" (default):
x.sort <- read.spss(file=sav, to.data.frame = TRUE)
#> Warning: /tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/testdata.sav: Very long string record(s) found (record type 7, subtype 14), each will be imported in consecutive separate variables
#> Warning: Duplicated levels in factor factor_n_duplicated: A
#> Warning: Undeclared level(s) 2, 3, 4 added in variable: factor_n_undeclared
#> Warning: Undeclared level(s) 0, 3 added in variable: factor_n_undeclared2
#> Warning: Undeclared level(s) ä, ö added in variable: factor_s_duplicated
#> Warning: Duplicated levels in factor factor_s_duplicated: A
#> Warning: Undeclared level(s) perhaps added in variable: factor_s_undeclared
## add.undeclared.levels = "append":
x.append <- read.spss(file=sav, to.data.frame = TRUE,
add.undeclared.levels = "append")
#> Warning: /tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/testdata.sav: Very long string record(s) found (record type 7, subtype 14), each will be imported in consecutive separate variables
#> Warning: Duplicated levels in factor factor_n_duplicated: A
#> Warning: Undeclared level(s) 2, 3, 4 added in variable: factor_n_undeclared
#> Warning: Undeclared level(s) 0, 3 added in variable: factor_n_undeclared2
#> Warning: Undeclared level(s) ä, ö added in variable: factor_s_duplicated
#> Warning: Duplicated levels in factor factor_s_duplicated: A
#> Warning: Undeclared level(s) perhaps added in variable: factor_s_undeclared
## add.undeclared.levels = "no":
x.no <- read.spss(file=sav, to.data.frame = TRUE,
add.undeclared.levels = "no")
#> Warning: /tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/testdata.sav: Very long string record(s) found (record type 7, subtype 14), each will be imported in consecutive separate variables
#> Warning: Duplicated levels in factor factor_n_duplicated: A
levels(x.sort$factor_n_undeclared)
#> [1] "strongly disagree" "2" "3"
#> [4] "4" "strongly agree"
levels(x.append$factor_n_undeclared)
#> [1] "strongly disagree" "strongly agree" "2"
#> [4] "3" "4"
str(x.no$factor_n_undeclared)
#> num [1:5] 1 2 4 3 1
#> - attr(*, "value.labels")= Named num [1:2] 5 1
#> ..- attr(*, "names")= chr [1:2] "strongly agree" "strongly disagree"
### Examples for duplicated.value.labels:
## duplicated.value.labels = "append" (default)
x.append <- read.spss(file=sav, to.data.frame=TRUE)
#> Warning: /tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/testdata.sav: Very long string record(s) found (record type 7, subtype 14), each will be imported in consecutive separate variables
#> Warning: Duplicated levels in factor factor_n_duplicated: A
#> Warning: Undeclared level(s) 2, 3, 4 added in variable: factor_n_undeclared
#> Warning: Undeclared level(s) 0, 3 added in variable: factor_n_undeclared2
#> Warning: Undeclared level(s) ä, ö added in variable: factor_s_duplicated
#> Warning: Duplicated levels in factor factor_s_duplicated: A
#> Warning: Undeclared level(s) perhaps added in variable: factor_s_undeclared
## duplicated.value.labels = "condense"
x.condense <- read.spss(file=sav, to.data.frame=TRUE,
duplicated.value.labels = "condense")
#> Warning: /tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/testdata.sav: Very long string record(s) found (record type 7, subtype 14), each will be imported in consecutive separate variables
#> Warning: Duplicated levels in factor factor_n_duplicated: A
#> Warning: Undeclared level(s) 2, 3, 4 added in variable: factor_n_undeclared
#> Warning: Undeclared level(s) 0, 3 added in variable: factor_n_undeclared2
#> Warning: Undeclared level(s) ä, ö added in variable: factor_s_duplicated
#> Warning: Duplicated levels in factor factor_s_duplicated: A
#> Warning: Undeclared level(s) perhaps added in variable: factor_s_undeclared
levels(x.append$factor_n_duplicated)
#> [1] "A" "A_duplicated_2" "B"
levels(x.condense$factor_n_duplicated)
#> [1] "A" "B"
as.numeric(x.append$factor_n_duplicated)
#> [1] 1 1 2 NA 3
as.numeric(x.condense$factor_n_duplicated)
#> [1] 1 1 1 NA 2
## Long Strings (>255 chars) are imported in consecutive separate variables
## (see warning about subtype 14):
x <- read.spss(file=sav, to.data.frame=TRUE, stringsAsFactors=FALSE)
#> Warning: /tmp/Rtmpfbu7cb/temp_libpath4527d3584a036/foreign/files/testdata.sav: Very long string record(s) found (record type 7, subtype 14), each will be imported in consecutive separate variables
#> Warning: Duplicated levels in factor factor_n_duplicated: A
#> Warning: Undeclared level(s) 2, 3, 4 added in variable: factor_n_undeclared
#> Warning: Undeclared level(s) 0, 3 added in variable: factor_n_undeclared2
#> Warning: Undeclared level(s) ä, ö added in variable: factor_s_duplicated
#> Warning: Duplicated levels in factor factor_s_duplicated: A
#> Warning: Undeclared level(s) perhaps added in variable: factor_s_undeclared
cat.long.string <- function(x, w=70) cat(paste(strwrap(x, width=w), "\n"))
## first part: x$string_500:
cat.long.string(x$string_500)
#> A wonderful serenity has taken possession of my entire soul, like
#> these sweet mornings of spring which I enjoy with my whole heart. I
#> am alone, and feel the charm of existence in this spot, which was
#> created for the bliss of souls like mine. I am so happy
#>
#> Far far away, behind the word mountains, far from the countries
#> Vokalia and Consonantia, there live the blind texts. Separated they
#> live in Bookmarksgrove right at the coast of the Semantics, a large
#> language ocean. A small river named Duden flows by thei
#>
#> abc def ghi jkl mno pqrs tuv wxyz ABC DEF GHI JKL MNO PQRS TUV WXYZ
#> !"§ $%& /() =?* '<> #|; ²³~ @`´ ©«» ¤¼× {} abc def ghi jkl mno pqrs
#> tuv wxyz ABC DEF GHI JKL MNO PQRS TUV WXYZ !"§ $%& /() =?* '<> #|;
#> ²³~ @`´ ©«» ¤¼× {} abc def ghi j
## second part: x$STRIN0:
cat.long.string(x$STRIN0)
#> , my dear friend, so absorbed in the exquisite sense of mere tranquil
#> existence, that I neglect my talents. I should be incapable of
#> drawing a single stroke at the present moment; and yet I feel that I
#> never was a greater artist than now.
#>
#> r place and supplies it with the necessary regelialia. It is a
#> paradisematic country, in which roasted parts of sentences fly into
#> your mouth.
#>
#> kl mno pqrs tuv wxyz ABC DEF GHI JKL MNO PQRS TUV WXYZ !"§ $%& /()
#> =?* '<> #|; ²³~ @`´ ©«» ¤¼× {} abc def ghi jkl mno pqrs tuv wxyz ABC
#> DEF GHI JKL MNO PQRS TUV WXYZ !"§ $%& /() =?* '<> #|; ²³~ @`´ ©«» ¤¼×
#> {} abc def ghi jkl
## complete long string:
long.string <- apply(x[,c("string_500", "STRIN0")], 1, paste, collapse="")
cat.long.string(long.string)
#> A wonderful serenity has taken possession of my entire soul, like
#> these sweet mornings of spring which I enjoy with my whole heart. I
#> am alone, and feel the charm of existence in this spot, which was
#> created for the bliss of souls like mine. I am so happy, my dear
#> friend, so absorbed in the exquisite sense of mere tranquil
#> existence, that I neglect my talents. I should be incapable of
#> drawing a single stroke at the present moment; and yet I feel that I
#> never was a greater artist than now.
#>
#> Far far away, behind the word mountains, far from the countries
#> Vokalia and Consonantia, there live the blind texts. Separated they
#> live in Bookmarksgrove right at the coast of the Semantics, a large
#> language ocean. A small river named Duden flows by their place and
#> supplies it with the necessary regelialia. It is a paradisematic
#> country, in which roasted parts of sentences fly into your mouth.
#>
#> abc def ghi jkl mno pqrs tuv wxyz ABC DEF GHI JKL MNO PQRS TUV WXYZ
#> !"§ $%& /() =?* '<> #|; ²³~ @`´ ©«» ¤¼× {} abc def ghi jkl mno pqrs
#> tuv wxyz ABC DEF GHI JKL MNO PQRS TUV WXYZ !"§ $%& /() =?* '<> #|;
#> ²³~ @`´ ©«» ¤¼× {} abc def ghi jkl mno pqrs tuv wxyz ABC DEF GHI JKL
#> MNO PQRS TUV WXYZ !"§ $%& /() =?* '<> #|; ²³~ @`´ ©«» ¤¼× {} abc def
#> ghi jkl mno pqrs tuv wxyz ABC DEF GHI JKL MNO PQRS TUV WXYZ !"§ $%&
#> /() =?* '<> #|; ²³~ @`´ ©«» ¤¼× {} abc def ghi jkl