Convert a column of unique but restricted IDs into a set of new IDs using a secure (SHA-2) hashing algorithm. Users have the option of saving a crosswalk between the old and new IDs in case observations need to be reidentified at a later date.

deid_dua(
  df,
  id_col = NULL,
  new_id_name = "id",
  id_length = 64,
  existing_crosswalk = NULL,
  write_crosswalk = FALSE,
  crosswalk_filename = NULL
)

Arguments

df

Data frame

id_col

Name of the column with IDs to be replaced. Defaults to NULL, in which case the value set by the id_column argument of the set_dua_level() function is used.

new_id_name

Name of the new hashed ID column, which must be different from the old name.

id_length

Length of new hashed ID; cannot be fewer than 12 characters (default is 64 characters).

existing_crosswalk

File name of an existing crosswalk. If an existing crosswalk is used, then new_id_name, id_length, and crosswalk_filename will be determined by that crosswalk. Arguments given for these values will be ignored.

write_crosswalk

Write crosswalk between old ID and new hash ID to console (unless crosswalk_filename is given a value).

crosswalk_filename

Name of the crosswalk file, including its path; defaults to a generic name with the current date (YYYYMMDD) appended.
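
The following is a minimal sketch of the general idea behind id_length: hash each ID with a SHA-2 algorithm and keep a prefix of the hex digest. It is not taken from the package source. It assumes the digest package is installed, the helper name hash_ids() is hypothetical, and no salt is used, so it will not reproduce the hashed values shown in the examples.

```r
library(digest)  # assumption: the 'digest' package is available

# Hypothetical helper: SHA-256 hash of each ID, truncated to `id_length`
# hex characters (a full SHA-256 hex digest is 64 characters long).
hash_ids <- function(ids, id_length = 64) {
  stopifnot(id_length >= 12, id_length <= 64)
  hashed <- vapply(
    as.character(ids),
    function(x) digest(x, algo = "sha256", serialize = FALSE),
    character(1),
    USE.NAMES = FALSE
  )
  substr(hashed, 1, id_length)
}

# Build a small crosswalk between old and new IDs
old_ids <- c("000-00-0001", "000-00-0002", "000-00-0003")
crosswalk <- data.frame(
  sid = old_ids,
  id = hash_ids(old_ids, id_length = 12),
  stringsAsFactors = FALSE
)
```

Shorter IDs are easier to read but raise the chance that two different inputs share the same prefix, which is one plausible reason for a lower bound such as 12 characters.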

Examples

## --------------
## Setup
## --------------
## set DUA crosswalk
dua_cw <- system.file('extdata', 'dua_cw.csv', package = 'duawranglr')
set_dua_cw(dua_cw)
#> -- duawranglr note -------------------------------------------------------------
#> DUA crosswalk has been set!

## read in data
admin <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')
df <- read_dua_file(admin)
## --------------

## show identified data
df
#> # A tibble: 9 x 10
#>   sid         sname    dob      gender raceeth tid   tname zip   mathscr readscr
#>   <chr>       <chr>    <chr>    <chr>  <chr>   <chr> <chr> <chr> <chr>   <chr>
#> 1 000-00-0001 Schaefer 19900114 0      2       1     Smith 22906 515     496
#> 2 000-00-0002 Hodges   19900225 0      1       1     Smith 22906 488     489
#> 3 000-00-0003 Kirby    19900305 0      4       1     Smith 22906 522     498
#> 4 000-00-0004 Estrada  19900419 0      3       1     Smith 22906 516     524
#> 5 000-00-0005 Nielsen  19900530 1      2       1     Smith 22906 483     509
#> 6 000-00-0006 Dean     19900621 1      1       2     Brown 22906 503     523
#> 7 000-00-0007 Hickman  19900712 1      1       2     Brown 22906 539     509
#> 8 000-00-0008 Bryant   19900826 0      2       2     Brown 22906 499     490
#> 9 000-00-0009 Lynch    19900902 1      3       2     Brown 22906 499     493

## deidentify
df <- deid_dua(df, id_col = 'sid', new_id_name = 'id', id_length = 12)

## show deidentified data
df
#> # A tibble: 9 x 10
#>   id           sname    dob     gender raceeth tid   tname zip   mathscr readscr
#>   <chr>        <chr>    <chr>   <chr>  <chr>   <chr> <chr> <chr> <chr>   <chr>
#> 1 a14856cc5d00 Schaefer 199001… 0      2       1     Smith 22906 515     496
#> 2 a141ce13114a Hodges   199002… 0      1       1     Smith 22906 488     489
#> 3 de520f632a2c Kirby    199003… 0      4       1     Smith 22906 522     498
#> 4 889d833f94ed Estrada  199004… 0      3       1     Smith 22906 516     524
#> 5 2993f4bda3cd Nielsen  199005… 1      2       1     Smith 22906 483     509
#> 6 86c8de9a8d63 Dean     199006… 1      1       2     Brown 22906 503     523
#> 7 cdb300787c0b Hickman  199007… 1      1       2     Brown 22906 539     509
#> 8 ef91ae029e71 Bryant   199008… 0      2       2     Brown 22906 499     490
#> 9 0fb2736cec2c Lynch    199009… 1      3       2     Brown 22906 499     493

if (FALSE) {
## save crosswalk between old and new ids for future
deid_dua(df, write_crosswalk = TRUE)

## use existing crosswalk (good for panel datasets that need link)
deid_dua(df, existing_crosswalk = './crosswalk/master_crosswalk.csv')
}
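
To illustrate why an existing crosswalk is useful for panel data, the base R sketch below applies a saved mapping to a later wave of data so the same person keeps the same hashed ID. It is only an assumption about the general approach, not the package's implementation. The wave2 data frame and its scores are made up for illustration, and the crosswalk rows reuse two ID pairs from the output above.

```r
# Hypothetical crosswalk, shaped like one saved from an earlier run
crosswalk <- data.frame(
  sid = c("000-00-0001", "000-00-0002"),
  id = c("a14856cc5d00", "a141ce13114a"),
  stringsAsFactors = FALSE
)

# Hypothetical second wave of data for the same students (different row order)
wave2 <- data.frame(
  sid = c("000-00-0002", "000-00-0001"),
  mathscr = c(501, 520),
  stringsAsFactors = FALSE
)

# Look up each restricted ID in the crosswalk, then drop the restricted column
wave2$id <- crosswalk$id[match(wave2$sid, crosswalk$sid)]
wave2$sid <- NULL
```

In this sketch an ID that is missing from the crosswalk would come back as NA; how deid_dua() treats such IDs is not described here.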