Duplicates in R

List, tag, and report duplicates in R, as in STATA

Data Wrangling

Author

Myo Minn Oo

Published

June 15, 2023

Modified

July 13, 2024

1 Replicate examples from UCLA’s STATA tutorial in R

Citation: HOW CAN I DETECT DUPLICATE OBSERVATIONS? | STATA FAQ. UCLA: Statistical Consulting Group. from https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/ (accessed June 15, 2023).

The UCLA tutorial uses the High School and Beyond (hsb2) dataset. The steps below reproduce how it introduces duplicates into that dataset.

Start with the High School and Beyond dataset, which initially has no duplicate observations.

library(tidyverse)
hsb2 <- 
    # load the dataset
    haven::read_dta("https://stats.idre.ucla.edu/stat/stata/notes/hsb2.dta") |>
    # select variables of interest
    select(id, female, ses, read, write, math) |> 
    # sort by id
    arrange(id)

Prepend copies of the first five observations to the dataset, then change the math score in the first copy (id 1) to 84 so that it is no longer an exact duplicate of its original.

hsb2_mod <- 
    hsb2 |> 
    # take the first five observations
    slice(1:5) |> 
    # add duplicate observations
    bind_rows(hsb2) |> 
    mutate(math = ifelse(row_number() == 1, 84, math))
# display the first few rows
hsb2_mod

# A tibble: 205 × 6
      id female     ses         read write  math
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl>
 1     1 1 [female] 1 [low]       34    44    84
 2     2 1 [female] 2 [middle]    39    41    33
 3     3 0 [male]   1 [low]       63    65    48
 4     4 1 [female] 1 [low]       44    50    41
 5     5 0 [male]   1 [low]       47    40    43
 6     1 1 [female] 1 [low]       34    44    40
 7     2 1 [female] 2 [middle]    39    41    33
 8     3 0 [male]   1 [low]       63    65    48
 9     4 1 [female] 1 [low]       44    50    41
10     5 0 [male]   1 [low]       47    40    43
# ℹ 195 more rows
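
As a quick sanity check on this kind of construction, the same steps can be tried on a tiny stand-in dataset and counted with base R's duplicated(). This is only a sketch: the toy values and the names toy, toy_mod, n_rows, n_exact_dups, and n_id_dups are made up for illustration and are not part of the UCLA tutorial.

```r
# Toy stand-in for hsb2 (hypothetical values): four unique observations
toy <- data.frame(
  id   = c(1, 2, 3, 4),
  read = c(34, 39, 63, 44),
  math = c(40, 33, 48, 41)
)

# Put copies of the first two rows on top, then alter math in the first copy
toy_mod <- rbind(toy[1:2, ], toy)
toy_mod$math[1] <- 84

n_rows       <- nrow(toy_mod)                # total observations
n_exact_dups <- sum(duplicated(toy_mod))     # surplus rows identical on every variable
n_id_dups    <- sum(duplicated(toy_mod$id))  # surplus rows that repeat an id

print(c(rows = n_rows, exact = n_exact_dups, by_id = n_id_dups))
```

The altered copy of id 1 still repeats an id but is no longer an exact duplicate, so the by-id count is one higher than the all-variables count.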

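After adding the duplicate observations, the dataset has 205 observations: 195 ids appear once and 5 ids appear twice. Because the math score was changed in one copy of id 1, only four of the five added rows are exact duplicates across all variables. We can use the tag_duplicates() function from the mStats package to report and tag them.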

hsb2_mod |> 
    # report and tag duplicates across all variables using an mStats function
    mutate(mStats::tag_duplicates(everything()))

$ Report of duplicates
  in terms of all variables
 copies observations surplus
      1          197       0
      2            8       4

# A tibble: 205 × 9
      id female     ses         read write  math   .n_   .N_ .dup_
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
 1     1 1 [female] 1 [low]       34    44    84     1     1 FALSE
 2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
 3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
 4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
 5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
 6     1 1 [female] 1 [low]       34    44    40     1     1 FALSE
 7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
 8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
 9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
# ℹ 195 more rows
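
To see where the copies, observations, and surplus columns of such a report come from, a similar table can be built by hand with dplyr. This is a minimal sketch on made-up toy data, not the mStats implementation; the names toy_mod, dup_report, and groups are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values): rows 2 and 4 are exact duplicates,
# rows 1 and 3 share an id but differ in math
toy_mod <- tibble(
  id   = c(1, 2, 1, 2, 3, 4),
  read = c(34, 39, 34, 39, 63, 44),
  math = c(84, 33, 40, 33, 48, 41)
)

dup_report <- toy_mod |>
  # how many copies exist of each distinct row
  group_by(across(everything())) |>
  summarise(copies = n(), .groups = "drop") |>
  # how many distinct rows share each copy count
  count(copies, name = "groups") |>
  mutate(
    observations = copies * groups,      # rows covered by this copy count
    surplus      = observations - groups # rows beyond the first of each group
  ) |>
  select(copies, observations, surplus)

dup_report
```

Grouping by id alone instead of across(everything()) would give the by-id version of the same table.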

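Now let’s check duplicates by id alone. Because only id is compared, all five added rows are flagged, including the altered copy of id 1.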

hsb2_mod |> 
    # check duplicates by id
    mutate(mStats::tag_duplicates(id))

$ Report of duplicates
  in terms of id
 copies observations surplus
      1          195       0
      2           10       5

# A tibble: 205 × 9
      id female     ses         read write  math   .n_   .N_ .dup_
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
 1     1 1 [female] 1 [low]       34    44    84     1     2 FALSE
 2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
 3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
 4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
 5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
 6     1 1 [female] 1 [low]       34    44    40     2     2 TRUE 
 7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
 8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
 9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
# ℹ 195 more rows
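
Judging from the output above, .n_ numbers the rows within each id, .N_ gives the size of each id group, and .dup_ marks every copy after the first. Assuming that reading is right, comparable tags can be produced with plain dplyr. The sketch below uses made-up toy data and hypothetical column names (n_in_group, group_size, is_dup); it is not how mStats computes them.

```r
library(dplyr)

# Toy data (hypothetical values): ids 1 and 2 each appear twice
toy_mod <- tibble(
  id   = c(1, 2, 1, 2, 3, 4),
  read = c(34, 39, 34, 39, 63, 44),
  math = c(84, 33, 40, 33, 48, 41)
)

tagged <- toy_mod |>
  group_by(id) |>
  mutate(
    n_in_group = row_number(),   # running index within each id (like .n_)
    group_size = n(),            # number of rows sharing the id (like .N_)
    is_dup     = n_in_group > 1  # TRUE for every copy after the first (like .dup_)
  ) |>
  ungroup()

# List every observation that belongs to a duplicated id
dups <- tagged |>
  filter(group_size > 1) |>
  arrange(id, n_in_group)

dups
```

Filtering on group_size > 1 lists both the first occurrence and its copies side by side, which makes it easy to spot rows such as id 1 where the copies disagree on math.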

Photo credit: Photo by Dids from Pexels

Citation

BibTeX citation:

@online{minn_oo2023,
  author = {Minn Oo, Myo},
  title = {Duplicates in {R}},
  date = {2023-06-15},
  url = {https://myominnoo.com/blog/2023-06-15-duplicates-R/},
  langid = {en}
}

For attribution, please cite this work as:

Minn Oo, Myo. 2023. “Duplicates in R.” June 15, 2023. https://myominnoo.com/blog/2023-06-15-duplicates-R/.
