List, tag, and report duplicates in R like STATA

R
Data Wrangling
Author

Myo Minn Oo

Published

June 15, 2023

Modified

July 13, 2024

1 Replicate examples from UCLA’s STATA tutorial in R

Citation: HOW CAN I DETECT DUPLICATE OBSERVATIONS? | STATA FAQ. UCLA: Statistical Consulting Group. from https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/ (accessed June 15, 2023).

The UCLA tutorial uses the High School and Beyond dataset. The steps below introduce duplicates into the dataset and then detect them.

  1. Start with the High School and Beyond dataset, which initially has no duplicate observations.

    library(tidyverse)
    hsb2 <- 
        # load the dataset
        haven::read_dta("https://stats.idre.ucla.edu/stat/stata/notes/hsb2.dta") |>
        # select variables of interest
        select(id, female, ses, read, write, math) |> 
        # sort by id
        arrange(id)
  2. Copy the first five observations and add them back to the dataset as duplicates. Then change the math value in one of the copies so that it is no longer an exact duplicate.

    hsb2_mod <- 
        hsb2 |> 
        # take the first five observations
        slice(1:5) |> 
        # add duplicate observations
        bind_rows(hsb2) |> 
        mutate(math = ifelse(row_number() == 1, 84, math))
    # display the first few rows
    hsb2_mod
    # A tibble: 205 × 6
          id female     ses         read write  math
       <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl>
     1     1 1 [female] 1 [low]       34    44    84
     2     2 1 [female] 2 [middle]    39    41    33
     3     3 0 [male]   1 [low]       63    65    48
     4     4 1 [female] 1 [low]       44    50    41
     5     5 0 [male]   1 [low]       47    40    43
     6     1 1 [female] 1 [low]       34    44    40
     7     2 1 [female] 2 [middle]    39    41    33
     8     3 0 [male]   1 [low]       63    65    48
     9     4 1 [female] 1 [low]       44    50    41
    10     5 0 [male]   1 [low]       47    40    43
    # ℹ 195 more rows
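  3. After adding the copies, the dataset has 205 observations: 195 ids appear once and 5 ids appear twice. We can report and tag duplicates with the tag_duplicates() function from the mStats package. Across all variables, only four of the five copies are exact duplicates, because the math value changed in step 2 makes the two observations with id 1 differ. Judging from the output, .n_ numbers the copies within each group, .N_ gives the group size, and .dup_ flags the surplus copies.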

    hsb2_mod |> 
        # check duplicate report and status using an mStats function
        mutate(mStats::tag_duplicates(everything()))
    $ Report of duplicates
      in terms of all variables
     copies observations surplus
          1          197       0
          2            8       4
    # A tibble: 205 × 9
          id female     ses         read write  math   .n_   .N_ .dup_
       <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
     1     1 1 [female] 1 [low]       34    44    84     1     1 FALSE
     2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
     3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
     4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
     5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
     6     1 1 [female] 1 [low]       34    44    40     1     1 FALSE
     7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
     8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
     9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
    10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
    # ℹ 195 more rows
  4. Let’s check duplicates by id. Because the changed math value is ignored here, all five copies are flagged.

    hsb2_mod |> 
        # check duplicates by id
        mutate(mStats::tag_duplicates(id))
    $ Report of duplicates
      in terms of id
     copies observations surplus
          1          195       0
          2           10       5
    # A tibble: 205 × 9
          id female     ses         read write  math   .n_   .N_ .dup_
       <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
     1     1 1 [female] 1 [low]       34    44    84     1     2 FALSE
     2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
     3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
     4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
     5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
     6     1 1 [female] 1 [low]       34    44    40     2     2 TRUE 
     7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
     8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
     9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
    10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
    # ℹ 195 more rows
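
For readers who prefer not to depend on mStats, the tagging, reporting, and listing steps above can be approximated with dplyr and base R. The block below is a minimal sketch, not the mStats implementation: the toy data frame stands in for hsb2_mod, and the column names n_copy, n_total, and is_dup are hypothetical stand-ins for .n_, .N_, and .dup_.

```r
library(dplyr)

# Toy stand-in for hsb2_mod (values are illustrative): ids 1-3 appear twice,
# and math differs between the two rows with id 1.
toy <- tibble(
  id   = c(1, 2, 3, 1, 2, 3, 4, 5),
  read = c(34, 39, 63, 34, 39, 63, 44, 47),
  math = c(84, 33, 48, 40, 33, 48, 41, 43)
)

# Tag: mirror the .n_, .N_ and .dup_ columns with hypothetical names
tagged <- toy |>
  group_by(id) |>
  mutate(
    n_copy  = row_number(),  # position of the row within its id group
    n_total = n(),           # number of rows sharing the id
    is_dup  = n_copy > 1     # TRUE for every copy after the first
  ) |>
  ungroup()

# Report: one line per group size, in the style of a duplicates report
report <- tagged |>
  count(copies = n_total, name = "observations") |>
  mutate(surplus = observations - observations / copies)

# List: every row whose id occurs more than once
dup_by_id <- tagged |>
  filter(n_total > 1) |>
  arrange(id, n_copy)

# Exact duplicates across all variables, using base R
dup_all <- toy[duplicated(toy), ]

report
dup_by_id
dup_all
```

To tag on all variables with the grouped approach, `group_by(across(everything()))` can replace `group_by(id)`. In the toy data, that leaves the two rows with id 1 untagged because their math values differ, which mirrors what happens in step 3.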

Photo credit: Photo by Dids from Pexels

Citation

BibTeX citation:
@online{minn_oo2023,
  author = {Minn Oo, Myo},
  title = {Duplicates in {R}},
  date = {2023-06-15},
  url = {https://myominnoo.com/blog/2023-06-15-duplicates-R/},
  langid = {en}
}
For attribution, please cite this work as:
Minn Oo, Myo. 2023. “Duplicates in R.” June 15, 2023. https://myominnoo.com/blog/2023-06-15-duplicates-R/.