Duplicates in R
List, tag, and report duplicates in R, as in STATA
1 Replicating examples from UCLA’s STATA tutorial in R
Citation: HOW CAN I DETECT DUPLICATE OBSERVATIONS? | STATA FAQ. UCLA: Statistical Consulting Group. from https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/ (accessed June 15, 2023).
The tutorial on the website used the High School and Beyond dataset. Here are the steps taken to introduce duplicates to the dataset.
Start with the High School and Beyond dataset, which initially has no duplicate observations.
Append copies of five observations (ids 1 to 5) to the dataset, then change a value in one of the duplicated observations (here, the math score in one of the two id 1 rows).
# A tibble: 205 × 6
      id female     ses         read write  math
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl>
 1     1 1 [female] 1 [low]       34    44    84
 2     2 1 [female] 2 [middle]    39    41    33
 3     3 0 [male]   1 [low]       63    65    48
 4     4 1 [female] 1 [low]       44    50    41
 5     5 0 [male]   1 [low]       47    40    43
 6     1 1 [female] 1 [low]       34    44    40
 7     2 1 [female] 2 [middle]    39    41    33
 8     3 0 [male]   1 [low]       63    65    48
 9     4 1 [female] 1 [low]       44    50    41
10     5 0 [male]   1 [low]       47    40    43
# ℹ 195 more rows
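A minimal base R sketch of the two steps, not the post's original code. It assumes a toy data frame holding only the five observations visible in the output, with the value labels dropped; the names hsb and hsb_dup are made up for illustration.

```r
# Toy stand-in for five High School and Beyond observations
# (values copied from the printed output; labels dropped, an assumption).
hsb <- data.frame(
  id     = 1:5,
  female = c(1, 1, 0, 1, 0),
  ses    = c(1, 2, 1, 1, 1),
  read   = c(34, 39, 63, 44, 47),
  write  = c(44, 41, 65, 50, 40),
  math   = c(40, 33, 48, 41, 43)
)

# Step 1: append copies of the five observations.
hsb_dup <- rbind(hsb, hsb[1:5, ])
rownames(hsb_dup) <- NULL

# Step 2: change one value in one of the duplicated observations,
# so that the two id 1 rows no longer match on every variable.
hsb_dup$math[1] <- 84

print(hsb_dup)
```

On this toy data, duplicated(hsb_dup) flags four rows and duplicated(hsb_dup$id) flags five, which is the difference the two reports further down bring out.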
After adding the duplicate observations, the dataset has 205 rows: 195 ids occur once and 5 ids occur twice. We can tag and report the duplicates with the tag_duplicates() function from the mStats package.
$ Report of duplicates in terms of all variables
 copies observations surplus
      1          197       0
      2            8       4
# A tibble: 205 × 9
      id female     ses         read write  math   .n_   .N_ .dup_
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
 1     1 1 [female] 1 [low]       34    44    84     1     1 FALSE
 2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
 3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
 4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
 5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
 6     1 1 [female] 1 [low]       34    44    40     1     1 FALSE
 7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE
 8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE
 9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE
10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE
# ℹ 195 more rows
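To make the three added columns concrete, here is a base R sketch that derives them by hand on a ten-row toy version of the data. It is not how tag_duplicates() is implemented and makes no claim about that function's arguments. The readings of .n_ (running count within a group of identical rows), .N_ (size of that group) and .dup_ (TRUE for every copy after the first) are inferred from the output above.

```r
# Toy data: the ten rows shown above, value labels dropped (an assumption).
hsb_dup <- data.frame(
  id     = c(1:5, 1:5),
  female = rep(c(1, 1, 0, 1, 0), 2),
  ses    = rep(c(1, 2, 1, 1, 1), 2),
  read   = rep(c(34, 39, 63, 44, 47), 2),
  write  = rep(c(44, 41, 65, 50, 40), 2),
  math   = c(84, 33, 48, 41, 43, 40, 33, 48, 41, 43)
)

# One key per row built from all variables; identical rows share a key.
key <- do.call(paste, c(unname(as.list(hsb_dup)), sep = "\r"))
idx <- seq_along(key)

tagged <- hsb_dup
tagged$.n_   <- ave(idx, key, FUN = seq_along)  # position within its group
tagged$.N_   <- ave(idx, key, FUN = length)     # number of rows in its group
tagged$.dup_ <- tagged$.n_ > 1                  # TRUE for surplus copies only

print(tagged)
```

The two id 1 rows differ in math, so each forms its own group of size one and neither is flagged, which matches rows 1 and 6 of the output above.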
Let’s check duplicates by id.
$ Report of duplicates in terms of id
 copies observations surplus
      1          195       0
      2           10       5
# A tibble: 205 × 9
      id female     ses         read write  math   .n_   .N_ .dup_
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
 1     1 1 [female] 1 [low]       34    44    84     1     2 FALSE
 2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
 3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
 4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
 5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
 6     1 1 [female] 1 [low]       34    44    40     2     2 TRUE
 7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE
 8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE
 9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE
10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE
# ℹ 195 more rows
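The copies / observations / surplus report can be rebuilt by hand as well. The sketch below is base R on the ten toy rows, not the mStats implementation, and the name report is made up for illustration. It counts how many times each id occurs and summarises those counts. Because the toy data holds only the duplicated ids, it reproduces the copies = 2 row of the report and has no copies = 1 row.

```r
# Toy data: only the columns needed here, taken from the ten rows shown above.
hsb_dup <- data.frame(
  id   = c(1:5, 1:5),
  math = c(84, 33, 48, 41, 43, 40, 33, 48, 41, 43)
)

# Number of rows sharing each row's id (the "copies" of that observation).
copies <- ave(seq_along(hsb_dup$id), hsb_dup$id, FUN = length)

# TRUE for every occurrence of an id after the first one.
surplus <- duplicated(hsb_dup$id)

report <- data.frame(
  copies       = sort(unique(copies)),
  observations = as.vector(table(copies)),          # rows per copies value
  surplus      = as.vector(tapply(surplus, copies, sum))  # extra rows
)

print(report)
```

Keying on id alone ignores the changed math score, so the two id 1 rows count as copies of each other here, unlike in the all-variables check.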
Photo credit: Photo by Dids from Pexels
Citation
@online{minn_oo2023,
author = {Minn Oo, Myo},
title = {Duplicates in {R}},
date = {2023-06-15},
url = {https://myominnoo.com/blog/2023-06-15-duplicates-R/},
langid = {en}
}