Identifying specific differences between two datasets in RStudio

Harsh Ruchir Edirisinghe
6 min read · Sep 21, 2021

At times, data migration processes may harm the quality of data. Fear not!

A quick data comparison can be completed in a few minutes, enabling anyone to identify lapses in the migrated dataset.

Pre-requisite - A basic understanding of how to use RStudio.


Overview:

The provided code can be used to identify any changes between multiple migrated files and their corresponding source files.

The changes that can be observed:

  1. Missing rows/columns
  2. Rows/columns which shouldn’t exist
  3. Incorrect values in any row
  4. Incorrect column headers
  5. Incorrect ordering of rows & columns

The code works with csv files throughout the whole procedure. Three files will be produced for every file to be checked. An html file gives a user-friendly summary of the inconsistencies in the data. The second is a csv file that gives the user direct access to the changes. The third is the patched-up file (the file that has incorporated the adjustments needed to correct the errors).
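The comparison itself is built on the daff package. As a minimal sketch of how the three outputs come about (the two toy data frames here are made up for illustration, not taken from the article's files):

```r
# Minimal sketch of the three outputs, using the daff package.
# The two toy data frames below are made-up illustrations.
library(daff)

source_df   <- data.frame(id = 1:3, title = c("A", "B", "C"))
migrated_df <- data.frame(id = c(1, 3), title = c("A", "Z"))

d <- diff_data(migrated_df, source_df,
               show_unchanged = TRUE, show_unchanged_columns = TRUE)

render_diff(d, file = "changes.html", view = FALSE)  # output 1: html summary
write_diff(d, file = "changes.csv")                  # output 2: changes as a csv
patched <- patch_data(migrated_df, d)                # output 3: patched-up data
```

Applying the patch to the migrated data frame reconstructs the source, which is exactly what the patched-up files in the script contain.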

A final file with the scores will be produced, giving the percentage of rows in each file that have not undergone any significant adjustments. Note that if a column is missing, the score becomes zero, since every row will then need an alteration.

Simple ground rules before beginning the exercise:

1. To run any piece of code, press Ctrl + Enter at the same time

2. (Only once on a PC) Install the below packages by copying & pasting the code into your workspace & running it:

install.packages("daff")

install.packages("readxl")

install.packages("htmltools")

3. If you accidentally delete or change any code (and can't undo it), please redownload the files.

4. Prepare the required files: the new and old files, and a mapping file. A mapping file is a table that lists the old and new names used in a file.

5. Close any files which would be used by the tool.

Steps:

1. Label each variable below (The definitions have been provided below).

Use quotation marks around each name.

Definitions-

Migrated_Files_Dir -> the location of the folder which contains the csv files that would be checked

Example of a source directory = "C:\\Users\\C01234\\OneDrive\\Documents\\CSV Files"

Source_Files_Dir -> the location of the folder which contains the csv files that would be used as the benchmark for the test against its corresponding files in “Migrated_Files_Dir”

The_Other_Script -> the location of the other downloaded R file ("Final_Main_Script_V1.R")

(Please note that this variable's path includes the file name)

Example = The_Other_Script<-source("C:\\Users\\C01234\\OneDrive\\Documents\\Final_Main_Script_V1.R")

Data_Assessment_Save_Loc -> The location of the folder which would store the above-mentioned html files.

Detailed_Changes_Save_Loc -> The location of the folder which would store the files that will show the detailed changes for each file.

Patched_Files_Save_Loc -> The location of the folder which would store the patched-up files.

Scores_Save_Loc -> The location of the folder which would store a csv file containing the accuracy scores of the data migration process for each file.

Mapping_Table_Loc -> The location of the mapping file, which is used for changing table names, column names and any other value conversions. (Please note that this path also includes the file name.) (An example is provided below; its column names must be used exactly as shown.)

# Defining the variables
Migrated_Files_Dir<-"C:/Users/New_Docs/Dummy Files/Migrated_Files(BAD_FILES)"
Source_Files_Dir<-"C:/Users/New_Docs/Dummy Files/Source_Files(GOOD_FILES)"
Data_Assessment_Save_Loc<-"C:/Users/New_Docs/Dummy Files"
Detailed_Changes_Save_Loc<-"C:/Users/New_Docs/Dummy Files"
Patched_Files_Save_Loc<-"C:/Users/New_Docs/Dummy Files"
Scores_Save_Loc<-"C:/Users/New_Docs/Dummy Files"
Mapping_Table_Loc<-"C:/Users/New_Docs/Dummy Files/Mapping_Folder/Mapping.csv"

We will use some Netflix data as shown below.

The source file.

Now, we have an incorrectly migrated table.

The migrated file.

Now, we look at our prepared mapping file. It highlights information about the old and new file names, column names and any other data.

Mapping data.
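In case the image does not render: a hypothetical sketch of what such a mapping file could contain. The column headers Previous_Name and New_Name are the ones the script reads (via Mapping_Table$Previous_Name and Mapping_Table$New_Name); the file and column names themselves are invented examples:

```csv
Previous_Name,New_Name
netflix_titles.csv,netflix_shows.csv
show_id,id
release_year,year_released
```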

Save the below code in another script (save as "Final_Main_Script_V1.R") to be referenced later.

#### START ####
library(daff)
library(htmltools)
library(readxl)
# Identify the migrated files & its corresponding source files
Migrated_Files <- list.files(path = Migrated_Files_Dir, recursive = TRUE)
Source_Files <- list.files(path = Source_Files_Dir, recursive = TRUE)
# Import the files
setwd(Migrated_Files_Dir)
Migrated_Files_Bundle<-lapply(Migrated_Files,function(i){
read.csv(i,header=TRUE,fileEncoding="UTF-8-BOM")
})
names(Migrated_Files_Bundle)<-Migrated_Files
setwd(Source_Files_Dir)
Source_Files_Bundle<-lapply(Source_Files,function(i){
read.csv(i,header=TRUE,fileEncoding="UTF-8-BOM")
})
names(Source_Files_Bundle)<-Source_Files

#Mapping Procedure
# Import the mapping table
Mapping_Table<-read.csv(Mapping_Table_Loc,fileEncoding="UTF-8-BOM")
#### Conduct mapping procedure ####
# Convert mapping table
Mapping_previous_names<-as.list(Mapping_Table$Previous_Name)
Mapping_new_names<-as.list(Mapping_Table$New_Name)
New_Migrated_Files_Bundle<-Migrated_Files_Bundle
# Mapping the table names
k=1
for(i in names(Migrated_Files_Bundle)){
s<-1
for (j in Mapping_new_names){

if(i==j){

names(New_Migrated_Files_Bundle)[k]<-Mapping_previous_names[s]
break

}
s<-s+1

}
k<-k+1
}
#Import list objects into environments
my.env<-new.env()
list2env(New_Migrated_Files_Bundle,envir = my.env)
# Remove unneeded objects
rm(Migrated_Files_Bundle)
rm(New_Migrated_Files_Bundle)
rm(Source_Files_Bundle)
rm(Mapping_Table)
rm(k)
rm(s)
rm(i)
rm(j)
#Mapping the column names
k<-1
m<-1
s<-1
for(i in names(my.env)){


for(i in colnames(my.env[[names(my.env)[k]]])){


for( j in Mapping_new_names){

if(colnames(my.env[[names(my.env)[k]]])[m]==Mapping_new_names[[s]]){

colnames(my.env[[names(my.env)[k]]])[m]<-Mapping_previous_names[[s]]
break

}
s<-s+1

}
s<-1
m<-m+1

}
m<-1
k<-k+1

}
# Mapping the remainder
k<-1
s<-1
n<-1
f<-1
for(i in names(my.env)){

for(a in colnames(my.env[[names(my.env)[k]]])){

for(b in my.env[[names(my.env)[k]]][,n]){

for( j in Mapping_new_names){

if(!is.na(my.env[[names(my.env)[k]]][,n][f])){

if(my.env[[names(my.env)[k]]][,n][f]==Mapping_new_names[[s]]){

my.env[[names(my.env)[k]]][,n][f]<-Mapping_previous_names[[s]]
break

}
s<-s+1
}

}
s<-1
f<-f+1

}
s<-1
f<-1
n<-n+1
}
k<-k+1
s<-1
n<-1
f<-1
}
#########################################################################################################
# Compute & save multiple data assessment files
for(i in names(my.env)){
for( j in Source_Files){
if(i == j){
write.table(render_diff(diff_data(my.env[[i]],read.csv(paste(Source_Files_Dir,"\\",j,sep=""),fileEncoding="UTF-8-BOM"),Source_Files[Source_Files==j],
show_unchanged = TRUE,show_unchanged_columns = TRUE),title =paste("Changes_in_file",i) ),file =
paste(Data_Assessment_Save_Loc,"/Changes_in_",i,".html",sep = "")
,quote = FALSE,col.names = FALSE,
row.names = FALSE)

}
}
}
# Produce csv files that allow a user to browse through the changes
list_of_detailed_changes_files<-list()
z<-1
for( i in names(my.env)){
for (j in Source_Files){

if(i == j){
detailed_changes<-diff_data(my.env[[i]],read.csv(paste(Source_Files_Dir,"\\",j,sep=""),fileEncoding="UTF-8-BOM"),show_unchanged = TRUE,show_unchanged_columns = TRUE)

write_diff(detailed_changes,file=paste(Detailed_Changes_Save_Loc,"\\New_",i,sep=""))

x<-read.csv(file=paste(Detailed_Changes_Save_Loc,"/New_",i,sep=""),fill = TRUE)

list_of_detailed_changes_files[[z]]<-x
names(list_of_detailed_changes_files)[z]<-i
z<-z+1



}
}
}
# Produce patched-up files (the corrected version of the file)
for(i in names(my.env)){
for(j in Source_Files){

if(i==j){

patch<-diff_data(my.env[[i]],read.csv(paste(Source_Files_Dir,"\\",j,sep=""),fileEncoding="UTF-8-BOM"),show_unchanged = TRUE,show_unchanged_columns = TRUE)

patched_file<-patch_data(my.env[[i]],patch)

write.csv(patched_file,file = paste(Patched_Files_Save_Loc,"/Patched_",i,sep = ""),row.names = FALSE)

}

}

}
# Accuracy scores based on the number of unaltered rows/records
list_of_scores<-list()
g<-0
for(i in list_of_detailed_changes_files){


score<-((nrow(subset(as.data.frame(i),as.data.frame(i)[,2]==""))/((nrow(as.data.frame(i)))-1)*100))

g<-g+1

list_of_scores=append(list_of_scores,paste("The percentage of rows that have remained nearly unchanged for",names(list_of_detailed_changes_files[g]),"is", round(score,2),"%" ))



}
write.table(list_of_scores,file=paste(Scores_Save_Loc,"/Scores_Migrated.csv",sep=""),row.names = FALSE,col.names = FALSE,sep="\n")
#### END ####

Congratulations! You’re almost done. Now update and execute the last line of code.

#The other script's location should be specified in the below line
The_Other_Script<-source("C:\\Users\\New_Docs\\Final_Main_Script_V1.R")

You will receive an output outlining all the differences between the two chosen files.

The first final output, as an html file.

The first output will provide a detailed understanding of the differences between the two tables. Cells highlighted in blue indicate values that require modification. Cells highlighted in red indicate records to be deleted from the table. Cells highlighted in green indicate records to be added. Cells highlighted in yellow indicate records that should be reordered.

The second output contains the same information as the html file, but as a csv file for easier manipulation of the data.

The third output is a patched-up file in which the necessary adjustments have been made to arrive at the source file.

The last output is a file that provides a score of how well each migrated file performed during the assessment.
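To illustrate the scoring arithmetic with made-up numbers: the script treats a record as unchanged when its action column (the second column of the detailed-changes file) is blank, and divides by the row count minus one to skip the marker row. So for a hypothetical file of 10 records where 8 survived migration untouched:

```r
# Hypothetical numbers for illustration only.
rows_in_changes_file <- 11   # 1 marker row + 10 records
unchanged_records    <- 8    # records whose action column is blank
score <- unchanged_records / (rows_in_changes_file - 1) * 100
round(score, 2)              # 80
```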

You’re all set !


References: https://github.com/edwindj/daff
