How to Find Duplicate Files in SharePoint Online Efficiently

Written By Mohit Jha
Approved By Anuraag Singh
Published On November 18th, 2024
Reading Time 7 Minutes

Keep your SharePoint Online environment uncluttered by finding and deleting duplicate files. But how do you actually do that? Don’t worry if you are unfamiliar with the process of finding duplicate files in SharePoint Online. This step-by-step guide walks you through it from start to finish. So, let’s begin.

SharePoint Online is so widely used today that duplicate files can creep into an environment in several ways. For instance, a connectivity issue during OneDrive sync may generate duplicate files. Reusing documents as templates without proper naming conventions can have the same result. These are common causes, but in your case the reason may be something else entirely.

Whatever the reason behind the duplicate files in SharePoint Online, you can deal with them using the methods discussed below. Let’s go through them in detail.

Method 1. Microsoft’s SharePoint Duplicate Analysis Tool (DeDup)

The SharePoint Duplicate Analysis Tool (DeDup) is a dedicated duplicate file finder for SharePoint Online. It is available on the Azure Marketplace. Keep in mind that it is a paid tool and requires some technical expertise to operate. The steps for using the DeDup tool are as follows.

Step 1. Download the DeDup tool to your machine, then log in with your Microsoft credentials.
Step 2. On the permissions screen, click the Accept button to continue.
Step 3. From the User menu, open Credentials and add new credentials.
Step 4. Open the User menu again and choose Add Sites.
Step 5. Load the SharePoint sites, then press the Rescan button to find duplicate files in SharePoint Online.
Step 6. Select the site to scan and track the progress information.
Step 7. Once the process is complete, view the duplicate files on the dashboard.
Step 8. Open the More Details option to get a detailed overview of the duplicate files.
Step 9. Click the Audit option to save the report generated by DeDup in Excel format for further analysis.

Method 2. Find Duplicate Files in SharePoint Online Manually

If you are looking for duplicate files in a small set of data, you can search manually. A manual search lets you identify duplicate files within a SharePoint document library without any technical expertise or extra tools. Open the document library you want to check, then compare the files by modified date, name, size, and so on.

You can also open a document to check whether its content is duplicated. Use the advanced filters as well to sort and narrow down the files by several criteria.

This approach is only suitable for a small set of data. On larger libraries it takes a lot of time, and the resulting frustration and confusion can lead to crucial files being deleted by mistake.

Method 3. List All the Duplicate Files Using PowerShell Script

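If you work with PowerShell and want to automate the task, you can use the script below. It connects to the site, computes a hash for every file in each document library, exports the results to a CSV file, and then lists the duplicates by hash, file name, and file size. Remember that the commands must be executed in sequence to get the expected results.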

#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$URLofSharePointSite = "enter here"   #Replace with the URL of your SharePoint Online site
$CSV_Loc = "C:\Temp\Duplicates.csv"
$Batch_Size = 1000

$DataCollection = @()

#Get credentials to connect
$AllCredt = Get-Credential

Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($URLofSharePointSite)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($AllCredt.UserName, $AllCredt.Password)

    #Get the Web and its lists
    $Web = $Ctx.Web
    $ListsContainer = $Web.Lists
    $Ctx.Load($Web)
    $Ctx.Load($ListsContainer)
    $Ctx.ExecuteQuery()

    ForEach($Li in $ListsContainer)
    {
        #Process only visible, non-empty document libraries and skip system libraries
        If($Li.BaseType -eq "DocumentLibrary" -and $Li.Hidden -eq $False -and $Li.ItemCount -gt 0 -and $Li.Title -Notin("Site Pages","Style Library", "Preservation Hold Library"))
        {
            #Paged CAML view (assumed: all folders, ordered by ID), $Batch_Size items per page
            $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
            $Query.ViewXml = "<View Scope='RecursiveAll'><Query><OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy></Query><RowLimit Paged='TRUE'>$Batch_Size</RowLimit></View>"

            $Cnt = 1

            Do {
                #Fetch the next page of items
                $LiItems = $Li.GetItems($Query)
                $Ctx.Load($LiItems)
                $Ctx.ExecuteQuery()

                ForEach($Item in $LiItems)
                {
                    #Filter Files
                    If($Item.FileSystemObjectType -eq "File")
                    {
                        $File = $Item.File
                        $Ctx.Load($File)
                        $Ctx.ExecuteQuery()
                        $Percent = [Math]::Min(100, ($Cnt / $Li.ItemCount * 100))
                        Write-Progress -PercentComplete $Percent -Activity "File Processing $Cnt of $($Li.ItemCount) in $($Li.Title) of $($Web.URL)" -Status "Scan Files '$($File.Name)'"

                        #Get The File Hash
                        $Bytes = $File.OpenBinaryStream()
                        $Ctx.ExecuteQuery()
                        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
                        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

                        #Collect data
                        $Data = New-Object PSObject
                        $Data | Add-Member -MemberType NoteProperty -Name "Name_ofFile" -Value $File.Name
                        $Data | Add-Member -MemberType NoteProperty -Name "File_Hash_Code" -Value $HashCode
                        $Data | Add-Member -MemberType NoteProperty -Name "URL_ofFile" -Value $File.ServerRelativeUrl
                        $Data | Add-Member -MemberType NoteProperty -Name "Size_ofFile" -Value $File.Length
                        $DataCollection += $Data
                    }
                    $Cnt++
                }
                #Update Position of the ListItemCollectionPosition
                $Query.ListItemCollectionPosition = $LiItems.ListItemCollectionPosition
            } While($Query.ListItemCollectionPosition -ne $null)
        }
    }

    #Export All Data to CSV
    $DataCollection | Export-Csv -Path $CSV_Loc -NoTypeInformation
    Write-Host -f Green "Exported to a CSV File $CSV_Loc"

    #Group Based on File Hash (property names must match the ones added above)
    $SharePointDuplicates = $DataCollection | Group-Object -Property File_Hash_Code | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Duplicates as per the Hashcode:"
    $SharePointDuplicates | Format-Table -AutoSize

    #Group Based on File Name
    $Duplicates_FileName = $DataCollection | Group-Object -Property Name_ofFile | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Duplicate Files in SharePoint Online according to the File Name:"
    $Duplicates_FileName | Format-Table -AutoSize

    #Group Based on File Size
    $Duplicates_FileSize = $DataCollection | Group-Object -Property Size_ofFile | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Duplicate Files as per the File Size:"
    $Duplicates_FileSize | Format-Table -AutoSize
}
Catch {
    Write-Host -f Red "Error:" $_.Exception.Message
}
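
The grouping step at the end of the script can also be tried on its own. The following is a minimal sketch, not part of the original script: it uses three made-up rows (hypothetical file names, hashes, and URLs) shaped like the columns the script writes to the CSV file, and groups them by the File_Hash_Code column. Any group with more than one member is a set of files with identical content.

```powershell
# Hypothetical sample rows with the same columns as the exported CSV file.
$rows = @(
    [PSCustomObject]@{ Name_ofFile = 'Report.docx';     File_Hash_Code = 'AA-11'; URL_ofFile = '/sites/demo/Docs/Report.docx';     Size_ofFile = 1024 }
    [PSCustomObject]@{ Name_ofFile = 'Report (1).docx'; File_Hash_Code = 'AA-11'; URL_ofFile = '/sites/demo/Docs/Report (1).docx'; Size_ofFile = 1024 }
    [PSCustomObject]@{ Name_ofFile = 'Budget.xlsx';     File_Hash_Code = 'BB-22'; URL_ofFile = '/sites/demo/Docs/Budget.xlsx';     Size_ofFile = 2048 }
)

# Group by hash; a group with more than one member is a set of duplicates.
# The @() wrapper keeps the result an array even when only one group matches.
$duplicateSets = @($rows | Group-Object -Property File_Hash_Code | Where-Object { $_.Count -gt 1 })

# Print each duplicate set with the URLs of its members.
foreach ($set in $duplicateSets) {
    Write-Output "Hash $($set.Name): $($set.Count) copies"
    $set.Group | ForEach-Object { Write-Output "  $($_.URL_ofFile)" }
}
```

To run the same grouping against real results, the sample rows could be replaced with Import-Csv -Path "C:\Temp\Duplicates.csv", assuming the script has already produced that file.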

Delete Duplicate Files in SharePoint Online

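After finding the duplicate files in SharePoint, run the script below to delete them. It is much faster than deleting duplicates manually. Note that the script works on a local folder (the path in $sourceLocation), so point it at a local copy of the library’s files. Within each set of identical files, it keeps the most recently modified file and deletes the rest.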

# Define the source path (a local folder that holds the files to check)
$sourceLocation = "C:\Temp\New"

# Get all files that share their size with at least one other file, newest first
$AllFiles = Get-ChildItem -Path $sourceLocation -File -Recurse | Sort-Object LastWriteTime -Descending | Group-Object -Property Length | Where-Object {$_.Count -gt 1}

# Group those files by their hash; groups with more than one file are duplicates
$AllduplicateFiles = @($AllFiles | Select-Object -ExpandProperty Group | Get-FileHash | Group-Object -Property Hash | Where-Object {$_.Count -gt 1})

# Delete the duplicate files, keeping the first (newest) file in each group
if ($AllduplicateFiles.Count -eq 0) {
    Write-Output "No duplicate files found."
} else {
    Write-Output "Deleting duplicate files:"
    $AllduplicateFiles | ForEach-Object {
        $AllfilesforDelete = $_.Group | Select-Object -Skip 1
        $AllfilesforDelete | ForEach-Object {
            Write-Output "Deleting: $($_.Path)"
            Remove-Item -LiteralPath $_.Path -Force
        }
    }
}
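
Because the deletion script removes files permanently, it can help to preview what it would delete first. The sketch below is one possible way to do that, not part of the original script: it builds a throwaway folder with hypothetical sample files (the $demoDir name and the file names are assumptions for illustration), applies the same size-then-hash check, and calls Remove-Item with the -WhatIf switch so that nothing is actually deleted.

```powershell
# Build a throwaway folder with two identical files and one different file (sample data only).
$demoDir = Join-Path ([System.IO.Path]::GetTempPath()) ("dedup-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $demoDir | Out-Null
Set-Content -Path (Join-Path $demoDir 'a.txt') -Value 'same content'
Set-Content -Path (Join-Path $demoDir 'b.txt') -Value 'same content'
Set-Content -Path (Join-Path $demoDir 'c.txt') -Value 'other content'

# Same two-stage check as the deletion script: compare sizes first, then hashes.
$sameSize = Get-ChildItem -Path $demoDir -File -Recurse |
    Sort-Object LastWriteTime -Descending |
    Group-Object -Property Length |
    Where-Object { $_.Count -gt 1 }

$duplicateGroups = @($sameSize |
    Select-Object -ExpandProperty Group |
    Get-FileHash |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 })

# Collect every file except the first one in each group.
$toDelete = @($duplicateGroups | ForEach-Object { $_.Group | Select-Object -Skip 1 })

# -WhatIf only reports what would be removed; no file is deleted here.
$toDelete | ForEach-Object { Remove-Item -LiteralPath $_.Path -WhatIf }

# Clean up the sample folder.
Remove-Item -LiteralPath $demoDir -Recurse -Force
```

Once the preview lists only the files you expect, the same pipeline can be pointed at the real folder and the -WhatIf switch removed.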

With the methods above, you can find duplicate files in SharePoint Online. However, deleting duplicates without first backing up your SharePoint Online data to local storage or the cloud can be unsafe. So, let’s see how to reduce the risk of data loss when you find and delete duplicate files in SharePoint Online.

Move SharePoint Online Data to Another SharePoint Online Account

Deleting duplicate files directly from SharePoint Online is risky, so you can use the Super Productive SharePoint Migrator to copy your SharePoint data to another account as a backup. If you ever lose your SharePoint data, this copy can be used to keep your organization’s workflow running smoothly.


The tool needs only the following quick steps to complete the task.

  • Download and run the tool.
  • Choose O365 as both the source and the destination platform.
  • Select Site and enter the platform details.
  • Add Users and Sites, then start the backup to the other SharePoint account.

Conclusion

In this write-up, we have explained the different methods to find duplicate files in SharePoint Online. You can use whichever method suits you best. Also, do not forget to back up your SharePoint Online data to another SharePoint account before you delete anything.


By Mohit Jha

Mohit is a writer, researcher, and editor. Cybersecurity and digital forensics are the two subjects that keep Mohit on the edge of his seat. He hopes that his well-researched, carefully thought-out articles will help people learn.