RoboSource Speciality: Data Cleansing Via Machine Learning

 

The Problem

Our client is a nonprofit that has developed STEM curricula since 1997 for US schools, from pre-K through high school, in all 50 states. The nonprofit’s data system held 5.5 million student records, including a massive number of duplicates. Many other student records had been lost or disconnected. At the time, the nonprofit had no efficient way to clear out duplicate records and reconnect lost student data. It needed data cleansing, and RoboSource delivered it through machine learning.

The Goal: Data Cleansing

Our goal was two-fold:

  1. Find ways to connect disconnected student records by looking through old database backups.
  2. Train a machine-learning algorithm on sample data so that it learns to recognize duplicates. Then run the algorithm on the entire data set to identify duplicates in the system.

How We Solved It

The RoboSource team was given data backups and reconnected the majority of the disconnected student records, using Python code to populate a new Postgres database. Then, using the Dedupe.io library, RoboSource manually trained a machine-learning algorithm to identify duplicates. The algorithm learned which fields were important for identifying duplicates and then applied that knowledge to the rest of the data. Half a million student records were identified as duplicates with 95% accuracy and were automatically merged. The RoboSource team then worked on a way to re-run this process on a regular basis so that new student records continue to be de-duped.
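
To give a feel for what "learning which fields matter" can look like, here is a minimal Python sketch. It is not RoboSource's code and it does not use the Dedupe.io API. As a stand-in, it fits a small logistic regression over per-field string similarities using only the standard library. The field names, sample records, and labels are all hypothetical, made up for illustration.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names; the real student schema is not described here.
FIELDS = ("first_name", "last_name", "birth_date", "school")


def field_similarities(a, b):
    """Return one 0..1 string-similarity score per field for two records."""
    return [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in FIELDS]


def score(similarities, weights, bias):
    """Logistic model: weighted sum of field similarities squashed to 0..1."""
    z = bias + sum(w * x for w, x in zip(weights, similarities))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled_pairs, epochs=1000, learning_rate=0.5):
    """Learn one weight per field from hand-labeled (record, record, 0 or 1) pairs."""
    examples = [(field_similarities(a, b), label) for a, b, label in labeled_pairs]
    weights, bias = [0.0] * len(FIELDS), 0.0
    for _ in range(epochs):
        for sims, label in examples:
            error = score(sims, weights, bias) - label  # gradient of the log loss
            weights = [w - learning_rate * error * x for w, x in zip(weights, sims)]
            bias -= learning_rate * error
    return weights, bias


def find_duplicates(records, weights, bias, threshold=0.5):
    """Compare every pair of records (fine for a demo, too slow for millions)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if score(field_similarities(a, b), weights, bias) >= threshold
    ]


# Made-up sample records: 0/1 and 2/3 are intended to be the same students.
records = [
    {"first_name": "Maria", "last_name": "Lopez", "birth_date": "2008-03-14", "school": "Lincoln Elementary"},
    {"first_name": "Maria", "last_name": "Lopes", "birth_date": "2008-03-14", "school": "Lincoln Elementary"},
    {"first_name": "James", "last_name": "Carter", "birth_date": "2007-11-02", "school": "Roosevelt Middle"},
    {"first_name": "Jim", "last_name": "Carter", "birth_date": "2007-11-02", "school": "Roosevelt Middle"},
    {"first_name": "Aisha", "last_name": "Khan", "birth_date": "2009-06-30", "school": "Lincoln Elementary"},
]

# The "manual training" step: a person labels a few pairs as duplicate (1) or not (0).
labeled_sample = [
    (records[0], records[1], 1),
    (records[2], records[3], 1),
    (records[0], records[2], 0),
    (records[0], records[4], 0),
    (records[2], records[4], 0),
    (records[1], records[3], 0),
]

weights, bias = train(labeled_sample)
duplicate_pairs = find_duplicates(records, weights, bias)
print(dict(zip(FIELDS, weights)))  # learned importance of each field
print(duplicate_pairs)             # index pairs flagged as likely duplicates
```

The sketch mirrors the two-step flow described above: label a small sample, then apply what was learned to the whole set. Comparing every pair of records does not scale to 5.5 million rows, so a production approach would first narrow the candidates to plausible pairs before scoring them.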


Brett Ridoux

The RoboSource team can craft all kinds of custom software, but one of our sweet spots is big data cleanup and data analytics. Learn more about our sweet spots on our Solutions Page.

If you aren’t getting the information you need out of your data, email Brett Ridoux at brett.ridoux@robosource.us for a free consultation.