Parsing data for a nonprofit that oversees college athletics

The Problem:

An Indianapolis-based nonprofit that regulates student athletes receives student transcripts from high schools around the country. The transcripts contain the required information, but in thousands of different formats. As a result, employees had to enter the documents into a database manually, a time-consuming and error-prone process that cost the organization hundreds of man-hours per week. Other local companies had been tasked with parsing the data but were ultimately unsuccessful.

The Goal: Parsing the Data

Systematically identify courses and grades across academic years, regardless of transcript format, and parse the data from all transcripts into a standard format that can eventually be entered into the client’s transcript system. This reduces manual time and cost.


How We Solved It:

  • RoboSource used a text extraction library to pinpoint the x and y coordinates for course name, course ID, and course grade.
  • We then captured the coordinates of each data point within each document format.
  • While parsing the data, we also pre-scrubbed it before loading it into the database. Until then, those steps had been done manually.
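As a rough illustration of the coordinate-based approach described above, here is a minimal Python sketch. It is not RoboSource's actual code: the template layout, the x-ranges, the field names, the row tolerance, and the sample words are all hypothetical. It assumes a text extraction library has already returned each word with its x and y position on the page.

```python
from collections import defaultdict

# Hypothetical template for ONE transcript format: each field maps to the
# horizontal (x) range in which that column is printed on the page.
TEMPLATE = {
    "course_id": (50, 120),
    "course_name": (120, 400),
    "grade": (400, 460),
}


def scrub(record):
    """Normalize a raw record before it is loaded into a database."""
    return {
        "course_id": record.get("course_id", "").strip().upper(),
        "course_name": " ".join(record.get("course_name", "").split()),
        "grade": record.get("grade", "").strip().upper(),
    }


def parse_rows(words, template, row_tolerance=3.0):
    """Turn (x, y, text) words into course records using a column template."""
    rows = []  # each entry: (y of first word in the row, {field: [(x, text)]})
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        # Find which column, if any, this word falls into.
        field = next(
            (name for name, (lo, hi) in template.items() if lo <= x < hi),
            None,
        )
        if field is None:
            continue  # outside every column, e.g. a page header
        # Start a new row when the word is vertically far from the current row.
        if not rows or abs(y - rows[-1][0]) > row_tolerance:
            rows.append((y, defaultdict(list)))
        rows[-1][1][field].append((x, text))

    records = []
    for _, fields in rows:
        # Join the words of each field in left-to-right order.
        record = {
            name: " ".join(text for _, text in sorted(parts))
            for name, parts in fields.items()
        }
        # Keep only rows that look like a course line.
        if "course_name" in record and "grade" in record:
            records.append(scrub(record))
    return records


# Made-up sample input: (x, y, text) for each word on the page.
SAMPLE_WORDS = [
    (500, 20.0, "Page"),
    (60, 100.0, "MATH101"),
    (130, 100.5, "Algebra"),
    (185, 100.2, "I"),
    (410, 100.0, " a- "),
    (60, 115.0, "eng201"),
    (130, 115.0, "English"),
    (190, 115.0, "10"),
    (410, 115.0, "B+"),
]

for course in parse_rows(SAMPLE_WORDS, TEMPLATE):
    print(course)
```

In this sketch, supporting another transcript format means adding another template rather than changing the parsing logic, which is one plausible way to handle thousands of layouts.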


  • RoboSource demonstrated the ability to automate the process of identifying and capturing transcript data across thousands of formats. Phase 2 of this project will save the company hundreds of thousands of dollars annually.
  • We are also beginning to apply machine learning trained on previous transcripts, so that when the algorithm recognizes a format it assigns the transcript to the correct template.
  • Future steps are to build an API to the database and to use Optical Character Recognition (OCR) to capture information from scanned transcripts and images.


