Natural Language Processing & Machine Learning in Higher Education

9 min read · Jan 22, 2024

Natural language processing and machine learning give higher education unique opportunities to draw insights from structured and unstructured data.

In this article, we will discuss how MentorMate and our partner eLumen leveraged natural language processing (NLP) and machine learning (ML) for data-driven decision-making to tame the curriculum beast in higher education. Here, we will primarily focus on drawing insights from structured and unstructured (text) data.

Data Challenges in Higher Education

To begin with, let’s look at the problem in the domain. Some of the challenges that higher education institutions and their data analytics partners face are customized quality standards, the unstructured data hassle, and the need for more insights.

Customized Quality Standards

Quality standards in higher education are often self-imposed. There’s no common core but rather a commitment to quality, which means institutions must identify and establish their own standards. This process thus relies much more heavily on local knowledge and faculty practice than on institutional legibility.

The Unstructured Data Hassle

Most structured data (grades or credits) depends heavily on unstructured data (learning outcomes, program rules, the syllabus) for meaning. For instance, saying that a student has a score of 3.2 means little on its own. But saying that they are a computer science major focusing on cybersecurity who has attained three certifications carries a lot of meaning. However, most of that meaning lives in text that must be aligned to be useful.

Lacking Insights

Initiatives ranging from workforce alignment to transparent advising and tutoring are held back by the lack of insights from these text-bound rules, standards, and alignments. Our work uncovers those insights and evolves them on campuses.

The Ecosystem Today

Look at the learning ecosystem today: a wide variety of Enterprise Resource Planning (ERP) systems and Learning Management Systems (LMS), such as Canvas, Workday, and Banner, are deployed on campus.

What most of these tools have in common is that very little data is interchanged between them, and lots of data (mainly qualitative) is lost. The Learning Tools Interoperability (LTI) standard’s only callback to the LMS sends a value between zero and one for an assignment (for example, 0.86 for 86%). This is very anemic, but these systems were built in an era when all they had was a relational database, typically before there was even reliable networking.
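To make the point concrete, here is a minimal sketch (in Python, with a hypothetical function name) of how little survives that single callback: the entire assignment collapses to one normalized number.

```python
def to_lti_score(points_earned: float, points_possible: float) -> float:
    """Collapse an assignment result into the single 0..1 value
    an LTI Basic Outcomes callback can carry back to the LMS."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    # Rubric rows, outcome alignments, and feedback text are all
    # lost at this boundary; only the ratio survives.
    return max(0.0, min(1.0, points_earned / points_possible))

print(to_lti_score(86, 100))  # 0.86
```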

Natural Language Processing and Machine Learning: The Big Opportunity

The biggest opportunity for leveraging a modern data and tech stack is to take a system like Canvas and wrap or underlie it with these technologies. By doing so, you can take what happens in a Canvas course and round-trip it with program changes and new course offerings to provide continuous visibility into learning and curriculum performance. Machine learning and natural language processing let us inject a recommended learning outcome, embedded in a rubric, into a course. All faculty changes are then captured and aligned to the original expectation to see whether or not they represent a better way to teach that curriculum.

Challenges to This Approach

Of course, there are several challenges to this approach. Faculty can change things like learning outcomes, the assessment rubric, or the LMS grading rubric before using them. So even if there is a standard, nothing in the tool or the cultural practice says that faculty must follow it. It’s up to them to decide what to do, and they’ll always be prone to changing things. For us, this means we must be able to capture these changes. Moreover, versions and variants of these artifacts (such as student learning outcome statements) need consistent alignment over time if they are to be used for official records, comprehensive learner records, or badge wallets.

This overall process is certainly not sustainable if managed by humans, because it only creates more text-based work for data stewards or institutional researchers, who end up looking for needles in haystacks of meaning. So the system must be able to take the textual data that faculty put into their syllabi, courses, and rubrics and understand what it means.

Solution and Tech Stack

eLumen Insights’ Data & Analytics Architecture

Figure 1 shows a 10,000-foot view of the data and analytics architecture of Insights, eLumen’s learning outcomes assessment tool. This architecture features the critical parts of most data analytics solutions: data ingestion, data curation, representation and storage, and analytics. The primary data sources used in eLumen Insights are on the left-hand side of the architecture. These typically include the LMS, which provides all the data items related to the learning processes within the institution. The Student Information System (SIS) then provides data about students, including enrollments and demographics, among other data items. There may also be other data sources, such as additional ERP systems, containing useful data that cannot always be found inside the LMS or the SIS. The ingestion of data from these sources typically happens through APIs provided by these systems.

Figure 1. High-level architecture of Insights’ data and analytics architecture.

We perform data curation tasks with the help of machine learning and natural language processing, and end users further validate the outputs of these tasks before they are persisted in our data stores. The resulting curated data is persisted in three fundamentally different storage types: graph databases to store entities and relationships, key-value stores for app configuration data, and a data lake for online analytical processing and other data analytics tasks. The machine learning and analytics layer builds on the curated and persisted data to provide analytics services. This includes machine learning performed on graph data to learn missing attributes or relationships in the graph, as well as self-service analytics through tools such as Tableau and Power BI.

Data Sourcing and Quality Challenges

There are some challenges in dealing with data from the SIS and LMS. First, we’re dealing with multiple heterogeneous datasets that include structured, semi-structured, and unstructured data. We also face the data silos often present in these scenarios: data about students lives and evolves in the SIS, while the LMS may hold different and potentially inconsistent data about the same entities. Moreover, the domain is highly dependent on unstructured, qualitative data embedding bits of information that are of utmost importance for learning outcome performance assessment. Unstructured data curation is therefore a fundamental task for enabling data analytics and decision-making within eLumen Insights.
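As a small illustration of the silo problem (field names and records below are hypothetical, not eLumen's actual schema), a curation step might merge per-student records from the two systems, keep data unique to each, and flag the fields where they disagree for human review:

```python
def reconcile(sis: dict, lms: dict) -> tuple[dict, list]:
    """Merge two silos' views of a student, treating the SIS as the
    system of record and flagging fields where the LMS disagrees."""
    merged, conflicts = dict(sis), []
    for field, lms_value in lms.items():
        if field not in merged:
            merged[field] = lms_value  # keep LMS-only data
        elif merged[field] != lms_value:
            conflicts.append((field, merged[field], lms_value))
    return merged, conflicts

sis = {"id": "S123", "major": "Computer Science", "email": "ana@uni.edu"}
lms = {"id": "S123", "email": "ana@gmail.com", "sections": ["CS101-01"]}
merged, conflicts = reconcile(sis, lms)
print(conflicts)  # [('email', 'ana@uni.edu', 'ana@gmail.com')]
```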

Data Curation

Let’s now look at one of the most critical unstructured and qualitative data elements, learning outcomes, and what it takes to curate them for downstream data analytics. Data curation tasks for learning outcomes involve extracting metadata from structured and unstructured data, mapping these data to well-known taxonomies in the domain, and identifying and tagging elements in the data to surface them to end users by highlighting them in the front-end application. Figure 2 shows how a learning outcome encodes educational objectives within the well-known Bloom’s taxonomy. Here, the learning outcome statement uses the word design (highlighted in the statement), which corresponds to the create level at the tip of the pyramid.
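A minimal sketch of this kind of mapping, assuming a hand-curated (and deliberately incomplete) action-verb list per Bloom level; a production curator would use richer NLP than keyword lookup, but the idea is the same:

```python
# Hypothetical, non-exhaustive action-verb sets for Bloom's levels,
# ordered from the base of the pyramid to the tip.
BLOOM_VERBS = {
    "remember":   {"define", "list", "recall", "identify"},
    "understand": {"explain", "summarize", "classify", "describe"},
    "apply":      {"use", "implement", "solve", "demonstrate"},
    "analyze":    {"compare", "differentiate", "examine", "organize"},
    "evaluate":   {"judge", "critique", "assess", "defend"},
    "create":     {"design", "construct", "develop", "formulate"},
}

def bloom_level(outcome: str):
    """Return the highest Bloom level whose verbs appear in the statement."""
    words = set(outcome.lower().replace(",", " ").split())
    for level in reversed(list(BLOOM_VERBS)):  # start at the tip: create
        if BLOOM_VERBS[level] & words:
            return level
    return None

print(bloom_level("Design a secure network for a small campus"))  # create
```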

Figure 2. Mapping a learning outcome statement to a level of Bloom’s taxonomy.

Another example of natural language processing-based data curation is text classification and similarity applied to learning outcomes (see Figure 3). In the case of text classification, we need to identify whether a particular learning outcome corresponds to a course-level, program-level, or institutional-level learning outcome. In the example shown in Figure 3a, the statement corresponds to a course-level learning outcome. Another type of data curation (Figure 3b) finds the similarity between different learning outcomes that, although sharing the same goal, may use different wording and paraphrasing. Here, we want to identify situations where separately written learning outcomes share the same goals. We do that by running text similarity techniques with the help of natural language processing tools.
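As a toy sketch of the similarity side (real pipelines would use embeddings or paraphrase-aware models rather than raw word overlap, and the outcome statements below are invented), even a bag-of-words cosine similarity separates reworded twins from unrelated outcomes:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two outcome statements."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

o1 = "Students will design secure network architectures"
o2 = "Design secure architectures for computer networks"
o3 = "Summarize the history of higher education"
print(cosine_similarity(o1, o2) > cosine_similarity(o1, o3))  # True
```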

Figure 3. Data curation performed on unstructured data that uses (a) text classification and (b) text similarity techniques.

Representation and Persistence

The purpose of this data curation task is to enable downstream analytics that would be difficult or impossible if done directly on unstructured and qualitative data. We developed suitable representation and persistence mechanisms for this (see Figure 4).

The first one, graph data, aims to capture the domain’s key entities and relationships as we extract latent knowledge from unstructured data like learning outcome statements. We implement this through graph databases; more specifically, we use AWS’s Neptune database, which enables us to leverage graph machine learning (graph neural networks) to identify missing attributes and discover relationships among entities in the graph. The second representation we use is a data lake with AWS Redshift. This allows us to arrange data and build dashboards, visualizations, and reports for analyzing student learning outcome performance using self-service analytics tools like Power BI and Tableau, empowering end users to leverage their local and domain knowledge. Finally, these technologies are arranged in a serverless architecture where we leverage patterns such as CQRS and Saga to maintain the eventual consistency of our distributed stores.
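To give a feel for the graph representation, here is a hedged, in-memory sketch (labels, identifiers, and relationship names are illustrative, not the production schema); the real system persists this in Neptune and layers graph ML on top, but the traversal idea is the same:

```python
from collections import defaultdict

class Graph:
    """Minimal in-memory stand-in for a property graph of
    courses, learning outcomes, and their alignments."""
    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, src, rel, dst):
        self.edges[(src, rel)].add(dst)

    def out(self, src, rel):
        return self.edges[(src, rel)]

g = Graph()
g.add("CS101", "HAS_OUTCOME", "CLO-1")
g.add("CLO-1", "ALIGNS_TO", "PLO-3")   # course outcome -> program outcome
g.add("CLO-1", "BLOOM_LEVEL", "create")

# Traversal: which program-level outcomes does CS101 ultimately feed?
plos = {p for clo in g.out("CS101", "HAS_OUTCOME")
          for p in g.out(clo, "ALIGNS_TO")}
print(plos)  # {'PLO-3'}
```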

Figure 4. Representation and persistence of curated data in eLumen Insights.


Our Roadmap for eLumen Insights From A Data Perspective

One of the most important things we hear from our design partners and early adopters is that they’ve taken previously unusable data and made sense of it. Without such a tool, you might spend half a committee meeting (manually) making sense of that data. Our solution is much more generous to the end user.

Natural language processing and machine learning open up a more natural engagement with the overall process. Whether it’s program pathways, micro-credentialing, badging, badge or achievement wallets, better curriculum mapping, or backward design insights, we give faculty better tools to evolve their courses and programs. Natural language processing and machine learning are both valuable tools in helping faculty reflect on and practice shared governance (see Figure 5). They’re not a replacement for faculty; they act as super curators. Using natural language processing, machine learning, and graphs (three key tools in our arsenal) helps identify and align what used to be unstructured data in close to real time, rather than the days, weeks, or in many cases years it could take to decide manually, making sustainable tasks that used to be nearly impossible.

Figure 5. Natural language processing and machine learning use cases for assessments.

Next Steps in Our Data Journey

Tons of data that used to go unused, because it wouldn’t or couldn’t be rationalized, can now be part of an ongoing and continuous quality cycle. For us, this faculty self-service and participation in shared governance is something that will be uniquely supported and enabled. Previous enterprise data systems required an expert to take data, enter it, curate it, figure out what it meant, and then present it back to people. Now, a system does that with a very high degree of fidelity to the faculty’s original work.

With recent advances in technologies such as Large Language Models (LLMs) and the services built on top of them, you no longer need your own small data center to make many of these techniques useful in your apps. Platforms like AWS, Azure, and Google are starting to build these capabilities in, and they are becoming required tools for building modern applications. The new world doesn’t want to be bothered to enter data; it wants to be told what a piece of data means. The solution is a tool to find the data, screen-scrape it, pull it in, grab the unstructured data and its context, parse it, and explain what is there. Looking at its potential for the future, this set of techniques will undoubtedly unlock a lot of value in the EdTech domain and across many other industries.

Original post here.

Authored by Joel Hernandez (eLumen) and Carlos Rodríguez.

About Joel Hernandez:

Joel Hernandez has been an executive at eLumen since it was commercialized from the Minnesota State University system. He is a frequent speaker on assessment and competency-based education technologies and is one of the architects of the 1EdTech CASE and CLR standards.

About Carlos Rodríguez:

Carlos Rodríguez is a project leader and collaborator in the data space with over 15 years of experience applying data science research and innovation across multiple domains in the software industry. He’s applied this work to business process management systems, compliance, crowdsourcing, service-oriented computing, software security vulnerability management, and EdTech, among other domains.



