In this project, we used Natural Language Processing (NLP) to clean up aircraft maintenance records, leading to improved operational information and decision-making.
From Messy Notes to Reliable Insights
Published Nov 19, 2022
Rob Keefer
In environments like aircraft maintenance, accurate data drives everything, including staffing decisions, inventory planning, reliability predictions, and ultimately flight safety. Yet aircraft maintainers often record their work assuming a human colleague will read and understand it later. Structured codes (the fields systems actually rely on for reporting and analytics) can be incomplete or inaccurate, while the free-text narrative tends to be more accurate. Unfortunately for an AI system, these highly skilled technicians speak in a shared shorthand of acronyms, part numbers, and procedure references, making it difficult for a Natural Language Processing (NLP) system to process the free-text narrative.
Subject matter experts familiar with these records estimate that the coded fields are accurate only about 60% of the time. Maintainers might select the wrong code due to time pressure, misremember a category, or simply fail to align the action taken with the available options. This results in downstream systems presenting misinformation, which in turn leads to decisions based on flawed inputs. Operational situational awareness suffers.
The goal of this research was to use the more reliable unstructured text to predict and correct the structured codes. To solve this problem, we developed a practical, human-in-the-loop NLP pipeline tailored to the peculiarities of short, jargon-heavy maintenance narratives.
The Unique Challenges of Maintenance Text
Maintenance records do not read like the everyday language found in news articles or customer reviews. They're terse: a few words or phrases packed with domain-specific abbreviations. An entire narrative might be "IAW TO-001-8327," "R2 LT tire," or "TCTO". Terms are concatenated, punctuation is inconsistent, and there is little conventional sentence structure. Standard off-the-shelf NLP tools are trained on long, grammatical English such as news articles, and they struggle in this space: tokenization breaks important phrases apart, lemmatization misses synonyms maintainers use interchangeably, and common words like "IAW" (in accordance with) are not filtered out, even though they add no value to understanding.
An exploratory analysis of over a million C-17 records confirmed how little the standard NLP tools would help. Of the one-word tokens in our training data, more than half weren't standard English words. They were acronyms, codes, misspellings, or shortened phrases unique to the maintainer community. Out-of-the-box processing wasn't enough to handle this.
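A small illustration of the tokenization problem described above. The narrative and the regular expressions here are our own simplified examples, not the project's actual tokenizer: a generic letters-only tokenizer shreds a task-order reference like "TO-001-8327" into meaningless fragments, while a pattern that knows about hyphenated codes keeps it whole.

```python
import re

narrative = "R2 LT tire IAW TO-001-8327"

# A generic letters-only tokenizer shreds the task-order reference
generic = re.findall(r"[A-Za-z]+", narrative)
# ['R', 'LT', 'tire', 'IAW', 'TO']

# A pattern aware of hyphenated codes keeps "TO-001-8327" and "R2" intact
domain = re.findall(r"[A-Za-z]+(?:-\d+)+|\w+", narrative)
# ['R2', 'LT', 'tire', 'IAW', 'TO-001-8327']
```

Note how the generic tokenizer loses both the tire designation "R2" and the entire task-order number, which is exactly the information a classifier needs.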
A Tailored NLP Pipeline That Respects the Domain
The solution was a custom pipeline using open-source libraries, designed from the ground up for this kind of text:
- Custom tokenizer: Splits intelligently, preserving critical multi-word phrases like task order numbers or part identifiers.
- Synonym and lemmatization steps: Normalizes variations ("perform," "performed," "performing") to their root while respecting domain context.
- Domain-specific stop-word removal: Filters out high-frequency but low-information terms that would otherwise dilute meaning.
- TF-IDF vectorization: Turns the cleaned tokens into numerical features suitable for machine learning.
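The four steps above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual code: the tokenizer pattern, the stop-word list, and the sample narratives are all assumptions made for the example.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical domain stop words: high-frequency but low-information terms
DOMAIN_STOP_WORDS = {"iaw", "cw", "ck"}

def maintenance_tokenizer(text):
    # Split while preserving hyphenated codes (e.g. task-order numbers),
    # then drop the domain-specific stop words
    tokens = re.findall(r"[A-Za-z]+(?:-\d+)+|\w+", text.lower())
    return [t for t in tokens if t not in DOMAIN_STOP_WORDS]

# TF-IDF turns the cleaned tokens into numerical features
vectorizer = TfidfVectorizer(tokenizer=maintenance_tokenizer, token_pattern=None)
X = vectorizer.fit_transform(["R2 LT tire IAW TO-001-8327", "CW TCTO insp"])
```

A synonym/lemmatization step would slot into `maintenance_tokenizer` before the stop-word filter; it is omitted here for brevity.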
The processed narratives then fed supervised classifiers (primarily multinomial logistic regression trained with Stochastic Gradient Descent). Hyperparameters were tuned efficiently using randomized search over the parameter grid, cutting compute time dramatically compared with an exhaustive search while maintaining high performance.
To handle the extreme imbalance in code frequencies (some codes appear thousands of times, others barely a handful), the approach is split into two models:
- One trained on high-frequency codes.
- Another on low-frequency ("sparse") codes.
A first-pass classifier determines which model to use, combining the strengths of both for an overall accuracy of 90%.
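The routing logic described above can be sketched in a few lines. The stub class and the labels here are illustrative stand-ins, not the project's models; any classifier with a scikit-learn-style `predict` could fill these roles.

```python
class StubModel:
    """Stand-in for a trained classifier with a predict(X) API."""
    def __init__(self, label):
        self.label = label
    def predict(self, X):
        return [self.label for _ in X]

def predict_code(features, router, frequent_model, sparse_model):
    # First pass: decide which frequency regime the record belongs to,
    # then let the matching specialist model assign the actual code
    regime = router.predict([features])[0]
    model = frequent_model if regime == "frequent" else sparse_model
    return model.predict([features])[0]

router = StubModel("frequent")
code = predict_code([0.1, 0.9], router, StubModel("799"), StubModel("812"))
```

Splitting by code frequency lets each model specialize: the high-frequency model sees enough examples per class to generalize, while the sparse model isn't drowned out by the common codes.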
The pipeline predicts not just one code but the top four most likely options for each of five key fields (When Discovered, How Malfunctioned, Type Maintenance, Action Taken, and Work Unit). Averaged across all fields, the top-four predictions hit 93% accuracy, a significant improvement over the estimated 60% human-entered baseline. This system also supported sub-second responses for real-time mobile or web applications.
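Returning the top four candidates instead of a single answer is straightforward given class probabilities. A minimal sketch, assuming a classifier exposing scikit-learn's `predict_proba` and `classes_`:

```python
import numpy as np

def top_k_codes(model, X, k=4):
    """Return the k most likely code labels for each record, best first."""
    proba = model.predict_proba(X)                      # (n_records, n_codes)
    order = np.argsort(proba, axis=1)[:, ::-1][:, :k]   # top-k column indices
    return model.classes_[order]
```

Running this once per field (When Discovered, How Malfunctioned, and so on) yields the four ranked suggestions a maintainer can confirm or override, which is where the human-in-the-loop aspect comes in.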
Why This Matters for Human-Machine Teaming
This work wasn’t intended to replace maintainers or automate their expertise. It's about partnership. Maintainers already provide the most accurate description in their narrative; the system simply extracts and leverages that truth to improve the structured data everyone else depends on. The cleaned data enables improved analytics, better inventory forecasting, more precise staffing, and, ultimately, stronger operational awareness.
In complex technical domains, the real value often lies in respecting how humans actually work, and then building tools that augment rather than fight against those habits. By starting with the maintainer's own words and applying thoughtful, domain-aware processing, we can turn noisy, human-generated records into reliable, actionable information. At POMIET, we see this as classic human-machine synergy: maintainers bring domain knowledge and real-world context; algorithms bring scale, consistency, and pattern detection. Together, they produce outcomes neither could achieve alone.
This work was published in Maintenance, Reliability and Condition Monitoring.
Looking for a guide on your journey?
Ready to explore how human-machine teaming can help to solve your complex problems? Let's talk. We're excited to hear your ideas and see where we can assist.
Let's Talk