Programmatically validating NPIs

What is an NPI?

National Provider Identifiers (NPIs) uniquely identify healthcare providers across a wide variety of healthcare applications, such as billing and payment, compliance, personal health records, and healthcare eligibility and enrollment. NPIs are publicly available and can be searched in the CMS National Plan and Provider Enumeration System (NPPES). Any healthcare provider who transmits health information in electronic form is registered through CMS and will have an NPI. Two types of provider entities are assigned NPIs:

Type 1: Individual. These identify people, such as physicians, nurses, dentists, and technicians. For example, Jane Doeblin, an OB/GYN in Rochester, NY, has the NPI 1780647248.
Type 2: Organizational. These identify anything that is not a person, such as a hospital, group, laboratory, or nursing home. For example, the Mount Sinai hospital at 1468 Madison Ave has the NPI 1629410592.

Why may they not be valid?

Invalid NPIs are common, for reasons that depend on the source of the data:

Typo / human error due to manual entry
Unique identifier that is not an NPI, e.g. license number or internal unique identifiers for a specific organization
Leakage of other non-unique information, e.g. a taxonomy code or other billing-related codes
General data uncleanliness: for example, erroneous string padding, missing values, bad defaults, etc.

Because NPIs are used in so many places, it may often be necessary to check that an NPI is valid to identify usable records or enforce form validation.
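
Some of the problems listed above (padding, missing values, bad defaults) can be cleaned up before any validation runs. The sketch below is one possible way to do that. The helper name `normalize_npi` and the placeholder values it treats as missing are assumptions for illustration, not part of any CMS specification.

```python
from typing import Optional


def normalize_npi(raw: object) -> Optional[str]:
    """Return a trimmed candidate NPI string, or None for missing values.

    Hypothetical helper: the cleanup rules are assumptions about common
    data problems and should be adapted to your own sources.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    # Integers that were read as floats often arrive as "1234567893.0".
    if text.endswith(".0"):
        text = text[:-2]
    # Treat empty strings and common placeholder defaults as missing.
    if text == "" or text.lower() in {"nan", "none", "null", "n/a"}:
        return None
    return text
```

This only tidies the value; it does not say anything about whether the result is a valid NPI.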

Method 1: Dataset comparison

The most reliable way to identify valid NPIs is to compare against the NPPES dataset directly, which can be downloaded from the NPPES website as a full replacement file or an incremental file.

This lends itself well to databases, for example:

select a.npi
    , nppes.npi is not null as is_valid_npi
    , nppes.entity_type_code
    , ...
from my_dataset as a 
left join nppes
    on a.npi = nppes.npi
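
To try the join without a full NPPES load, here is a small sketch using Python's built-in `sqlite3` module. The two-column `nppes` table is a toy stand-in (an assumption, the real file has many more columns), and the rows are the two example NPIs from earlier plus one made-up value.

```python
import sqlite3

# In-memory database with a toy schema; real NPPES data has many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table nppes (npi text primary key, entity_type_code integer);
    create table my_dataset (npi text);
    insert into nppes values ('1780647248', 1), ('1629410592', 2);
    insert into my_dataset values ('1780647248'), ('1629410592'), ('1234567890');
    """
)

# Same left join as above: unmatched rows get is_valid_npi = 0.
rows = conn.execute(
    """
    select a.npi
        , nppes.npi is not null as is_valid_npi
        , nppes.entity_type_code
    from my_dataset as a
    left join nppes
        on a.npi = nppes.npi
    order by a.npi
    """
).fetchall()

for npi, is_valid, entity_type in rows:
    print(npi, bool(is_valid), entity_type)
```

The unmatched value comes back with `is_valid_npi` equal to 0 and a null entity type, so it can be filtered or flagged downstream.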

Pros	Cons
Comprehensive, with no chance of false positives (an NPI labeled as valid when it is not)	The NPPES dataset needs to be updated regularly: new providers are added and providers become inactive weekly
Entity type code and other relevant data are also available for additional filtering	Easy to implement as part of an ETL inside a database, but loading the data into memory for real-time querying should be avoided: in December 2021, the full NPPES dataset was 8.5 GB and included 7,122,815 providers. Consider instead the NPPES API, or Bloom filters if false positives are tolerable
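
For the Bloom filter option, the idea is to keep a compact bit array instead of the full list of NPIs. A lookup can return a false positive but never a false negative. The class below is a minimal sketch, not a production implementation. The sizes are arbitrary illustration values, and the double-hashing scheme built on SHA-256 is one common choice among several.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive all bit positions from two 64-bit halves of one SHA-256
        # digest (double hashing) instead of computing k separate hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Toy sizes; a real filter would be sized from the NPPES row count.
known_npis = BloomFilter(num_bits=1024, num_hashes=5)
for npi in ("1780647248", "1629410592"):
    known_npis.add(npi)

print("1780647248" in known_npis)  # always True for values that were added
```

As a rough guide, a Bloom filter needs on the order of 10 bits per entry for about a 1% false positive rate, which is far smaller than the full dataset.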

Method 2: Simple validation

A more programmatic approach is validating the identifier based on the definition. From CMS:

An NPI is a 10 digit numerical identifier for providers of health care services. It is national in scope and unique to the provider… The number itself is not a “smart” number, i.e., there is no intelligence built into it

In addition, all NPIs must start with 1 or 2. We can construct a simple function that can check quickly that the identifier meets these requirements:

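def is_npi_valid(npi: str) -> bool:
    # Chain the checks with `and` so they short-circuit: building a list for
    # all() would evaluate npi[0] even for an empty string and raise IndexError.
    return (
        len(npi) == 10
        and npi.isascii()
        and npi.isdigit()
        and npi[0] in ("1", "2")
    )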

Pros	Cons
Simple to understand	Gets the job done only for obviously invalid data
Computationally simple	Potentially many false positives. For example, 1234567890 will never be a valid NPI
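
To see what the structural rules accept and reject, here is an equivalent sketch written as a regular expression and run over a few sample values. The helper name `looks_like_npi` and the sample list are made up for illustration.

```python
import re

# Structural pattern from the rules above: 10 digits, first digit 1 or 2.
NPI_PATTERN = re.compile(r"[12][0-9]{9}")


def looks_like_npi(value: str) -> bool:
    """Return True when the value has the shape of an NPI (hypothetical helper)."""
    return NPI_PATTERN.fullmatch(value) is not None


samples = ["1780647248", "1234567890", "178064724", "3780647248", "17806A7248", ""]
results = {s: looks_like_npi(s) for s in samples}

for value, ok in results.items():
    print(repr(value), ok)
```

Only the first two samples pass. The second one, 1234567890, is the false positive mentioned in the table: it has the right shape but is not a real NPI.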

Method 3: Check digit validation

While the NPI is not a smart identifier and is randomly generated (i.e., not incremental: new NPIs are assigned via a scattering algorithm), it has a built-in check digit based on Luhn’s algorithm that can be used to validate the number, similar to a credit card number. CMS publishes more information about how it uses the check digit.

A simplified version of Luhn’s algorithm (assuming the input is a 10-digit NPI) could look like this in Python:

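def is_valid_npi(npi: str) -> bool:
    # Reject anything that is not exactly 10 ASCII digits; int() below would
    # otherwise raise ValueError on non-numeric input.
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False

    # Luhn doubling table: each digit maps to the digit sum of its double.
    ret = {0:0,1:2,2:4,3:6,4:8,5:1,6:3,7:5,8:7,9:9}
    # 24 is the fixed Luhn contribution of the 80840 prefix that is implied
    # in front of every NPI; the slices skip the final check digit.
    s = 24 + sum(int(d) for d in npi[-3::-2]) + sum(ret[int(d)] for d in npi[-2::-2])
    return 9*s % 10 == int(npi[-1])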
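
The constant 24 in the function above can look like magic. It comes from the 5-digit prefix 80840 that is placed in front of the NPI before the Luhn check is applied. The sketch below runs an unsimplified Luhn check over the full 15-digit string, which should agree with the shortcut version. The function names are made up for illustration.

```python
def luhn_is_valid(number: str) -> bool:
    """Standard Luhn check over a string of ASCII digits, check digit last."""
    total = 0
    # Walk right to left; double every second digit, starting with the digit
    # immediately to the left of the check digit.
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_valid_npi_with_prefix(npi: str) -> bool:
    """Validate a 10-digit NPI by running Luhn over "80840" + npi."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    return luhn_is_valid("80840" + npi)


print(is_valid_npi_with_prefix("1780647248"))  # example NPI from the introduction
print(is_valid_npi_with_prefix("1234567890"))  # right shape, wrong check digit
```

In the prefix 8-0-8-4-0, the digits 0 and 4 fall on doubled positions (contributing 0 and 8) and the remaining 8, 8, 0 are added as they are, giving 0 + 8 + 8 + 8 + 0 = 24.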

Pros	Cons
Still computationally simple	More complicated than the simple approach
Identifies all potential future NPIs successfully and significantly reduces the number of potential false positives. For example, while the simple validation approach above marks every number between 1,000,000,000 and 2,999,999,999 as valid, the check digit marks only a tenth of the numbers in this range as valid (200M)	Even if an NPI passes the check digit, it may still not be an active or currently in-use NPI

Summary

Three approaches to NPI validation were presented, along with simple Python and SQL implementations and the pros and cons of each. The best method for your use case will depend heavily on:

Point in time of validation: Are you doing form validation, or just cleaning up data inside your database? How fast does the validation need to be?
Cleanliness of NPIs: How often are NPIs invalid? Do you know why, and is it a problem you created or one that you can control? Are most invalid NPIs 9 digits long? Do most contain alphabetical characters?
Potential consequences of false positives: What happens if you have an invalid NPI in your final dataset? Do you care? Do downstream processes require or assume NPI to be unique?
Desired complexity & scale: How much infrastructure and time are you willing to put into this issue? How many NPIs do you have? How often will you receive new ones? Do you need to check that the NPI is both valid and active?
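
Several of the questions above (how often NPIs are invalid, and why) can be answered by profiling a sample of your data before picking a method. The sketch below counts failure reasons across a list of values. The function names, the reason labels, and the sample values are all assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def invalid_reason(npi: Optional[str]) -> Optional[str]:
    """Return a short label for why a value fails, or None if it passes."""
    if npi is None or npi.strip() == "":
        return "missing"
    npi = npi.strip()
    if not (npi.isascii() and npi.isdigit()):
        return "non_digit_characters"
    if len(npi) != 10:
        return f"wrong_length_{len(npi)}"
    if npi[0] not in "12":
        return "bad_first_digit"
    # Luhn check with the constant 24 for the implied 80840 prefix.
    doubled = sum((2 * int(d)) % 10 + (2 * int(d)) // 10 for d in npi[-2::-2])
    plain = sum(int(d) for d in npi[-3::-2])
    if (24 + doubled + plain + int(npi[-1])) % 10 != 0:
        return "bad_check_digit"
    return None


def profile(npis: Iterable[Optional[str]]) -> Counter:
    """Count how many values pass ("ok") or fail for each reason."""
    return Counter(invalid_reason(n) or "ok" for n in npis)


sample = ["1780647248", "1234567890", "178064724", "17806A7248",
          None, " 1629410592 ", "3780647248"]
counts = profile(sample)
print(counts.most_common())
```

If most failures turn out to be a single category, such as 9-digit values, that points to an upstream fix rather than heavier validation.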