Use Benford's Law To Detect Fraud - Python
This is a homework assignment for the DSO 562 Fraud Analytics class taught by Professor Stephen Coggeshall at USC. It uses a credit card transaction data set.
Benford's Law
All images, data, and Python code can be found here
Benford’s Law is a non-intuitive fact that has been known since 1881 but wasn’t applied to financial data until 1989, by Mark Nigrini. It states that the first digit of many real-world measurements is not uniformly distributed: the low digits 1, 2, and 3 show up more frequently than the higher digits 4 through 9. The chart below shows how often each first digit is expected to appear in a population:
While Benford’s Law should not be used as a final decision-making tool by itself, it can be a useful screening tool to indicate that a set of financial records deserves a deeper analysis.
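The expected frequencies follow from the standard Benford formula P(d) = log10(1 + 1/d) for a first digit d. A minimal sketch that computes them (the variable name `benford` is only illustrative):

```python
import math

# Expected share of each first digit d under Benford's Law: log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, share in benford.items():
    print(f"{digit}: {share:.1%}")

# Combined share of the low digits 1 and 2, used later as the benchmark
low_share = benford[1] + benford[2]
print(f"1 or 2: {low_share:.1%}")
```

The nine shares sum to 1, and digits 1 and 2 together account for a little under half of all first digits.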
Data Set
The data set contains credit card transactions made in 2010 by governmental organizations. The data has been manipulated to serve the academic purpose of building a supervised fraud algorithm. It has 96,753 records and 10 fields.
We only consider P transactions and exclude transactions from FedEx, which leaves 84,623 records in this set.
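A filter of this kind might look like the sketch below. The column names `Transtype` and `Merch description`, the matching on the text "fedex", and the toy rows are all assumptions made for illustration; the real field names should be checked against the data set.

```python
import pandas as pd

# Toy rows standing in for the real data (hypothetical values and column names)
df = pd.DataFrame({
    "Transtype": ["P", "P", "A", "P"],
    "Merch description": ["FEDEX SHP 12/01/10", "OFFICE DEPOT #191",
                          "AMAZON.COM", "FedEx Ground"],
    "Amount": [3.62, 31.42, 178.49, 3.62],
})

# Keep only P transactions
is_p = df["Transtype"] == "P"

# Flag FedEx merchants with a case-insensitive text match (assumed rule)
is_fedex = df["Merch description"].str.contains("fedex", case=False, na=False)

subset = df[is_p & ~is_fedex]
print(len(subset), "records kept")
```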
Build a model
According to Benford's Law, the low first digits (1 and 2) account for about 47.7% of all first digits. We group all transactions by merchant number (Merchnum) and, for each group, compare the observed proportion of low first digits with this expectation. Ideally the resulting ratio is close to 1, so we highlight the groups whose ratio is far from 1. We then repeat the process for card number (Cardnum).
We also need a smoothing formula that pulls the ratio toward 1 when a group is too small. The digit distribution of a group with only a few members is not representative, so pulling its ratio toward 1 keeps us from highlighting such groups in the final step.
Step 1: Get the first digit of transaction amounts.
Since the amount column is in dollars, multiplying all values by 100 converts them to whole cents, so the first digit is non-zero even for amounts below one dollar. We double-checked and confirmed that.
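A minimal sketch of this step, assuming the amounts live in a column named `Amount` (an assumed name) and are all positive. The toy values are made up for illustration.

```python
import pandas as pd

# Toy amounts in dollars (hypothetical values)
df = pd.DataFrame({"Amount": [3.62, 0.53, 1250.00, 47.10, 0.05]})

# Convert to whole cents; round() guards against floating-point noise
cents = (df["Amount"] * 100).round().astype(int)

# The first character of the cents value is the leading digit
df["First_Digit"] = cents.astype(str).str[0].astype(int)

# Sanity check: no leading zero (would fail for a zero amount)
print((df["First_Digit"] != 0).all())
```

Working in cents means that an amount such as 0.53 yields a first digit of 5 rather than 0. A zero or negative amount would need separate handling.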
Step 2: Define a function that measures unusualness and apply a smoothing function
In this code, 'stat' represents the level of unusualness: the higher the 'stat', the more alarming the group.
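The exact formula is not spelled out in the text, so the sketch below shows only one plausible way to build such a score. It assumes the raw measure is the observed odds of a low first digit (1 or 2) divided by the Benford odds, taken as max(ratio, 1/ratio) so that both excesses and shortages of low digits score above 1, and it assumes a logistic weight for the smoothing. The name `benford_stat` and the parameters `n_mid` and `c` are illustrative choices, not values from the homework.

```python
import math

import pandas as pd

P_LOW = math.log10(3)  # Benford share of first digits 1 or 2, about 0.477


def benford_stat(first_digits: pd.Series, n_mid: float = 15, c: float = 3) -> float:
    """Smoothed unusualness score for one group; 1.0 means 'as expected'."""
    n = len(first_digits)
    if n == 0:
        return 1.0
    n_low = int(first_digits.isin([1, 2]).sum())
    n_high = n - n_low

    if n_low == 0 or n_high == 0:
        raw = math.inf  # every amount falls on one side: maximally unusual
    else:
        # Observed odds of a low digit relative to the Benford odds
        ratio = (n_low / n_high) / (P_LOW / (1 - P_LOW))
        raw = max(ratio, 1 / ratio)

    # Logistic weight: near 0 for tiny groups, near 1 for large ones (assumed form)
    weight = 1 / (1 + math.exp(-(n - n_mid) / c))
    stat = 1 + (raw - 1) * weight
    return stat


small = benford_stat(pd.Series([1, 5, 7, 9]))              # 4 records
large = benford_stat(pd.Series([1] * 100 + [5] * 300))     # same mix, 400 records
print(round(small, 3), round(large, 3))
```

With the same proportion of low digits, the four-record group stays close to 1 while the 400-record group keeps nearly its full raw score. Note that this form of smoothing cannot tame a group with no low digits or no high digits at all: its score remains infinite regardless of size.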
Step 3: Group the data by Merchnum and by Cardnum, and apply the custom formula to the ‘First_Digit’ column of each group.
Below is the example for Cardnum:
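A sketch of the grouping step with pandas `groupby(...).apply(...)`. The snippet carries its own compact scoring function so that it runs on its own; that function, its parameters `n_mid` and `c`, and the toy card numbers are assumptions for illustration rather than the original homework code.

```python
import math

import pandas as pd

P_LOW = math.log10(3)  # Benford share of first digits 1 or 2


def benford_stat(first_digits, n_mid=15, c=3):
    """Assumed scoring rule: smoothed odds ratio of low (1, 2) vs. other digits."""
    n = len(first_digits)
    n_low = int(first_digits.isin([1, 2]).sum())
    n_high = n - n_low
    if n_low == 0 or n_high == 0:
        raw = math.inf
    else:
        ratio = (n_low / n_high) / (P_LOW / (1 - P_LOW))
        raw = max(ratio, 1 / ratio)
    return 1 + (raw - 1) / (1 + math.exp(-(n - n_mid) / c))


# Toy data: C1 looks Benford-like, C2 never shows a low digit, C3 is tiny
df = pd.DataFrame({
    "Cardnum": ["C1"] * 40 + ["C2"] * 40 + ["C3"] * 3,
    "First_Digit": [1] * 12 + [2] * 7 + [3] * 21 + [9] * 40 + [1, 5, 7],
})

# One score per card number
card_scores = df.groupby("Cardnum")["First_Digit"].apply(benford_stat).rename("stat")
print(card_scores)
```

The same call with `"Merchnum"` in place of `"Cardnum"` would give the merchant-level scores.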
Step 5: Sort values by the unusualness scores and get 40 records with the highest scores.
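A minimal sketch of the ranking step, using a made-up table of scores. The `stat` column name and the merchant IDs are illustrative.

```python
import math

import pandas as pd

# Hypothetical scores for 100 merchants, one of them infinite
scores = pd.DataFrame({
    "Merchnum": [f"M{i:03d}" for i in range(100)],
    "stat": [1 + 0.01 * i for i in range(99)] + [math.inf],
})

# Highest unusualness first, then keep the top 40
top = scores.sort_values("stat", ascending=False).head(40)
print(top.head())
```

Infinite scores sort to the top, so groups whose amounts all share the same kind of first digit appear first in the list.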
Double-check
I noticed that this merchant has the highest unusualness score: infinity
Now look at the details of this merchant
You can tell how unusual it is. The merchant always charged the same amount and charged only one card number, and on some days the transactions occurred several times a day. However, if a barbershop has only one customer who requests the same service every time and sometimes comes in quite often, there is nothing fraudulent here. Again, this tool serves only as a simple screening method; extra effort is needed to detect fraud with higher accuracy.