Use Benford's Law To Detect Fraud - Python
This is a homework assignment for the DSO 562 Fraud Analytics class taught by Professor Stephen Coggeshall at USC. It uses a credit card transaction data set.
All images, data, and Python code can be found here
Benford’s Law is a counterintuitive observation that has been around since 1881, but it was not applied to financial data until Mark Nigrini did so in 1989. It states that the first digit of many real-world measurements is not uniformly distributed: the low digits 1, 2, and 3 show up more often than the higher digits 4 through 9. The chart below shows how often each first digit is expected to appear in a population:
While Benford’s Law should not be used as a final decision-making tool on its own, it can be a useful screening tool for flagging a set of financial statements that deserves deeper analysis.
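The expected first-digit frequencies come from the standard Benford formula, P(d) = log10(1 + 1/d). As a small sketch (the function name `benford_probability` is my own, not from the original code), the expected shares can be computed like this:

```python
import math

def benford_probability(digit: int) -> float:
    """Expected share of numbers whose first digit is `digit` (1 through 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)

# Expected share for every possible first digit
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The shares run from about 30.1% for the digit 1 down to about 4.6% for the digit 9, and they sum to 1.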
The data set contains credit card transactions made by governmental organizations in 2010. The data has been manipulated for the academic purpose of building a supervised fraud algorithm. It has 96,753 records and 10 fields.
We consider only P transactions and exclude transactions from FedEx, which leaves 84,623 records.
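A rough sketch of this filtering step is below. The field names `Transtype` and `Merch description`, the way FedEx appears in the description, and the sample rows are all assumptions made for illustration; adjust them to the actual columns of the data set.

```python
def keep_transaction(row: dict) -> bool:
    """Keep type-P rows whose merchant description does not mention FedEx."""
    # Assumed field names; the real data set may label these differently.
    is_p = row["Transtype"].strip().upper() == "P"
    is_fedex = "FEDEX" in row["Merch description"].upper()
    return is_p and not is_fedex

# Made-up rows, for illustration only
rows = [
    {"Transtype": "P", "Merch description": "FEDEX SHP 12/23/09", "Amount": 3.62},
    {"Transtype": "P", "Merch description": "OFFICE SUPPLY CO", "Amount": 31.42},
    {"Transtype": "A", "Merch description": "OFFICE SUPPLY CO", "Amount": 178.49},
]

filtered = [row for row in rows if keep_transaction(row)]
print(len(filtered), "of", len(rows), "rows kept")
```

The same filter can be written as a boolean mask if the data is loaded into a pandas DataFrame.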
According to Benford's Law, the low first digits (1 and 2) account for about 47.7% of all values. We group all transactions by merchant number and, for each group, compare the observed share of low first digits with this expectation, expressed as a ratio normalized so that a group following Benford's Law scores close to 1. We then highlight the groups whose ratio is far from 1, and repeat the same process for card number.
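One way to build such a ratio, shown here only as a sketch, is to divide the observed count of low first digits by the count of high first digits, then divide again by the same quantity under Benford's Law. The exact definition used in the homework code is not shown in this write-up, so the function below (`low_high_ratio`, a name of my own) is an assumption.

```python
import math

P_LOW = math.log10(3)   # Benford share of first digits 1 and 2, about 0.477
P_HIGH = 1 - P_LOW      # Benford share of first digits 3 through 9

def low_high_ratio(n_low: int, n_high: int) -> float:
    """Observed low/high count ratio divided by the Benford low/high ratio.

    A group that follows Benford's Law returns a value close to 1.
    """
    if n_low == 0 and n_high == 0:
        raise ValueError("group has no transactions")
    if n_high == 0:
        return math.inf  # every first digit in the group is 1 or 2
    return (n_low / n_high) / (P_LOW / P_HIGH)

print(low_high_ratio(477, 523))  # close to 1 for a Benford-like group
print(low_high_ratio(90, 10))    # well above 1: too many low digits
```

A value well above 1 means low digits are over-represented, and a value well below 1 means they are under-represented.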
We also apply a smoothing formula that pulls the ratio toward 1 when a group is small. A group with only a few transactions does not have a representative digit distribution, so pulling its ratio toward 1 keeps it from being highlighted in the final step.
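The write-up does not give the smoothing formula itself, so the following is just one plausible form: a logistic weight on the distance from 1, controlled by the group size. The parameters `n_mid` and `c` are illustrative values chosen for this sketch, not values taken from the homework.

```python
import math

def smooth_ratio(ratio: float, n: int, n_mid: float = 15.0, c: float = 3.0) -> float:
    """Pull `ratio` toward 1 when the group size `n` is small.

    n_mid is the group size at which half of the deviation from 1 is kept;
    c controls how quickly the weight moves from 0 to 1 around n_mid.
    Both defaults are assumptions for illustration.
    """
    weight = 1.0 / (1.0 + math.exp(-(n - n_mid) / c))
    return 1.0 + (ratio - 1.0) * weight

for n in (1, 15, 1000):
    print(n, round(smooth_ratio(3.0, n), 3))
```

With these settings a ratio of 3 is reduced to nearly 1 for a single-transaction group, to 2 for a group of 15, and is left essentially unchanged for a group of 1,000. An infinite ratio stays infinite regardless of group size, because the weight is always positive.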
Because the amount column is in dollars, multiplying every value by 100 converts it to cents, which ensures the first digit is non-zero for any amount of at least one cent. We double-checked the data and confirmed this.
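As a minimal sketch of this step (the helper name `first_digit` is mine), the first digit can be read from the amount in cents. Rounding after the multiplication guards against floating-point noise such as 0.05 * 100 not being exactly 5.

```python
def first_digit(amount: float) -> int:
    """Return the first non-zero digit of a dollar amount, via its value in cents."""
    cents = round(abs(amount) * 100)
    if cents == 0:
        raise ValueError("amount is smaller than one cent")
    return int(str(cents)[0])

for amount in (0.05, 3.62, 178.49):
    print(amount, "->", first_digit(amount))
```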
In this code, 'stat' represents the level of unusualness: the higher the 'stat', the more alarming the group.
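To make the idea of an unusualness score concrete, here is a sketch that groups transactions by a key and scores each group. It assumes the score is the larger of the normalized low/high ratio R and its inverse 1/R, so that both too many and too few low digits count as unusual; that definition, the function name, and the sample records are assumptions. Smoothing for small groups is left out to keep the example short.

```python
import math
from collections import defaultdict

# Benford low/high ratio: share of digits 1-2 over share of digits 3-9 (about 0.912)
BENFORD_LOW_HIGH = math.log10(3) / (1 - math.log10(3))

def unusualness_by_group(records):
    """Map each group key to max(R, 1/R), where R is the normalized low/high ratio.

    `records` is an iterable of (key, amount) pairs with non-zero amounts.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_low, n_high]
    for key, amount in records:
        digit = int(str(round(abs(amount) * 100))[0])
        counts[key][0 if digit in (1, 2) else 1] += 1

    stats = {}
    for key, (n_low, n_high) in counts.items():
        if n_low == 0 or n_high == 0:
            stats[key] = math.inf  # one side is empty, so R or 1/R is unbounded
        else:
            r = (n_low / n_high) / BENFORD_LOW_HIGH
            stats[key] = max(r, 1 / r)
    return stats

# Made-up records: M1 always charges the same amount, M2 varies
records = [("M1", 30.00)] * 5 + [("M2", 12.50), ("M2", 48.10), ("M2", 23.99), ("M2", 75.00)]
stats = unusualness_by_group(records)
ranked = sorted(stats.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Under this definition, a group whose first digits are all low or all high gets a score of infinity, which is how a group that always charges the same amount ends up at the top of the ranking.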
Below is the same example for Cardnum.
I noticed that this merchant has the highest unusualness score: infinity.
Now let's look at the details of this merchant.
You can tell how unusual it is: the merchant charged the same amount every time, and only to a single card number. On some days the transactions occurred several times a day. However, if a barbershop has only one customer who requests the same service every time and sometimes comes in quite often, nothing fraudulent is going on. Again, this tool serves as a simple screening method; detecting fraud with higher accuracy takes additional work.