Reggie the Raccoon and the Curious Statistical Downfall of Sandra From Accounts
A Cautionary Tale About Fraud, Maths, and Why You Should Never Let a Raccoon Manage Your SIEM
Reggie was elbow-deep in a bin outside a Pret a Manger on Farringdon Road when his pager went off.
His pager. Not his phone, his pager. Reggie had strong opinions about phones, most of which centred on the fact that smartphones were a security liability wrapped in a dopamine delivery mechanism, and that anyone who used one for work communications deserved whatever MITM attack they subsequently received. His colleagues on the Morridian Fins IT Security team had stopped arguing about this approximately fourteen months ago, around the same time they had stopped arguing about his insistence on running Arch Linux, his refusal to use a mouse, and the persistent smell of rotting croissant that had become, over time, simply part of the ambient character of the server room.
He read the pager message, extracted a surprisingly intact pain au chocolat from beneath a discarded Metro newspaper, and began the short walk to the office. Something in the accounts payable database had triggered his anomaly detection script. This was either a crisis or a false positive. In Reggie's experience, it was usually both simultaneously, which was one of the less charming properties of statistical detection that nobody mentioned in the documentation.
Reggie had come to statistical fraud detection the way most security professionals come to anything useful: through a combination of catastrophic failure and sheer bloody-mindedness. Three years prior, Morridian Fins had suffered a fairly spectacular insider fraud incident: a mid-level finance manager named Derek had been submitting fabricated invoices for eighteen months before anyone noticed, primarily because the rule-based monitoring system had been configured to alert on invoices above £50,000, and Derek, with the lateral thinking of a man who had watched one too many true crime documentaries, had kept everything just beneath that threshold.
The post-incident review had been, in the understated tradition of British corporate culture, described as "a learning opportunity." Reggie had described it differently, in terms that earned him a formal note from HR and the undying respect of approximately half the security team.
What emerged from the wreckage was a mandate: build something better. Reggie, who had been reading about Benford's Law at 2 AM in his flat in Elephant and Castle while eating cold noodles directly from the container, thought he knew where to start.
Benford's Law, for those who have not spent their leisure time reading papers by 1930s physicists, is one of those mathematical phenomena that sounds made up until you actually look at the data, at which point it becomes one of those things you cannot stop seeing everywhere, like a particularly aggressive advertising campaign or the number of people on the Central line who are breathing directly onto the back of your neck.
Frank Benford observed in 1938 that in naturally occurring datasets (financial transactions, river lengths, population figures, the number of followers that LinkedIn thought-leaders claim to have), the leading digit is not uniformly distributed. About 30% of numbers begin with 1. About 18% begin with 2. By the time you reach 9, you are down to less than 5%. This is not a coincidence and it is not magic; it emerges from the logarithmic properties of data that spans multiple orders of magnitude, which is a sentence that sounds complicated but essentially means: numbers that occur naturally in the real world have this shape, and numbers that humans invent tend not to, because humans are psychologically terrible at being random.
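The expected frequencies fall straight out of the formula P(d) = log10(1 + 1/d), which you can verify in a few lines of Python rather than taking a 1930s physicist's word for it:

```python
import math

# Benford's Law: the expected frequency of leading digit d is log10(1 + 1/d)
for d in range(1, 10):
    print(f"digit {d}: {math.log10(1 + 1 / d):.1%}")
```

Digit 1 comes out at roughly 30.1%, digit 2 at 17.6%, and digit 9 at a meagre 4.6%, and the nine frequencies sum to exactly 1, because the logarithms telescope.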
Reggie had written a Python script (commented extensively, because he had once inherited an uncommented codebase that had given him what he described as "a spiritual injury") that ran nightly against the accounts payable database and performed a chi-squared goodness-of-fit test of the observed leading-digit distribution against the Benford expectation. A p-value below 0.05 triggered a pager alert. This was, in the scheme of things, an elegant solution. It was also the reason he was now walking through Clerkenwell at 11 PM with a pain au chocolat tucked under one arm and the quiet certainty of a raccoon who has seen things.
He arrived at the office, badged in, exchanged a nod with the night security guard, a former engineer named Phil who had made, by his own account, "a series of decisions" that had led to him sitting in a lobby in the City of London reading a John Grisham novel, and descended to the server room.
The server room was Reggie's natural habitat in a way that no outdoor space had ever quite managed to be. It was cool, it was dark, it smelled of ozone and slightly burnt dust, and no one came down here unless they needed something or had been sent to apologise. He settled into his chair, an ageing Aeron he had rescued from a skip on Bishopsgate, which he considered the single greatest achievement of his career, pulled up his terminal, and looked at what the script had found.
Derek's replacement, a woman called Sandra from Accounts, had been submitting expense claims in which approximately 38% of the line items began with the digit 5. The Benford expectation for leading digit 5 is 7.9%. This was not a rounding error. This was not a coincidence. This was, statistically speaking, a person who had decided that numbers beginning with 5 felt pleasingly round and professional, and had not reckoned with the possibility that a raccoon running chi-squared tests at midnight would find this interesting.
Reggie forwarded the report to his manager with a subject line that read: "Sandra. Again. You'll want to read this before tomorrow's standup." He then ate the pain au chocolat and opened a second terminal window, because the Benford alert was not, as it turned out, the only thing that needed his attention tonight.
The SIEM had also flagged something on the network side, specifically, a user account in the trading division whose daily data transfer volume had returned a Z-score of 4.2, which in statistical terms means "this observation is so far from the mean that it should barely exist," and in security terms means "someone is either exfiltrating data or has discovered that the office Wi-Fi reaches the café next door and has been backing up their personal NAS."
Reggie had implemented Z-score-based behavioural baselines across all user accounts approximately eight months ago, after an incident he referred to in documentation as "the Henderson Situation" and in conversation as "the reason I no longer trust anyone who describes themselves as a 'team player' in their LinkedIn bio." Henderson had been a contractor in the analytics division who had spent three months gradually increasing the volume of data he was copying to an external drive, never enough to trip a static threshold, always enough to constitute, in aggregate, roughly 40GB of proprietary trading data that subsequently appeared in the possession of a competitor. The investigation had been deeply satisfying in the way that locking the stable door after the horse has not just bolted but taken the car as well tends not to be.
The Z-score approach was the correction. Rather than applying a fixed threshold, a number pulled from industry guidance documents written by people who had clearly never operated a production SIEM, Reggie's system built a rolling 30-day behavioural baseline for each individual user. The anomaly threshold was calibrated to that person's normal behaviour. A data scientist who routinely transferred large files had a large baseline. A compliance officer whose daily data movement could be measured in kilobytes had a correspondingly modest baseline. The system asked a simple question: is what this person is doing right now consistent with what this person always does? When the answer was no, the Z-score moved. When it moved far enough, the pager went off.
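A minimal sketch of the per-user baseline idea, with invented names (`user`, `date`, and `bytes_out` are illustrative columns, not Morridian's actual schema, and the threshold of 3 is a common convention rather than Reggie's calibrated value):

```python
import pandas as pd

def zscore_alerts(df: pd.DataFrame, window: int = 30, threshold: float = 3.0) -> pd.DataFrame:
    """Flag days where a user's transfer volume deviates from that user's
    own rolling baseline. Expects one row per user per day with columns
    user, date, bytes_out. A sketch, not a production SIEM integration."""
    df = df.sort_values(["user", "date"]).copy()
    grouped = df.groupby("user")["bytes_out"]
    # Baseline excludes today (shift(1)) so an anomaly cannot inflate
    # its own mean and hide from itself
    mean = grouped.transform(lambda s: s.shift(1).rolling(window, min_periods=5).mean())
    std = grouped.transform(lambda s: s.shift(1).rolling(window, min_periods=5).std())
    df["zscore"] = (df["bytes_out"] - mean) / std
    return df[df["zscore"] > threshold]
```

The point of the per-group `transform` is exactly the property Reggie describes: a data scientist's large transfers raise their own mean and standard deviation, so the same raw volume that pages him for a compliance officer passes silently for them.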
For non-normal distributions, which in practice meant most of them, because security telemetry is almost always right-skewed due to the handful of automated processes and power users whose behaviour would look alarming if you forgot to account for them, Reggie used the Interquartile Range method instead. The IQR, being derived from the middle 50% of a distribution rather than the mean, did not care about the outliers at the top end distorting the picture. It was robust. Reggie appreciated robustness in the way that a person who has been badly let down by fragility comes to appreciate it.
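The fence itself is only a few lines. This sketch uses the conventional Tukey multiplier of 1.5, which is an assumption; the article does not say what multiplier Reggie settled on:

```python
import numpy as np

def iqr_outliers(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Tukey-style IQR fences: flag anything more than k * IQR outside
    the middle 50% of the distribution. Because the quartiles ignore the
    extremes, a handful of power users cannot stretch the fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]
```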
The trading division account turned out to be a developer who had been migrating a legacy application and had neglected to mention this to anyone. Reggie documented the incident, sent a terse email to the developer's manager, and added a note to the onboarding checklist that read "please inform Security before initiating large data migrations, or we will assume the worst about you, which is our default assumption anyway."
It was now approaching 1 AM and Reggie, fortified by a second visit to the kitchen on the third floor where someone had left a box of extremely good shortbread from a meeting that had presumably ended better than most of Morridian's meetings tended to, turned his attention to the machine learning pipeline he had spent the better part of six months building and the last two months trying to explain to management in terms they could process.
The fundamental challenge of applying machine learning to intrusion detection, and Reggie had explained this in three separate presentations, two written reports, and one memorably terse exchange in the lift with the CISO, is that the data is absurdly imbalanced. On a given day, Morridian's network generated somewhere in the region of 80 million flow records. Of these, in a good week, perhaps 200 were genuinely malicious. This is a ratio of approximately 0.00025%, which is the sort of number that makes supervised classification feel, if you are being honest with yourself, somewhat heroic.
The naive approach, and the approach that Reggie had found, upon reviewing the previous security team's work, had indeed been the approach taken, was to train a classifier on this data and evaluate it on accuracy. The previous classifier had achieved 99.97% accuracy. It had also achieved this by predicting "benign" for every single network flow it had ever seen, which meant it had detected precisely zero attacks while generating the kind of metrics that looked excellent in a quarterly board report. Reggie had written a one-page summary of this finding that began with the words "I have some news" and ended with a recommendation that the previous system be decommissioned immediately, which recommendation had been accepted after a period of institutional denial that he estimated at approximately three weeks.
His replacement system used SMOTE, Synthetic Minority Oversampling Technique, an acronym that even Reggie admitted sounded like something a 1990s network admin had invented while watching The Matrix, to generate synthetic minority class samples by interpolating between genuine attack records in feature space. Rather than duplicating existing malicious samples, SMOTE created plausible variations of them, giving the classifier a more varied and representative picture of what attacks looked like without simply memorising the specific attacks in the training data. Combined with cost-sensitive learning that penalised misclassification of attack samples more heavily than benign ones, and evaluated using the Matthews Correlation Coefficient and the Area Under the Precision-Recall Curve rather than accuracy, the system was considerably more honest about what it could and could not do.
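The interpolation at SMOTE's core fits in a dozen lines. This is a bare-bones sketch of the idea only; the article does not say which implementation Reggie ran, and production work would normally reach for a library such as imbalanced-learn:

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_synthetic: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate synthetic minority samples by interpolating between each
    chosen attack record and one of its k nearest minority-class
    neighbours. A sketch of the SMOTE idea, not a hardened implementation."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class only
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                          # pick a real attack record
        j = neighbours[i, rng.integers(neighbours.shape[1])]  # and one of its neighbours
        gap = rng.random()                                    # interpolate between the two
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a line segment between two genuine attack records, the classifier sees plausible variation in feature space rather than exact copies, which is the whole trick.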
It was, Reggie would tell you, not evaluated using ROC-AUC, and if you asked him why he would tell you, at length, with examples, in a tone that suggested he had given this explanation before and had not found the experience cathartic, that a classifier can achieve a ROC-AUC of 0.99 on a severely imbalanced dataset while having a precision of 0.03, meaning 97% of its alerts are false positives. He had a slide about this. He had shown the slide to people who made decisions. He was not sure the slide had helped, but it had made him feel better, which he had come to accept was sometimes the best available outcome.
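The arithmetic behind the slide is simple enough to reproduce (these numbers are illustrative, in the spirit of Reggie's slide rather than lifted from it):

```python
# 80 million benign flows, 200 attacks, and a classifier with perfect
# recall and a false positive rate of just 1% -- excellent ROC territory
benign, attacks = 80_000_000, 200
fpr, recall = 0.01, 1.0

false_alerts = benign * fpr           # 800,000 innocent flows flagged
true_alerts = attacks * recall        # all 200 attacks caught
precision = true_alerts / (true_alerts + false_alerts)
print(f"precision: {precision:.5f}")  # roughly 0.00025
```

A false positive rate that looks negligible on a ROC curve translates, at this base rate, into a precision where well over 99.9% of alerts are noise, which is why the precision-recall curve is the honest place to look.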
The ML pipeline had, over the past two months, flagged three genuine incidents that the rule-based system had missed entirely: a slow credential stuffing campaign distributed across twelve thousand accounts at one attempt per account per day, a DNS tunnelling exfiltration attempt so leisurely it had been running for eleven days before the classifier caught it, and one incident that Reggie had internally classified as "baffling" in which someone had been attempting to probe the internal network from a conference room on the fourth floor using equipment that turned out to belong to a penetration testing firm that had been booked, quietly and apparently without telling IT Security, by the Head of Risk. This last incident had prompted a meeting. The meeting had gone exactly as well as meetings about penetration testing firms discovered mid-operation tend to go, which is to say it had been very lively and someone had used the phrase "communication breakdown" at least four times.
At 2:30 AM, with the Sandra report filed, the developer incident documented, the ML pipeline logs reviewed, and the shortbread finished, Reggie put on his jacket and took the lift to the ground floor. Phil the night security guard looked up from his novel.
"Anything interesting?" Phil asked. Phil had worked nights long enough to understand that "interesting" in a security context almost never meant anything good.
"Sandra," Reggie said.
Phil looked perplexed. "Again?"
"Different Sandra. Well. Same Sandra, different month." He tucked his badge into his jacket. "Someone in trading moved half the network without telling anyone, which I've documented, and the ML pipeline caught a credentials thing that's been running for about a week, which I've also documented, and upstairs they're apparently looking at a vendor product someone saw at Infosecurity Europe, which I will document when it becomes my problem, which it will."
Phil considered this. "So. Normal Tuesday."
"It's Wednesday."
"Right."
He pushed through the revolving door into the cold London air, crossed the empty street, and disappeared into the dark in the direction of Clerkenwell, a raccoon with a pager and a working knowledge of chi-squared distributions and the unshakeable conviction that the numbers, if you were willing to look at them properly, would always tell you something true.
The city hummed around him, full of logs and transactions and network flows and behavioural traces, all of it generating data at a rate that no human team could meaningfully process, all of it quietly carrying the statistical fingerprints of what people were actually doing versus what they claimed to be doing. Reggie found this comforting in the specific way that only a nocturnal animal who has made peace with the fact that most security incidents happen while everyone else is asleep can find it comforting.
His pager went off again.
He sighed, fished a slightly damp samosa from a bin outside a shuttered curry house on St John Street, and turned back towards the office.
It was going to be one of those nights. It was always one of those nights. That, in the end, was the job.
Reggie's anomaly detection scripts are available on his GitHub (and also below the donate button), which has eleven stars, nine of which are from him testing the starring functionality. The SIEM is Splunk. The shortbread was from Waitrose. Phil finished the Grisham novel and has started on Lee Child. Sandra has been referred to HR.
Statistical fraud detection is free. Reggie's Aeron chair cost nothing because he found it in a skip on Bishopsgate. This article also cost you nothing, which means Sandra has already won. Donate to make sure she doesn't.
#!/usr/bin/env python3
# benford_audit.py
# Reggie T. Raccoon, Morridian Fins IT Security
# Last modified: 02:17 AM, because of course it was
#
# Runs nightly via cron. If this wakes you up, that's the point.
# Do not "optimise" this script. Do not "refactor" this script.
# Do not touch this script at all unless you are me.
# If you are Derek, especially do not touch this script.
# Derek no longer works here. This script is part of why.
import sys
import logging
import numpy as np
import pandas as pd
from scipy import stats
from datetime import datetime
# ----------------------------------------------------------------
# CONFIG
# If you are changing these values without telling me,
# I will find out. That is literally what this script does.
# ----------------------------------------------------------------
DB_PATH = "/var/morridian/finance/ap_database.csv"
LOG_PATH = "/var/log/reggie/benford_audit.log"
ALERT_THRESHOLD = 0.05 # p-value below this = someone has been creative
MIN_SAMPLE_SIZE = 1000 # below this, Benford's is unreliable.
# do not run this on twelve invoices and declare fraud.
# I am looking at you, Graham from Internal Audit.
# ----------------------------------------------------------------
# LOGGING
# Verbose because I want a paper trail.
# I always want a paper trail.
# The paper trail is the only thing that saved us during
# the Henderson Situation and I will not elaborate further.
# ----------------------------------------------------------------
logging.basicConfig(
    filename=LOG_PATH,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
log = logging.getLogger(__name__)
def benford_expected() -> np.ndarray:
    """
    Returns the Benford's Law expected frequencies for digits 1-9.
    Formula: P(d) = log10(1 + 1/d)
    This is not magic. This is not a coincidence.
    This is mathematics, and it will find you.
    """
    digits = np.arange(1, 10)
    return np.log10(1 + 1 / digits)
def extract_leading_digits(series: pd.Series) -> pd.Series:
    """
    Extracts the leading digit from each value in a numeric series.
    Strips sign, decimals, leading zeros, the lot.
    Will ignore zeroes, nulls, and anything that looks like
    someone has been creatively formatting their spreadsheet.
    Sandra.
    """
    series = series.dropna()
    series = series[series != 0]
    # Convert to absolute values because negative invoices are
    # a separate problem and also deeply suspicious on their own
    series = series.abs()
    # Strip leading zeros AND the decimal point, so 0.42 yields
    # a leading digit of 4 rather than vanishing entirely
    leading = series.astype(str).str.lstrip('0.').str[0]
    leading = leading[leading.str.isdigit()]
    leading = leading.astype(int)
    return leading[leading.between(1, 9)]
def run_benford_test(df: pd.DataFrame, column: str, entity: str) -> dict:
    """
    Runs chi-squared goodness-of-fit test against Benford's expected
    distribution for a given numeric column.
    Returns a results dict containing:
      - observed frequencies
      - expected frequencies
      - chi-squared statistic
      - p-value
      - a verdict, which is either 'CLEAN' or 'INVESTIGATE'
        and occasionally 'OH NO'
    Parameters:
      df     : the dataframe. Should be clean. Rarely is.
      column : the numeric column to test. Usually amounts.
      entity : name of the person/department being analysed.
               Used in logging. Used in the report.
               Used, ultimately, in the HR referral.
    """
    log.info(f"Running Benford analysis | entity={entity} | column={column}")
    values = extract_leading_digits(df[column])
    n = len(values)
    if n < MIN_SAMPLE_SIZE:
        log.warning(
            f"Insufficient sample size for {entity}: {n} records. "
            f"Benford's Law requires >={MIN_SAMPLE_SIZE}. Skipping. "
            f"This is statistics, not astrology."
        )
        return {"entity": entity, "verdict": "INSUFFICIENT_DATA", "n": n}
    # Observed frequencies per digit 1-9
    observed_counts = np.array([
        (values == d).sum() for d in range(1, 10)
    ])
    observed_freq = observed_counts / n
    # Expected frequencies per Benford's Law
    expected_freq = benford_expected()
    expected_counts = expected_freq * n
    # Chi-squared test
    # If p < threshold, the distribution is improbable under Benford's Law.
    # This does not mean fraud. It means: go and look.
    # It has, historically, meant fraud.
    chi2_stat, p_value = stats.chisquare(
        f_obs=observed_counts,
        f_exp=expected_counts
    )
    verdict = "CLEAN" if p_value >= ALERT_THRESHOLD else "INVESTIGATE"
    # Special case. You'll know it when you see it.
    if p_value < 0.001:
        verdict = "OH NO"
    result = {
        "entity": entity,
        "column": column,
        "n": n,
        "chi2_stat": round(chi2_stat, 4),
        "p_value": round(p_value, 6),
        "verdict": verdict,
        "observed_freq": dict(zip(range(1, 10), observed_freq.round(4))),
        "expected_freq": dict(zip(range(1, 10), expected_freq.round(4))),
        "timestamp": datetime.now().isoformat()
    }
    log.info(
        f"Benford result | entity={entity} | "
        f"chi2={chi2_stat:.4f} | p={p_value:.6f} | verdict={verdict}"
    )
    if verdict != "CLEAN":
        # observed_freq index 4 corresponds to digit 5.
        # Hardcoded in the log line because it is always digit 5.
        log.warning(
            f"ANOMALY DETECTED: {entity} | p={p_value:.6f} | "
            f"Digit-5 observed freq: {observed_freq[4]:.2%} "
            f"(Benford expected: 7.9%) | "
            f"Forwarding to pager. Someone is having a bad night. "
            f"It is not me."
        )
    return result
def load_expense_data(path: str) -> pd.DataFrame:
    """
    Loads accounts payable data from CSV.
    Yes it's a CSV. I know. I have raised this with management
    four times. The database migration is 'on the roadmap.'
    The roadmap was last updated in 2019.
    """
    try:
        df = pd.read_csv(path)
        log.info(f"Loaded {len(df)} records from {path}")
        return df
    except FileNotFoundError:
        log.critical(
            f"Database not found at {path}. "
            f"Either the path is wrong or something has gone "
            f"considerably more wrong than a missing file. "
            f"Check both possibilities with equal urgency."
        )
        sys.exit(1)
def main():
    log.info("=" * 60)
    log.info("Benford Audit — Morridian Fins IT Security")
    log.info(f"Run started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    log.info("=" * 60)
    df = load_expense_data(DB_PATH)
    # Group by submitter and run individual Benford tests
    # This is the part where we find out who has been creative.
    results = []
    for entity, group in df.groupby("submitted_by"):
        result = run_benford_test(
            df=group,
            column="amount_gbp",
            entity=entity
        )
        results.append(result)
    # Summary report
    flagged = [r for r in results if r.get("verdict") not in ("CLEAN", "INSUFFICIENT_DATA")]
    log.info(f"Audit complete. {len(results)} entities analysed. "
             f"{len(flagged)} flagged for investigation.")
    if flagged:
        log.warning("FLAGGED ENTITIES:")
        for r in flagged:
            log.warning(
                f"  {r['entity']} | p={r['p_value']} | "
                f"verdict={r['verdict']}"
            )
        # Page the on-call analyst.
        # That's me. I'm always the on-call analyst.
        # I did not fully think through the implications of
        # building this system when I built this system.
        trigger_pager_alert(flagged)
    else:
        log.info("Nothing flagged. Either everyone is behaving themselves "
                 "or someone has found a way to fool a chi-squared test. "
                 "Both outcomes are interesting. Only one is relaxing.")
def trigger_pager_alert(flagged_entities: list):
    """
    Sends alert to Reggie's pager.
    Yes, a pager. No, I will not be taking questions.
    """
    import subprocess
    names = ", ".join([e["entity"] for e in flagged_entities])
    message = (
        f"BENFORD ALERT: {len(flagged_entities)} entity/entities flagged: "
        f"{names}. Check logs. Do not panic. Do panic a little."
    )
    # sendpage is a custom wrapper around the office paging system
    # that I wrote in 2019 and that has worked perfectly ever since
    # which is more than I can say for the database migration
    subprocess.run(["sendpage", "-u", "reggie", "-m", message], check=True)
    log.info("Pager alert sent. Reggie is now awake. Reggie is always awake.")
if __name__ == "__main__":
    main()