Kusto Detective Agency — Election fraud (Part 2 of 5)
While browsing twitter, I came across Kusto Detective Agency — a gamified way of learning Kusto Query Language (KQL).
There are a set of five challenges that participants are required to solve using their KQL skills and earn badges. Challenges are in the form of cases that need to be investigated by a detective agency (the storyline is awesome :P)
To start, participants are required to create a free Kusto cluster and separate data ingestion scripts are provided on a case to case basis.
Challenges
Election Fraud
— — — — — — — —
This challenge features an election scenario between 4 candidates — Kastor the Elephant, Gaul the Octopus, William (Willie) the Tortoise and Poppy the Goldfish.
- It turns out that Poppy the Goldfish is the ex-mayor’s fish and obtained 51.7% of all votes.
- However polls indicated a close battle between Kastor and Gaul.
- We have been asked to investigate if voting fraud really happened?
Data provided to us includes: IP, anonymized id, vote, date-time — and the function used for counting the votes.
We are provided with an ingestion script, let’s run the same.
We have a pretty self-explanatory schema for the Votes table. Let’s checkout the vote function…
There maybe a possibility that a particular voting booth (IP) was hijacked and large no of votes added. Hence we will check individual number of votes per candidate per IP using an aggregation.
Votes
| summarize Poppy_votes = countif(vote=="Poppy"),
Kastor_votes = countif(vote=="Kastor"),
Gaul_votes = countif(vote=="Gaul"),
Willie_votes = countif(vote=="Willie")
by via_ip
Everything seems normal except that for Poppy we have a few IP’s that have exceptionally low count of votes as compared to the general distribution of votes per IP. (we will explore this later)
Since voter_hash_id is unique for every voter. Let’s also check that if more than one vote was cast by the same voter.
We find 9 voters that cast more than one vote. Although the number is very small to sway election results, let’s make a list of these voter id’s and then look them up on the original Voters table to check who they have voted to.
let doubleVotes = (Votes
| summarize Count=count(vote) by voter_hash_id
| where Count > 1
| summarize double_votes = make_list(voter_hash_id));
Votes
| where voter_hash_id has_any (doubleVotes)
| order by voter_hash_id
Looks like an uniform distribution. Maybe its by design 🤔 At this point I decided to take a hint.
Seems like our previous hunch was right. We may want to take a deeper look into those IP’s where Poppy had an abnormally low vote count. Before diving into the raw data I wanted to look at a few visualizations first.
Votes
| summarize Poppy_votes = countif(vote=="Poppy"),
Kastor_votes = countif(vote=="Kastor"),
Gaul_votes = countif(vote=="Gaul"),
Willie_votes = countif(vote=="Willie")
by via_ip
| order by via_ip
| render timechart
Indeed something is wrong with those set of IP’s. One way to find out will be sorting Poppy’s vote counts in the ascending order and looking for IP’s with lower votes.
Votes
| summarize Poppy_votes = countif(vote=="Poppy"),
by via_ip
| order by Poppy_votes asc
We can see a sudden increase in number of votes past a particular IP. Let’s create an aggregation of votes for a particular IP (in the higher vote count class) over 30 min intervals.
Votes
| where via_ip == "10.168.5.189"
| summarize Poppy_votes = countif(vote=="Poppy")
by bin(Timestamp,30m)
And things get suspicious now — 183 votes in a 30 min window. Let’s check the vote distribution over each sec.
Votes
| where via_ip == "10.168.5.189"
| where Timestamp > datetime("2022-10-01T10:00:00") and Timestamp < datetime("2022-10-01T10:30:00")
| summarize Poppy_votes = countif(vote=="Poppy")
by bin(Timestamp,1s)
| order by Timestamp asc
Yep! There cannot possibly be 33 or 47 votes cast in a second. Clearly the voting has been tampered with. Looking into other IPs (in the higher vote count class) gives us similar results.
A probable way to mitigate this tampering and find out the correct distribution of votes maybe that we consolidate all votes cast in 1 sec towards Poppy as one vote. Let’s try this out…
Votes
| where vote == "Poppy"
| distinct hour=datetime_part("hour",Timestamp),
minute=datetime_part("minute",Timestamp),
second=datetime_part("second", Timestamp),
via_ip
The resulting table (T1) that we get neutralizes the extra votes that Poppy gets by allowing only 1 vote per 1 sec. We will create another table (T2) from the Votes table by filtering out all votes for Poppy. We will finally create a union of both of them to find out the correct vote count.
let poppyVotes = (Votes
| where vote == "Poppy"
| distinct hour=datetime_part("hour",Timestamp),
minute=datetime_part("minute",Timestamp),
second=datetime_part("second", Timestamp),
via_ip);
let corrected_votes = (Votes
| where vote != "Poppy"
| union (poppyVotes));
corrected_votes
| summarize Count=count() by vote
| as hint.materialized=true T
| extend Total = toscalar(T | summarize sum(Count))
| project vote, Percentage = round(Count*100.0 / Total, 1), Count
| order by Count
Even this answer was not accepted on the portal. We looked at average no of votes cast per minute across the entire Votes table and decided to changed our correction criteria from 1 vote per sec to 1 vote per minute and the answer was accepted 😄
PS: I started learning KQL recently, feel free to reach out to me and correct me, incase you spot some inconsistencies I would highly appreciate that. And finally thank you very much for taking your time to read this post.