This will be the first in a series of articles. The overall “big” picture is for me to be able to create, host and run a fully end-to-end ML solution in the cloud, built from scratch, using only infrastructure as code. Meaning nothing is done via the front end / GUI; everything is done programmatically. Why? Well, to learn to become an OK-ish Machine Learning Engineer.
That is the end goal. However, everything worth doing is worth overdoing, so I decided to start from scratch and build and train my own ML model that will eventually be hosted in said end-to-end ML solution. There is one more step before you can train your own model, though: getting some data to train it on. So, one can just download a clean dataset from Kaggle and start training, right? Well, in keeping with overdoing everything, I decided to create my own dataset. And this is what this article is all about.
Since I work in the banking industry, I figured I might do something that is in line with what I do for a living. I decided to eventually train an anomaly detection model, specifically on transactional data found in bank statements. Now finding a lot of real bank statements, both legitimate ones and ones containing anomalies, is a bit harder than one might think. Something to do with protecting identity and a thing called POPI. Just for the record, I’m very glad these are hard to come by, as I would not want just anyone to have access to my private banking transactions. So, what to do? Well, I decided to simply create my own bank statements from scratch and invent a new (totally made up) bank while I’m at it. Both to simulate “real” or normal transactions and anomalous ones, well, to the best of my (and GPT’s) abilities that is.
It seems this is not as hard as I initially thought. Keep in mind my first attempt was just to ask GPT to generate totally realistic PDF bank statements that I could use to train an anomaly detection model on. This did not go as planned, but it was a good start. It seems that with the PIL and Faker Python libraries one can get rather far. That along with some other tools, that is.
First I had to generate some data that would eventually be placed on a template that could then in turn be converted to a PDF. As I mentioned, even asking GPT to straight out generate random banking transactions was a bit over its head. I did try a couple more times to get it to generate more “realistic” statements, but seeing that I also wanted statements that look like the ones found in South Africa, I decided to get a bit more hands-on.
First I decided I needed some existing examples of at least the descriptions one would normally find in a bank statement. I downloaded my own bank statements from all the banks I bank with in CSV format, sorted them into income and expenses, and then created two lists, one with only income descriptions and one with only expense descriptions. I then fed the examples to GPT and asked it to generate more similar descriptions.
I ended up with two new lists, one for income and one for expenses, each with about 500 descriptions. I had to manually play some piano on my keyboard to replace any real account numbers that came from my own bank statements.
Next I also had to get random names and addresses. Again, note that all the names are made up, and if one should correspond to you or someone you know, it’s purely coincidence. I started off by simply asking GPT to generate random names, surnames and addresses. First I asked GPT4 and it very politely refused, replying that this would not be ethical, even after I explained my reason for asking. However, it turns out HuggingChat has no issue doing this. (OK, while writing this article I tried asking GPT3 the same question, and it seems its moral values are a bit lower than GPT4’s, as it also had no issue generating the data for me.)
After some extensive prompting, reminding, guiding, and swearing at HuggingChat, I eventually ended up with a list of about 500 made-up names and addresses. Because all LLMs are trained on actual data, some of the generated names and addresses might correspond to real people and their actual addresses, so I took the liberty of scrambling the addresses in Excel before saving the final dataset. Just to make sure it does not for some reason produce the actual address of Mr John Doe who lives at 123 Apple Lane.
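I did the scrambling in Excel, but the same idea can be sketched in Python. This is only one possible way to scramble, not what the spreadsheet did: it assumes every address has the same number of comma-separated parts and shuffles each part (street, suburb, city, code) independently. The function name and the sample addresses are made up for illustration.

```python
import random


def scramble_addresses(addresses, rng):
    """Shuffle each comma-separated address part independently.

    Assumes all addresses have the same number of parts; the parts are
    then recombined so that no original address is guaranteed to survive.
    """
    parts = [[piece.strip() for piece in address.split(",")] for address in addresses]
    columns = [list(column) for column in zip(*parts)]
    for column in columns:
        rng.shuffle(column)
    return [", ".join(row) for row in zip(*columns)]


# Made-up example addresses, purely for illustration.
addresses = [
    "1 Oak Road, Brakpan, Johannesburg, 1540",
    "22 Pine Street, Ermelo, Mpumalanga, 2350",
    "7 Elm Avenue, Parow, Cape Town, 7500",
]
for line in scramble_addresses(addresses, random.Random(7)):
    print(line)
```

Passing in a seeded `random.Random` keeps the scramble repeatable, which helps if the dataset ever has to be regenerated.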
Now that I had a set of more realistic details, I started to put it all together. I decided to create two main scripts, one to generate real or normal transactions and one to try and generate simulated anomaly transactions.
Below is the script that GPT and I wrote to generate a list of transactions for a “normal” bank statement. One can set a few variables, like the number of transactions, in this case a range between 50 and 200. One can also set the split between income and expenses as a percentage. In this case I made it 20/80, with 20 percent being income, to more realistically simulate an actual bank statement. The script pulls from the lists I generated earlier to randomly pick a name, address, account number, and transaction descriptions. The amounts are randomly selected. I’m sure I could have spent more time fine-tuning this. For example, buying airtime for R9000 might not be normal, but I figured it was good enough. I did however customize the salary, as it would not make sense to get more than one salary per month. I also set this to a range, as getting a salary of R10.50 would not make sense. And I hard-coded it to only show on the 25th of every month. The balance is calculated from the random amounts after the transactions have been sorted by date. Lastly, it writes all the data out to a CSV file named with the current date and time so that multiple files can be generated.
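For readers who cannot copy from a screenshot, here is a minimal standard-library sketch of the logic just described. It is not the original script: the short description lists, the amount ranges, the opening balance and all function names are assumptions standing in for the real values and the 500-entry lists.

```python
import calendar
import csv
import io
import random
from datetime import date, datetime

# Hypothetical stand-ins for the two ~500-entry description lists.
INCOME_DESCRIPTIONS = ["Payment Received", "Interest Received", "Eft Credit: Refund"]
EXPENSE_DESCRIPTIONS = ["ATM Withdrawal", "Card Purchase", "Eft Debit Order", "Airtime Purchase"]


def generate_transactions(rng, year, month, income_share=0.2,
                          min_count=50, max_count=200, opening_balance=50000.0):
    """Return rows of [date, description, money_in, money_out, balance]."""
    days_in_month = calendar.monthrange(year, month)[1]
    rows = []
    for _ in range(rng.randint(min_count, max_count)):
        day = date(year, month, rng.randint(1, days_in_month))
        amount = round(rng.uniform(10, 9000), 2)  # assumed amount range
        if rng.random() < income_share:
            rows.append([day, rng.choice(INCOME_DESCRIPTIONS), amount, 0])
        else:
            rows.append([day, rng.choice(EXPENSE_DESCRIPTIONS), 0, amount])

    # Exactly one salary per month, within an assumed range, on the 25th.
    salary = round(rng.uniform(15000, 60000), 2)
    rows.append([date(year, month, 25), "Salary", salary, 0])

    # Sort by date first, then derive the running balance.
    rows.sort(key=lambda row: row[0])
    balance = opening_balance
    for row in rows:
        balance = round(balance + row[2] - row[3], 2)
        row.append(balance)
    return rows


def write_rows(rows, stream):
    """Write the transaction rows as CSV to any text stream."""
    csv.writer(stream).writerows(rows)


def timestamped_name(prefix="statement"):
    """Build a file name that includes the current date and time."""
    return f"{prefix}{datetime.now():%Y%m%d%H%M%S}.csv"


rows = generate_transactions(random.Random(42), 2023, 5)
buffer = io.StringIO()
write_rows(rows, buffer)
print(len(rows), "transactions, closing balance", rows[-1][4])
```

The sketch writes to an in-memory buffer; to get a file per run, open `timestamped_name()` for writing and pass that file object to `write_rows` instead.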
All this would then result in a CSV file looking something like this:
name,address,from_date,to_date,printed_date,account_number
Deidre Yeo,"46 Lilac Terrace, Brakpan, Johannesburg, 3044",2023-05-01,2023-05-31,2023-05-14,5686396521
2023-05-01,ATM Withdrawal in Tokelau,0,492.74,37718.92
2023-05-01,Card Purchase & Cashback Boxer Superstore Ermelo (Card 1234),0,264.08,77200.47
2023-05-01,Online Purchase: Agri Online Supplies (Card 6789),0,671.54,66338.03
2023-05-01,Eft Debit Order (1082102938): Discovery Health (DH1092830238),0,169.44,60211.01
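Note that this file mixes two record types: a header row plus one row of account details, followed by transaction rows with a different column count. A rough sketch of how such a file could be read back with Python's `csv` module follows. The transaction rows have no header in the file, so the field names `money_in` and `money_out` are my assumption, based on the withdrawal amounts sitting in the fourth column.

```python
import csv
import io

SAMPLE = """name,address,from_date,to_date,printed_date,account_number
Deidre Yeo,"46 Lilac Terrace, Brakpan, Johannesburg, 3044",2023-05-01,2023-05-31,2023-05-14,5686396521
2023-05-01,ATM Withdrawal in Tokelau,0,492.74,37718.92
2023-05-01,Card Purchase & Cashback Boxer Superstore Ermelo (Card 1234),0,264.08,77200.47
"""


def read_statement(stream):
    """Split a statement CSV into account details and transaction rows."""
    reader = csv.reader(stream)
    header = next(reader)                     # first row: account field names
    account = dict(zip(header, next(reader)))  # second row: account values
    transactions = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        day, description, money_in, money_out, balance = row
        transactions.append({
            "date": day,
            "description": description,
            "money_in": float(money_in),    # assumed column meaning
            "money_out": float(money_out),  # assumed column meaning
            "balance": float(balance),
        })
    return account, transactions


account, transactions = read_statement(io.StringIO(SAMPLE))
print(account["name"], "has", len(transactions), "transactions")
```

Because the address is quoted in the file, the `csv` module keeps its internal commas together as a single field.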
This was the “easy” part. The fun part came when I wanted to start generating “anomaly” bank statements, or at least anomalous transactions within a made-up bank statement. So, how would one know if a statement contains an anomaly or not? After doing some online research and speaking to a few people who know much more than I do about the subject, I narrowed it down to a few viable options I could use to generate anomaly bank statements. Haha.. Just thinking, if you read this sentence out of context, I think there might be some interesting conversations in my future.
Some of the possible anomalous behavior might include:
- Repeated transactions for the same amount: Fraudsters often test a card with small transactions before moving on to larger ones. So, you might see a series of transactions for the exact same amount.
- Round number transactions: Legitimate transactions often have odd amounts due to tax, while anomalous transactions are often for round amounts (e.g., $100.00, $200.00, etc.).
- Large transactions: A sudden large transaction, particularly if it’s inconsistent with the account’s typical activity, can be a red flag.
- Multiple transactions within a short timeframe: This could indicate that a fraudster is attempting to withdraw or transfer as much money as possible before the activity is detected.
- Account Dormancy: A long period of account inactivity, followed by a flurry of transactions, can be a sign of a dormant account that has been taken over by a fraudster.
- Multiple ATM Withdrawals: Multiple ATM withdrawals in a short period of time or ATM withdrawals that are unusually large could indicate ATM skimming or a stolen card.
- Transactions in Multiple Locations: If transactions happen in multiple cities or countries within a time frame in which it is impossible for a person to travel between them, it might be an indication of an anomaly.
- Frequent Transactions with Same Merchant: Multiple transactions with the same merchant in a very short time period can be a sign of card testing.
- Inconsistent End Balance: In a tampered bank statement, the final balance might not tally with the initial balance plus deposits minus withdrawals. While this might not be a direct indicator of an anomaly, it might indicate that the bank statement has been tampered with.
- High Number of Chargebacks or Returned Items: Multiple chargebacks or returned items could indicate a case of identity theft or anomalous transactions.
- Missing information: If you notice that any important information is missing from the statement, such as your account number or the name of your bank, it could be a sign of an anomaly.
- Transactions that you didn’t authorize: If you see any transactions that you didn’t authorize, it’s important to contact your bank immediately.
Keep in mind, as was pointed out to me more than once, most of these do not automatically imply that anything anomalous is taking place. They just increase the likelihood that some of these types of transactions might include some degree of illegal activity.
I then took some of the above behaviors and incorporated them into a modified version of the script that produced the “legitimate” bank statements. The second script follows the same basic outline as the first one, namely importing randomly selected data from the three datasets to get the names, addresses and transaction descriptions.
From there I generated a random number from 1 to 9 that corresponds to the type of anomaly to be added into the bank statement.
This is a short list describing what each of the functions does:
- Generate 15 of the same type of transaction, with the same small amount being deducted, within a short time period.
- Round all amounts off to the nearest 10, 100 or 1000.
- Append an abnormally large amount that falls outside of the norm to the bank statement, in this case 1 million.
- Have multiple transactions occur on one day while there are none for the rest of the month.
- Increase the number of ATM transactions to above the usual norm.
- Have ATM withdrawals from around the world in a short period of time.
- Create a large number of transactions all from a single merchant.
- Alter the balance to be inconsistent, resulting in an imbalance between the transaction amounts and the bank balance.
- Change the last amount on the bank statement by adding a random large number.
- Remove the account information from the generated dataset.
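To give a feel for how such functions can be wired together, here is a hedged sketch of three of them plus a balance tamperer and a random dispatcher. It is a simplified stand-in rather than my actual script: the row layout `[date, description, money_in, money_out]`, the function names, the descriptions and the amount ranges are all assumptions, and only three anomaly types are registered.

```python
import random
from datetime import date


def add_repeated_small_debits(rows, rng, repeats=15):
    """Card-testing pattern: the same small debit repeated on one day."""
    day = rng.choice(rows)[0]
    amount = round(rng.uniform(1, 20), 2)  # assumed range for a small amount
    extra = [[day, "Card Purchase: Online Store", 0, amount] for _ in range(repeats)]
    return sorted(rows + extra, key=lambda row: row[0])


def _round_to(value, step):
    """Round to the nearest multiple of step, keeping non-zero amounts non-zero."""
    if value == 0:
        return 0
    return max(step, step * round(value / step))


def round_all_amounts(rows, rng):
    """Round every amount to the nearest 10, 100 or 1000."""
    step = rng.choice([10, 100, 1000])
    return [[d, desc, _round_to(i, step), _round_to(o, step)] for d, desc, i, o in rows]


def add_large_transaction(rows, rng, amount=1_000_000):
    """Append one transaction far outside the norm (booked as a debit here)."""
    day = rng.choice(rows)[0]
    return sorted(rows + [[day, "Eft Payment", 0, amount]], key=lambda row: row[0])


def with_balances(rows, opening_balance):
    """Return copies of the rows with a running balance appended."""
    balance, out = opening_balance, []
    for d, desc, money_in, money_out in rows:
        balance = round(balance + money_in - money_out, 2)
        out.append([d, desc, money_in, money_out, balance])
    return out


def tamper_final_balance(balanced_rows, rng):
    """Add a random large number to the last balance so it no longer tallies."""
    tampered = [list(row) for row in balanced_rows]
    tampered[-1][4] = round(tampered[-1][4] + rng.randint(10_000, 100_000), 2)
    return tampered


INJECTORS = {1: add_repeated_small_debits, 2: round_all_amounts, 3: add_large_transaction}


def inject_random_anomaly(rows, rng):
    """Pick an anomaly type by random number and apply it."""
    kind = rng.randint(1, len(INJECTORS))
    return kind, INJECTORS[kind](rows, rng)


base = [[date(2023, 5, 1), "ATM Withdrawal", 0, 123.45],
        [date(2023, 5, 2), "Payment Received", 2500.0, 0],
        [date(2023, 5, 3), "Card Purchase", 0, 4.2]]
kind, changed = inject_random_anomaly(base, random.Random(0))
print("anomaly type", kind, "produced", len(changed), "rows")
```

Keeping each anomaly in its own function that takes rows and returns new rows makes it easy to add more types to the dictionary, and returning the chosen number alongside the rows gives a ready-made label for training later.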
Now that the two types of datasets have been generated, it’s time for even more fun: converting the CSV files into nice-looking, totally real bank statements.
First off, I needed to create a template into which I could import the generated data. I downloaded some sample bank statement templates from the internet at first, however they were not “realistic” enough. Perhaps it’s because these were based on US banks and I am just used to seeing statements from the banks I bank with.
So I decided to use the two banks I bank with as inspiration to come up with the below template in A4 JPG format. The name and logo were generated by AI (GPT and Midjourney) and edited by me. I just played piano on the keyboard to come up with details like the tax number and address. As for the QR code, I will leave that as a surprise, but I did write a whole separate script just to generate it. For the fine print at the bottom I asked GPT to come up with the most funny and nonsensical banking terms and conditions it could. I also figured that should any of these obviously not real statements get into the wild, people can call the support number at the bottom of every page. Then I can personally ask them what they plan to do with the statements. I also decided to add the word “SAMPLE” just in case someone might accidentally take this for a real bank statement.
Below is the code to generate the QR code should anybody want to use it:
Now that I had the template and the CSV transaction logs, both “real” and “anomaly”, it was time to put it all together. Basically the script reads the information from the CSV file and then puts each section at a specified X and Y coordinate. This took the longest time, as I had to generate countless PDFs, each time making small adjustments until I was happy with where everything went. The script creates a new page if there are more than 30 transactions and increments the page numbers accordingly. The CSV file has zeros where there is nothing in a column, but I decided to ignore these when importing the data into the PDF. There are also some minor text adjustments. For example, the address in the CSV is just one field, so I had to split it further by removing spaces and adding a new line after each comma. That way the address is displayed with each part underneath the other instead of on one line.
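The drawing itself depends on the template and its coordinates, but the three text and layout tweaks mentioned above are easy to sketch on their own. The helpers below are illustrative only (the names are mine, not from the original script): splitting the address onto separate lines, blanking out zero amounts, and cutting the transactions into pages of 30 with page numbers.

```python
def split_address(address):
    """Put each comma-separated part of the address on its own line."""
    return "\n".join(part.strip() for part in address.split(","))


def display_amount(value):
    """Return an empty string for zero, otherwise the amount with two decimals."""
    number = float(value)
    return "" if number == 0 else f"{number:.2f}"


def paginate(rows, per_page=30):
    """Yield (page_number, total_pages, rows_on_page) for each page."""
    pages = [rows[i:i + per_page] for i in range(0, len(rows), per_page)] or [[]]
    total = len(pages)
    return [(number, total, page) for number, page in enumerate(pages, start=1)]


print(split_address("46 Lilac Terrace, Brakpan, Johannesburg, 3044"))
for number, total, page in paginate(list(range(65))):
    print(f"Page {number} of {total}: {len(page)} transactions")
```

Each page returned by `paginate` can then be drawn onto a fresh copy of the template, with the page number and total placed wherever the template expects them.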
And the final result looks something like the one shown below, again with the “SAMPLE” text added as a last step, just to be safe. And as a prize for reading all the way to this part, I’ve added a link where you can download a full-size sample here -> PDFoutput20230514121108
So the last step now was simply to click “run” and generate as many sample bank statements as I want. In the next exciting episode of Jaco ML news I will try to train multiple models on this dataset to see if at least one can tell the “real” bank statements from the “anomaly” ones.
So tune in sometime soon and find out!
Lastly, you might ask why I did not post this code on my GitHub page. Well, with a few changes this code can be used for not-so-scientific purposes. On the other hand, the whole project took me about 10 hours in total, on and off over the last couple of weeks. And this is using GPT as my co-programmer. I’m sure others who can actually do decent programming can do the same in not even half the time. But still, if someone wants the code I might be convinced to share it if the cause is legitimate and fully above board. Else, you are welcome to rewrite everything from the screenshots of the code.