By Elvis Bando
A few weeks ago, @chrisorwa showed me some datasets that he had been working on. I got an adrenaline kick just from looking at the data, mostly because it was challenging; then again, the prospect of cracking the data was even more motivating (or maybe a different motive altogether). In a previous blog post, Chris wrote about trash sourcing, basically extracting information from trash, which was the basis of his project #Saiclique. When he invited me to join him in the project, I realized that some of the data he had was in a format I could not manipulate (we use different core analysis software: I use Rapid Miner, he uses Weka), so we had to start the data entry process again. We started here
We ended up getting this from the cards (we did slightly over 1000 cards)
From this I generated this beautiful dataset
Looking at the sheet, you will note my initial hypothesis, and my final conclusion dataset.
I thought the serial must be a concatenation of 4 sets of 4 digits, so I split the data that way. Running a DBSCAN clustering algorithm in Rapid Miner gave me this
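For readers who want to try the split themselves, here is a minimal Python sketch of the 4-4-4-x idea. It is not the DBSCAN run from Rapid Miner; it swaps in a plain frequency count per chunk position, which is enough to spot a dominant value. The function names and the sample serials are made up for illustration and are not from the real dataset.

```python
from collections import Counter

def split_serial(serial, widths=(4, 4, 4)):
    """Cut a serial string into fixed-width chunks; whatever is left is the last chunk."""
    chunks, pos = [], 0
    for w in widths:
        chunks.append(serial[pos:pos + w])
        pos += w
    chunks.append(serial[pos:])
    return chunks

def chunk_frequencies(serials, widths=(4, 4, 4)):
    """Count how often each value appears at each chunk position."""
    counters = [Counter() for _ in range(len(widths) + 1)]
    for s in serials:
        for counter, chunk in zip(counters, split_serial(s, widths)):
            counter[chunk] += 1
    return counters

# Hypothetical 17-digit serials, invented for this example only.
sample = ["01110526010000123", "02110526020004567", "03110527010000089"]
freqs = chunk_frequencies(sample)
print(freqs[1].most_common(1))  # most frequent value in the second chunk
```

Changing `widths` lets you compare other configurations with the same two functions.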
The tall column is 0526, which was the second set in my 4-4-4-x data split. No other configuration showed such strength. But what does 0526 even mean? Just to confirm that 0526 meant something, I ran a frequency analysis of each digit in the entire serial and, using Benford’s Law (the leading digit of a number is 1 about 30% of the time), I narrowed the cluster to a 2-6-3-x configuration.
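A rough Python sketch of that kind of check follows. Benford’s Law gives the expected share of leading digit d as log10(1 + 1/d), which is about 30.1% for d = 1. The per-position frequency table is my own simplified stand-in for the frequency analysis, and the sample serials are invented.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def position_digit_frequencies(serials):
    """Relative frequency of each digit at each position of the serial."""
    length = min(len(s) for s in serials)
    table = []
    for i in range(length):
        counts = Counter(s[i] for s in serials)
        total = sum(counts.values())
        table.append({digit: n / total for digit, n in counts.items()})
    return table

# Hypothetical serials, invented for this example only.
sample = ["01110526010000123", "02110526020004567", "03110527010000089"]
table = position_digit_frequencies(sample)
expected = benford_expected()
print(round(expected[1], 3))  # share of leading 1s predicted by Benford's Law
print(table[0])               # observed digit shares at the first position
```

Positions whose digit shares are far from flat, or that barely vary at all, are candidates for field boundaries such as a date or a batch number.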
Not to bore you with my train of thought: after numerous other rounds of modelling and analysis, the data finally spoke. Here is the transcript:
Safaricom's serializing system seems to be similar to, or based on, descriptions of a patented system found at http://www.freepatentsonline.com/5504808.html
If true, then the card serial number contains information about the card: the date and time it was produced and a unique identifier. The rest of the information is called up from the system by the code (the called info could be the amount of talk time, the expiry date, etc.). The serial is therefore the only unique identifier of a particular card and shows whether or not it has been used.
An analysis of cards produced in 2010 and earlier indicates that they were sequential for the most part. The initial two digits were 10 throughout the year, probably indicating the year of production. The remaining digits were sequential. The system was probably changed because they would have run out of state space. At that time, the serial was a 13-digit number, as opposed to the current 17 digits.
The Safaricom prepaid card serial number is organized into four parts, in this order: Batch#, ManDate, Time and Serial (the 2-6-3-x configuration).
The batch number is a two-digit number running from 01 to 99. It splits the cards produced each hour into batches of approximately 10,000 cards. This ensures that they are easily identifiable in case there is theft or a problem.
ManDate is the date of production of the cards. It is written in the format yy-mm-dd. It is exactly 2 years before the expiry date.
Time is the approximate hour in which the cards were produced. It runs from 000 to 220 in increments of 10 (so we have 010, 020, … 100, 110, …). In each hour, 99 batches of cards are produced (see Batch#).
Finally, there is the serial part, which is a sequential number. The data I have may be inconclusive, but it shows that in each hour (Time), about 1 million cards are serialized. All cards are serialized the same way, so there is no telling the value of a card from the serial (damn!).
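To make the layout concrete, here is a small Python sketch that pulls a 17-digit serial apart along the 2-6-3-x split. Treat it as my reading of the data, not Safaricom's specification. The assumptions are mine: the century is 2000, the time code divided by 10 gives the hour, and the remaining 6 digits are the sequential part. The class, the function names and the sample serial are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CardSerial:
    batch: int          # Batch#, 01-99
    manufactured: date  # ManDate, yy-mm-dd
    hour: int           # Time code divided by 10 (assumption)
    sequence: str       # remaining sequential digits

def parse_serial(serial):
    """Split a 17-digit serial as 2-6-3-x (batch, date, time, sequence)."""
    if len(serial) != 17 or not (serial.isascii() and serial.isdigit()):
        raise ValueError("expected a 17-digit serial")
    yy, mm, dd = int(serial[2:4]), int(serial[4:6]), int(serial[6:8])
    return CardSerial(
        batch=int(serial[0:2]),
        manufactured=date(2000 + yy, mm, dd),  # assumes years 2000-2099
        hour=int(serial[8:11]) // 10,
        sequence=serial[11:],
    )

def expiry_date(manufactured):
    """Expiry is taken as exactly 2 years after the production date."""
    try:
        return manufactured.replace(year=manufactured.year + 2)
    except ValueError:
        # 29 February has no counterpart two years later; fall back to the 28th.
        return manufactured.replace(year=manufactured.year + 2, day=28)

card = parse_serial("01110526010000123")  # invented serial, not a real card
print(card)
print(expiry_date(card.manufactured))
```

As a sanity check on the numbers above, 99 batches of roughly 10,000 cards is about 990,000 cards, which matches the figure of about 1 million cards per hour.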
The dataset could possibly hold more information. The current analysis is limited by missing variables such as the location where the cards were collected and the date of collection. With these, the data could give a good picture of economic indicators, customer spending and possibly spending by zone or region.
The writer is the team leader, Doban Africa Ltd.