Friday, January 31, 2025
HomeAnalyticsThe Value is in the Data (Wrangling) — Darkhorse Analytics

The Value is in the Data (Wrangling) — Darkhorse Analytics



Next, start learning about your data sources. Where did it come from? How was it encoded? (By human or machine?) How is it stored? Are there discontinuities? Maybe they converted from an AS/400 at some point and everything thereafter has more decimals or more values. Once you’ve documented and understood your sources, you have reached the second stage: data sloughs (note, you have more than one).

Now comes the fun part: cleaning up your data. This means getting rid of the duplicates, dealing with the blanks, and fixing data type problems. It might mean harmonizing dates like 02/03/04 (Canada) and 02/04/03 (US), time zones, or even daylight savings time. Count the records in each of your tables and make sure the totals make sense. Maybe your extract only took the last year instead of the last five. Your goal here is to fix the obvious issues in each of the data elements. This is just cursory cleaning, but you now have reached stage three: data ponds.

After cleaning, it’s on to organizing your data. For most analyses (and to help with your cleaning) this means a single table. One. One table to rule them all.

Start by unwinding your cross-tabs so that each data element is only in a single column. Set up keys and then join the tables together. You might have to use dates or lat-longs or fuzzy join on names or addresses. You might end up with a really wide table with repeating elements. That’s fine. Space is cheap.

Make sure each variable (column) is properly described and understood. Is rec_date the date the order was received or the date the record was created? You might even want to build a data dictionary.

This is boring work, but once you’re done it, you’ve reached level four. Your ponds are now consolidated into a single data pond.

Now it’s time to create new data. What? Create new data? You heard me right. Very rarely will your dataset contain all the variables you want. The real gold is when you combine existing fields to form new ones. Here are some examples to prime your thinking:

  • If you have drive time and distance, calculate the average speed.

  • If you have revenue and quantity, calculate the average price.

  • If you have yearly revenues, create a percentage change in revenue

  • If you have age specific population by year, subtract this year’s eighteen-year-old population from last year’s seventeen-year-olds to get a net migration.

  • If you have donation data where extreme values dominate, put them into log space.

You get the picture. You are value-adding like a boss here.

Pay special attention to categorical data – it can be incredibly valuable. It often makes sense to create new categories out of continuous variables (high, medium, or low), or to consolidate existing ones. Sometimes, you need to re-categorize the past to match up to the current category definitions. Other times, you need to group something up from 90 categories to five super-categories.

Let your original business question be your guide, but don’t be afraid to venture outside of the specific problem. Follow your intuition. Sometimes you’ll identify trends or errors in these new data elements that weren’t obvious in the raw data.

Weeks will pass.

Finally, you have your dataset assembled. You’ve cleaned up the obvious issues and you have a pretty good idea of what you have. Can you start modelling now?

No, not even close. You’re only at level five. You’ve got yourself a little data lake, but its waters are brackish.

It’s time to start digging into the data content. This will perhaps be the longest stage, but it accomplishes two things: it ensures each variable (including your new ones) is internally consistent and it ensures that your relationships are logical.

Proceed visually. Summary statistics are not your friend – they may in fact lead you astray.

Internal consistency is achieved through histograms and the like. Start by plotting a frequency distribution of each variable, one at a time. Look closely at these for gaps, peaks, or outliers. Sometimes you’ll get a normal or lognormal distribution, sometimes it will be uniform. Ask yourself what you should expect to see before you peek.

Let’s say you want to plot net migration by age. What do you expect to see? Uniform? Perhaps some age-related shifts around university?



RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments

Skip to toolbar