While the term ‘big data’ is used a lot by small and large organizations alike, that doesn’t always mean they have a firm grasp of the technology or its benefits. As such, the ideal starting point for this post is to discuss the concept in a little more detail, ensuring we have a common understanding of the subject matter before we delve any further into the detail.
To quote SAS (source), “Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate [...]
Today, I was provided with a beta version of a data feed that would eventually be consumed by the Hadoop platform. As it hadn’t yet been configured to flow into the platform, there was no way to query the data to ensure we had all the raw data we’d eventually need to extract insights and run vital reports for business consumers.
The data I was provided was a feed from the test environment, in CSV format. For this data to be useful, I wanted to load it into the Hadoop cluster and run some queries, calculations & aggregations on it using Hue. To do this, I needed to create a new table and populate it with my shiny new data set. So, I created a new folder in my user [...]
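The general shape of that step can be sketched in HiveQL roughly as follows. Everything here is a placeholder, not the actual feed: the folder path, table name and columns are hypothetical, and the header-skip property assumes the CSV has a header row.

```sql
-- Sketch only: assumes the CSV has been uploaded to a folder under my
-- HDFS user directory, e.g. /user/me/feed_test/ (path is hypothetical)
CREATE EXTERNAL TABLE feed_test (
  dt        STRING,   -- illustrative columns, not the real feed schema
  postcode  STRING,
  lat       DOUBLE,
  long      DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/me/feed_test/'
TBLPROPERTIES ('skip.header.line.count'='1');
```

Because the table is EXTERNAL and points at the folder, Hive picks up the CSV files already sitting there – no separate LOAD DATA step needed – and the files stay put if the table is later dropped.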
When you’re querying 10,000 rows of data you can be sloppy. It doesn’t actually matter how inefficiently you write your queries, they’ll run in a reasonable amount of time and you’ll extract the insight you needed. That’s because 10,000 rows is tiny and you don’t need much compute power to get those numbers crunched.
However, when you start querying 1 billion rows, things get interesting. A ‘SELECT *’ is a big no-no when you’ve got 100 columns and 1 billion rows – you need to think smart. You need to really streamline your queries to ensure reasonable execution time & resource [...]
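To illustrate the difference (the table, columns and partition scheme here are hypothetical), compare:

```sql
-- Wasteful: scans every one of the 100 columns across all 1 billion rows
SELECT * FROM events;

-- Streamlined: only the columns you actually need, restricted to the
-- date range you care about (cheap if the table is partitioned by dt)
SELECT dt, postcode, lat, long
FROM events
WHERE dt BETWEEN '2015-01-01' AND '2015-01-31';
```

Projecting just the needed columns cuts the data read per row, and filtering on a partition column means Hive can skip whole chunks of the table rather than scanning everything.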
I was working on a query today – something which could be executed against the Hadoop cluster using Hive & visualised in Tableau. While writing the query, I found that a few of the string functions I’d usually use in SQL weren’t valid in Hive and produced ‘unknown function’ errors. So, I worked through each area that was throwing an error until I had a working query. That query is below:
```sql
SELECT
  table1.dt,
  field2,
  table2.postcode AS mgpcode,
  table3.postcode AS lookuppcode,
  lat,
  long,
  REGEXP_REPLACE(table2.postcode, '\\s+', '') AS newpostcode,
  REGEXP_REPLACE(table3.postcode, '\\s+', '') AS normalizedpostcode
FROM [...]
```
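For reference, here are a few of the T-SQL-to-Hive substitutions of the kind that trip people up; the exact set of built-ins available depends on your Hive version, and the postcode literals below are made up purely for illustration:

```sql
-- T-SQL habit                 Hive equivalent
SELECT
  concat('AB1', ' 2CD')                 AS joined,     -- instead of 'AB1' + ' 2CD'
  instr('AB1 2CD', ' ')                 AS space_pos,  -- instead of CHARINDEX(' ', 'AB1 2CD')
  length('AB1 2CD')                     AS pc_len,     -- instead of LEN('AB1 2CD')
  regexp_replace('AB1 2CD', '\\s+', '') AS no_spaces;  -- instead of REPLACE('AB1 2CD', ' ', '')
```

The last one is the pattern used in the query above: REGEXP_REPLACE with the `\\s+` regex strips all whitespace from the postcodes so the two tables can be joined on a normalised value.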