September 1, 2017

What is big data and Hadoop

While the term ‘big data’ is used a lot by small and large organizations alike, that doesn’t always mean they have a firm grasp of the concept, the technology, or its benefits. The ideal starting point for this post is therefore to discuss the concept in a little more detail, so that we have a common understanding of the subject matter before we delve any further.

To quote SAS (source), “Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate [...]

May 3, 2017

Creating a new table from file in Hadoop Hive

Today, I was provided with a beta version of a data feed that will be consumed by the Hadoop platform. As the feed hasn’t been configured to run into the platform yet, there was no way to query the data and check that we had all the raw data we’d eventually need to extract insights and run vital reports for business consumers.

The data I was provided was a feed from the test environment, in CSV format. For this data to be useful, I wanted to load it into the Hadoop cluster and run some queries, calculations & aggregations on it using Hue. To do this, I needed to create a new table and populate it with my shiny new data set. So, I created a new folder in my user [...]

May 1, 2017

Optimizing database queries

When you’re querying 10,000 rows of data you can be sloppy. It doesn’t actually matter how inefficiently you write your queries: they’ll run in a reasonable amount of time and you’ll extract the insight you needed. That’s because 10,000 rows is tiny, and you don’t need much compute power to get those numbers crunched.

However, when you start querying 1 billion rows, things start to get interesting. Your ‘Select *’ statement is a big no-no when you’ve got 100 columns and 1 billion rows – you need to think smart. You need to really streamline your queries to ensure reasonable execution time & resource [...]

April 27, 2017

Using Regex in Hadoop Hive queries

I was working on a query today – something which could be executed against the Hadoop cluster using Hive & visualised in Tableau. While writing the query, I found that a few of the string functions I’d usually use in SQL weren’t valid and created ‘unknown function’ errors. So, I started working through each of the areas for which I was receiving an error, until I had a working query. That query is below:

SELECT table1.dt,
       field2,
       table2.postcode AS mgpcode,
       table3.postcode AS lookuppcode,
       lat,
       long,
       REGEXP_REPLACE(table2.postcode, '\\s+', '') AS newpostcode,
       REGEXP_REPLACE(table3.postcode, '\\s+', '') AS normalizedpostcode
FROM [...]
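
To see what the two REGEXP_REPLACE calls do to a postcode, here is a minimal Python sketch of the same normalization. It is only an illustration, not part of the original query: the function name and the sample postcodes are made up. The Hive string literal '\\s+' is the regular expression \s+, which matches any run of whitespace, so replacing it with an empty string strips every space from the value.

```python
import re

# The regex \s+ matches one or more whitespace characters.
# In the Hive query it is written '\\s+' because the backslash
# has to be escaped inside the string literal.
WHITESPACE = re.compile(r"\s+")


def normalize_postcode(postcode):
    """Strip all whitespace from a postcode (hypothetical helper)."""
    if postcode is None:
        # Treat a missing value as missing rather than raising an error.
        return None
    return WHITESPACE.sub("", postcode)


# Sample values, invented for illustration only.
samples = ["SW1A 1AA", " M1  1AE ", "EC1A1BB"]
normalized = [normalize_postcode(p) for p in samples]
print(normalized)
```

Normalizing both postcode columns the same way means values such as "SW1A 1AA" and "SW1A1AA" compare as equal, which is presumably why the query applies the same pattern to both table2.postcode and table3.postcode.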
