Databases
Most Recent
 
September 1, 2017

What is big data and Hadoop

While the term ‘big data’ is used a lot by small and large organizations alike, that doesn’t always mean they’ve got a firm grasp of the concept or its benefits. As such, the ideal starting point for this post is to discuss the concept in a little more detail, so that we have a common understanding of the subject matter before we delve any further.

To quote SAS (source), “Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate [...]

May 3, 2017

Creating a new table from file in Hadoop Hive

Today, I was provided with a beta version of a data feed that will be consumed by the Hadoop platform. As the feed hadn’t yet been configured to load into the platform, there was no way to query the data to check that we had all the raw data we’d eventually need to extract insights and run vital reports for business consumers.

The data I was provided was a feed from the test environment and was in CSV format. For this data to be useful, I wanted to load it into the Hadoop cluster and run some queries, calculations & aggregations on it using Hue. To do this, I needed to create a new table and populate it with my shiny new data set. So, I created a new folder in my user [...]

May 1, 2017

Optimizing database queries

When you’re querying 10,000 rows of data you can be sloppy. It doesn’t actually matter how inefficiently you write your queries; they’ll run in a reasonable amount of time and you’ll extract the insight you need. That’s because 10,000 rows is tiny and you don’t need much compute power to get those numbers crunched.

However, when you start querying 1 billion rows, things start to get interesting. Your ‘Select *’ statement is a big no-no when you’ve got 100 columns and 1 billion rows – you need to think smart. You need to really streamline your queries to ensure reasonable execution time & resource [...]
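As a rough illustration of the ‘Select *’ point, here is a minimal Python sketch that uses the standard library's sqlite3 module as a stand-in for a real cluster. The `sales` table, its columns and its three rows are hypothetical and exist only for the example. It shows the difference in what comes back, not a benchmark.

```python
import sqlite3

# In-memory database standing in for a real data platform (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, region TEXT, amount REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "north", 120.0, "x" * 1000),
        (2, "south", 80.0, "y" * 1000),
        (3, "north", 50.0, "z" * 1000),
    ],
)

# Sloppy: every column of every row is returned, including the wide
# 'notes' column that the report never uses.
all_rows = conn.execute("SELECT * FROM sales").fetchall()

# Leaner: name only the columns you need, filter early, and let the
# database do the aggregation.
region_totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE region = ? GROUP BY region",
    ("north",),
).fetchall()

print(len(all_rows[0]), "columns returned by SELECT *")
print(region_totals)
```

On three rows the difference is invisible. The habit of selecting only the columns and rows you need is what pays off as the table grows.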

April 27, 2017

Using Regex in Hadoop Hive queries

I was working on a query today – something which could be executed against the Hadoop cluster using Hive & visualised in Tableau. While writing the query, I found that a few of the string functions I’d usually use in SQL weren’t valid and created ‘unknown function’ errors. So, I started working through each of the areas for which I was receiving an error, until I had a working query. That query is below:

SELECT table1.dt, field2, table2.postcode as mgpcode, table3.postcode as lookuppcode, lat, long, REGEXP_REPLACE(table2.postcode, '\\s+', '') as newpostcode, REGEXP_REPLACE(table3.postcode, '\\s+', '') as normalizedpostcode FROM [...]
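The two REGEXP_REPLACE calls in the query above strip whitespace from the postcodes so that values such as 'SW1A 1AA' and 'SW1A1AA' compare as equal. In a Hive string literal the backslash itself has to be escaped, which is why the pattern is written as '\\s+' rather than '\s+'. The same normalization can be sketched in Python for checking expected values outside the cluster. The function names below are hypothetical and are not part of the original query.

```python
import re

# Equivalent of the Hive pattern '\\s+': one or more whitespace characters.
_WHITESPACE = re.compile(r"\s+")


def normalize_postcode(postcode):
    # Mirrors REGEXP_REPLACE(postcode, '\\s+', ''): remove all whitespace.
    # None is passed through, much as a NULL input yields NULL in Hive.
    if postcode is None:
        return None
    return _WHITESPACE.sub("", postcode)


def postcodes_match(left, right):
    # Compare two postcodes after normalization; a missing value never matches.
    if left is None or right is None:
        return False
    return normalize_postcode(left) == normalize_postcode(right)


print(normalize_postcode("SW1A 1AA"))
print(postcodes_match("SW1A 1AA", "SW1A1AA"))
```

This only removes whitespace, as the query does. It does not change case or validate the postcode format.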
