Business Intelligence
Now Reading
Using Regex in Hadoop Hive queries

Using Regex in Hadoop Hive queries

by KieranApril 27, 2017

I was working on a query today – something which could be executed against the Hadoop cluster using Hive & visualised in Tableau. While writing the query, I found that a few of the string functions I’d usually use in SQL weren’t valid and created ‘unknown function’ errors. So, I started working through each of the areas for which I was receiving an error, until I had a working query. That query is below:

SELECT table1.dt, field2, table2.postcode as mgpcode, table3.postcode as lookuppcode, lat, long,
REGEXP_REPLACE(table2.postcode, '\\s+', '') as newpostcode,
REGEXP_REPLACE(table3.postcode, '\\s+', '') as normalizedpostcode
FROM DBName.Table1
LEFT JOIN DBName.Table2 ON table1.location = table2.location
LEFT JOIN DBName.table3 on REGEXP_REPLACE(table2.postcode, '\\s+', '') = REGEXP_REPLACE(table3.postcode, '\\s+', '')
WHERE table1.dt="20170419" AND table1.event_result="success" AND ((table1.field2="true" AND table1.field3 LIKE "existing.%") OR (table1.field2="Unknown"));

This query has three tables:

  • Table 1 has a location field
  • Table 2 has a location field & a postcode for that location
  • Table 3 has a list of postcodes & their respective latitude & longitude values

I’ll need to join all three tables, two of which will include the postal codes – in all sorts of different formats, like below:

  • xx0 0xx
  • xx00 0xx
  • xx00xx
  • xx000xx

We need the postcode in table 2 to match perfectly with table 3. The easiest way to do this, is to remove the spaces from both tables and to create two new fields based on the calculated values – these are the fields on which I want to join the two tables. So, here are my regex expressions to remove spaces from the postal code:

REGEXP_REPLACE(table2.postcode, '\\s+', '') as newpostcode,
REGEXP_REPLACE(table3.postcode, '\\s+', '') as normalizedpostcode

Now, I need to join on those two new fields (newpostcode and normalizedpostcode). To do this, I can’t just insert the new field name into the join condition – it doesn’t work. Instead, I need to join with the regular expression in the join condition:

LEFT JOIN DBName.table3 on REGEXP_REPLACE(table2.postcode, '\\s+', '') = REGEXP_REPLACE(table3.postcode, '\\s+', '')

Something worth noting is – if you have two fields with the same name in the select statement, derived from different tables, you’ll need to format them as ‘tablename.fieldname as xxxx’ and provide them with their own unique alias. Otherwise, you’ll receive errors about ambiguous field names.

What's your reaction?
Love It
0%
Interested
0%
Meh...
0%
What?
0%
Hate It
0%
Sad
0%
About The Author
Kieran

My name is Kieran, I love to see how technology can drive business growth. I started the Netshock technology blog as a place to share my thoughts and experiences with a wider audience. I cover all sorts of topics, from marketing to development.