In [18]:
import pandas as pd
path = 'travel.csv'
df = pd.read_csv(path, delimiter=',', header='infer')
df
Out[18]:
userid country date hour
0 82718927392 UK 20101025 0
1 82718927392 UK 20101025 1
2 82718927392 UK 20101025 1
3 82718927392 UK 20101025 1
4 82718927392 Spain 20101025 2
5 82718927392 Spain 20101025 2
6 82718927392 Spain 20101025 2
7 82718927392 Spain 20101025 3
8 82718927392 Portugal 20101025 4
9 82718927392 Portugal 20101025 5
10 99 UK 20101025 1
11 99 UK 20101025 1
12 99 China 20101025 15
13 99 China 20101025 16
14 99 China 20101025 16
15 99 Singapore 20101025 19
16 99 Singapore 20101025 20
17 99 Singapore 20101025 22
18 99 Singapore 20101025 23

Next, we group by userid and country & select the minimum hour. This will be the first hour that activity was seen in that country for that user.

In [19]:
df2 = df.groupby(['userid', 'country'], sort=False)[['hour']].min()
In [20]:
df2.head()
Out[20]:
hour
userid country
82718927392 UK 0
Spain 2
Portugal 4
99 UK 1
China 15

Next, we need to use the rank function to work out the order in which the users visited the countries.

  • df2 with the new column 'hop' is equal to:
  • df2 grouped at level 0 (which is the column upon which you want to group - in this case, I want the rank calculated, grouped by user. So, userid is (column 0), is the groupby level.)
  • Hour is ranked (using the dense rank method) which is where all scores are ranked. If two scores that are the same are found, they will be assigned the same rank (i.e. if 2 countries are found in the same hour, they will be ranked as the same hop.)
  • The output should be an integer
In [21]:
df2['hop'] = df2.groupby(level=0).hour.rank(method='dense').astype(int)
In [22]:
df2.head()
Out[22]:
hour hop
userid country
82718927392 UK 0 1
Spain 2 2
Portugal 4 3
99 UK 1 1
China 15 2

In the below, we use the .shift() function. The shift function shifts the index by a desired number. So, in the below, we're looking at df2 row 0 hour which is 0. We then shift +1, to look at df2 row 1 & we compare the difference between the two.

In [23]:
df2['time_in_loc'] = df2['hour'] - df2['hour'].shift(1)
In [24]:
df2
Out[24]:
hour hop time_in_loc
userid country
82718927392 UK 0 1 NaN
Spain 2 2 2.0
Portugal 4 3 2.0
99 UK 1 1 -3.0
China 15 2 14.0
Singapore 19 3 4.0

As you can see in the above, the issue is that this calculates each row, rather than each row for each user. So the first location for the second user is calculating versus the last hour of the first user. Not ideal - we'll handle this in the follow up section.

In [ ]: