Thursday, May 31, 2012

HBase Secondary Index: Scanning Using Multiple Keys

In HBase, one can design the row key to hold the primary key and then scan using partial or full row keys, as explained in previous blogs. But what if we need another dimension to search on? That is, we may need to do a lookup based on some other field which is not part of the row key of our primary table.

To achieve this, we need to define a second table whose row key contains this second dimension and which stores the value of the primary row key. Let's look at the example from my previous blog, where we store user sessions under a row key of the form:

  userIdBytes+seperatorByte+dateStringByte+seperatorByte+sessionIdBytes

Here seperatorByte should be chosen so that it cannot occur among the bytes of userId or sessionId. For example, use LF (decimal 10).

Let's call the above key primaryKey.
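As a rough illustration, the key layout above can be sketched in a few lines of Python. This is only a sketch under assumptions: the function name make_primary_key, the UTF-8 encoding and the sample values are made up for the example and are not part of any HBase API.

```python
# Separator byte: LF (decimal 10), as suggested above.
SEPARATOR = b"\n"

def make_primary_key(user_id: str, date: str, session_id: str) -> bytes:
    """Build userIdBytes + separator + dateStringBytes + separator + sessionIdBytes.

    Hypothetical helper for illustration; UTF-8 encoding is an assumption.
    """
    parts = [p.encode("utf-8") for p in (user_id, date, session_id)]
    for part in parts:
        # The separator must never appear inside a key component,
        # otherwise the key can no longer be split unambiguously.
        if SEPARATOR in part:
            raise ValueError("key component contains the separator byte")
    return SEPARATOR.join(parts)

# Sample values, invented for the example.
primary_key = make_primary_key("user42", "2012-05-31", "sess-001")
```

Rejecting components that contain the separator is one simple way to enforce the "no conflict" rule; escaping the byte or using fixed-width fields would be alternatives.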

Note that with the above row key we can store columns and values in one or more column families for each user session row. For example, we may want to store the session startTime and endTime as two values in two columns of one column family.

Now let's say that we want to get all user sessions in a given geographical area (get all users in the US, or get all users in the UK). How do we store this information? To do this, we can define a second table whose row key is the geographical area code followed by the primary row key (i.e. userId, date and sessionId):

geoAreaCodeBytes+seperatorByte+primaryKey
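The index row key can be sketched the same way. Again this is only an illustration: make_index_key, the UTF-8 encoding and the sample values are assumptions, and the primary key is written out literally so the snippet stands on its own.

```python
# Separator byte: LF (decimal 10), the same one used in the primary key.
SEPARATOR = b"\n"

def make_index_key(geo_area_code: str, primary_key: bytes) -> bytes:
    """Build geoAreaCodeBytes + separator + primaryKey.

    Hypothetical helper for illustration; not part of any HBase API.
    """
    geo = geo_area_code.encode("utf-8")
    if SEPARATOR in geo:
        raise ValueError("geo area code contains the separator byte")
    return geo + SEPARATOR + primary_key

# A primary key in the userId + date + sessionId layout (sample values).
primary_key = b"user42\n2012-05-31\nsess-001"
index_key = make_index_key("US", primary_key)
```

Because the area code comes first, all index rows for one area sort next to each other, which is what makes a scan by area possible.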

Now we can search for user sessions by geographical area with a partial row key scan, similar to what I explained in my previous blog. Also, for each row returned, the primaryKey can be extracted from the index row key itself and used to look up the additional information that we store for the user session in the primary table.
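The two-step lookup can be simulated without a cluster. The sketch below stands in for HBase with plain Python structures: a sorted list of index row keys plays the role of the second table, and a dict plays the role of the primary table. All names and sample rows are invented for the example; in a real deployment the prefix scan would be an HBase scan bounded by the area code prefix.

```python
from bisect import bisect_left

SEPARATOR = b"\n"  # LF (decimal 10)

# Stand-in for the primary table: primaryKey -> columns (sample data).
primary_table = {
    b"alice\n2012-05-30\ns1": {"startTime": "09:00", "endTime": "09:30"},
    b"bob\n2012-05-31\ns2": {"startTime": "10:00", "endTime": "10:05"},
    b"carol\n2012-05-31\ns3": {"startTime": "11:00", "endTime": "11:45"},
    b"dave\n2012-05-31\ns4": {"startTime": "12:00", "endTime": "12:10"},
}

# Stand-in for the index table: row keys kept in sorted order, as HBase does.
index_keys = sorted([
    b"US\nalice\n2012-05-30\ns1",
    b"UK\nbob\n2012-05-31\ns2",
    b"US\ncarol\n2012-05-31\ns3",
    b"USA\ndave\n2012-05-31\ns4",  # different code that merely starts with "US"
])

def scan_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix, in order (simulated prefix scan)."""
    matches = []
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches

def sessions_in_area(geo_area_code):
    """Scan the index by area, then fetch each session from the primary table."""
    prefix = geo_area_code.encode("utf-8") + SEPARATOR
    rows = []
    for index_key in scan_prefix(index_keys, prefix):
        primary_key = index_key[len(prefix):]  # strip area code and separator
        rows.append((primary_key, primary_table[primary_key]))
    return rows

us_sessions = sessions_in_area("US")
```

Including the separator in the scan prefix matters here: scanning for "US" followed by the separator does not pick up the row stored under the hypothetical code "USA".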