Analytics and the cloud: NoSQL databases


This is the third in a series of blogs on analytics and the cloud. Read our introduction to the series, Analytics and the cloud: A perfect match. This blog discusses the rise of NoSQL databases:

  • What they are
  • What they do (the types of NoSQL databases available)
  • How they can act as a valuable new source of data for analytic engines

Although NoSQL database technology has been around for a long time (before SQL actually), not until the advent of Web 2.0, when companies such as Google and Amazon began using the technology, did NoSQL’s popularity really take off. Market Research Media forecasts NoSQL Market to be $3.4 Billion by 2020. This is being driven by a compound annual growth rate (CAGR) of 21 percent in the period 2015-2020. NoSQL applications are being seen in more and more industries, driven by their ease of implementation (mainly through having schema less models) and their rapid deployment. However this can have drawbacks which I’ll discuss below.

NoSQL databases are often found where solutions require close interaction between people or ‘things’, these solutions require information from mobile, social, sensors and big data to be pulled together to deliver apps and new products. This concept was termed ‘systems of engagement’ and has been accredited to Geoffrey Moore an English Professor who became a sales and marketing executive. This type of data often simply doesn’t fit into traditional relational database management systems (RDBMS) yet it is the data that gives insight into how best to serve individuals in their daily lives. Understanding this flow of data about where a person is at any given time, who and what they interact with, what they like and dislike, their mood and so on offers real competitive advantage to companies who can tune into this and respond accordingly. NoSQL databases are best suited to handle the necessary requirements of these high-performance, highly-scalable, big data workloads

What are NoSQL Databases?

As the name implies a NoSQL (or non-relational database) is designed to support different application requirements to traditional relational database management systems (RDBMS). RDBMS’s have supported transactions based on ACID (Atomicity, Consistent, Isolation, Durability) for many years, NoSQL databases support the CAP (Consistency, Availability and Partition-tolerance) theorem, which gives rise to a number of significant features that are markedly different to relational databases and are described below:

Scalability

NoSQL solutions use scale out rather than scale up which the majority of RDBMS solutions provide (Massively Parallel Processing architectures being an exception to this rule). NoSQL solutions scale by adding additional servers or nodes to increase capacity.

Sharding

Data can be horizontally partitioned (for example, by row) and rows stored across multiple servers. Large tables can be split across servers in this way allowing queries to be parallelised, improving performance. This does raise issues around how to deal with small tables and queries that don’t have obvious ways to slice the data, but these can be handled often by replicating certain data sets to eliminate problems.

Eventually consistent data

Data is replicated across nodes such that it may be possible that a query may return different data from two nodes at different times until the replication catches up with itself. This is one of the basic tenants behind the CAP theorem. Most NoSQL databases can inform the user that such an error has taken place and the data simply be reread at the appropriate time.

High Availability

Driven by CAP theorem, NoSQL databases support high availability. They should tolerate failure of any component, a single instance, server rack or even an entire data centre. The goal is to have zero downtime.

Note that we always assume we are working with multiple partitions; without multiple partitions we don’t have a distributed system. This is important because the CAP theorem states we can have consistency and partition tolerance or availability and partition tolerance. Consistency with availability is simply not considered as we always assume partitions are involved which means we have to cater for network latencies (consistency) and down times for nodes (availability).

Schemaless

NoSQL databases are often referred to as schemaless in that there is no enforcement of a structure when writing to the database, new items can be added to the database at any point in any structure (say a new key value pair in a document). However, things like indices and partitioning of data will have been considered which could be classed as part of a schema for the data, and field names need to be consistent (say postcode always referred to as such) or accessing data becomes very difficult.

This approach brings great flexibility to adding new data, but does mean that things like referential integrity needs to be managed by the application code which can be time consuming to develop and / or difficult to maintain.

Analytics and the cloud: NoSQL databasesWhat they can do and be used for

There are four primary styles of NoSQL databases.

1. Key Value pairs

Simple as a key value with data assigned to the value which can be more or less anything. An example is shown in the table below:

Key

Value

MusicName:1

Wishbone Ash

MusicGenre:1

Rock

MusicName:2

Johnny Cash

MusicGenre:2

Country

In theory, the key could be anything, but database limitations and performance issues means keeping the key length to some sensible and well understood terms is good practice. The value can be anything, text, numbers, HTML, image and so on.

This forms of database are often used for operational datastore requirements so for example to support the creation of a shopping cart content on a website, product reviews, blogs and so forth.

Riak and Redis Server are examples of this style of database.

2. Column stores

Similar to Key Value pair, we now have many columns but using metadata to describe content of each column. These databases are good for managing sparse datasets a common finding in Big Data solutions. An example is given below:

Row Key

Personal Data

Company Data

Employee_id

LastName

Firstname

Age

Salary

Grade

1

Fisher

King

49

 

8

2

Coote

Fred

 

35000

7

3

Lockwos

 

57

80000

9

So row based database would store each row as

1:Fisher, King, 49, ,8 ,

2:Coote, Fred, , 35000, 7

3:Lockwos, , 57,80000, 9

whereas a column driven database would store data as

1:Fisher, 2:Coote, 3:Lockwos

1:King, 2: Fred, 3:

1:49, 2: , 3:57

1: , 2:35000, 3:80000

1:8, 2:7, 3:9

Structuring data in this manner can help with analytic workloads (OLAP style queries) where not all the data associated with a row is needed. In addition, columns families (Personal Data and company Data in the table above) can be used to group columns together. New columns are very easy to add and become part of a column family if required. Note that empty cells are not stored.

Cassandra and HBase are examples of this style of database.

3. Document stores

These forms of NoSQL databases hold the entire detail for a record within a document. They are generally held in XML or Json format and it’s very easy to extend the document and add new data as and when needed. This is one of the primary advantages of schema less data stores. So with document stores any two documents could hold different structure and data types. For example, within a relational data store with ‘person’ details if firstname and lastname were required it would be a column in the schema even if not actually entered in the records themselves. With a document store, one document could hold just lastname and another firstname and lastname with no issues.

One difference between a document and key value pair that’s useful is with a key value pair you can only return the whole record and then work with that. With a document store you can query the document and just return elements from within it, which can help with performance as less data is returned to the client.

Content management systems are a good example of what these data stores can support because of their capability in storing all forms of data types and query flexibility.

Cloudant and MongoDB are examples of this style of database.

4. Graph

Is different to the other data stores in style and concept. This mode stores data as a collection of nodes and vertices. Each node relates to an entity (‘thing’), and vertices describe linkages between nodes (eg: Father of, owned by, part of etc). Once again there is no predefined schema and new nodes and edges can be added very easily to the model.

Each node and vertex can have properties assigned to it such as age, colour, or part number, and is often referred to as a ‘property graph’. The storage engine can be varied for graph style structures and range from traditional relational databases through to purpose built graph database engines. Open source graph database’s such as Titan are capable of scaling to billions of vertices and edges. It’s worth noting that the Titan database uses a variety of different databases for the back end storage these are Apache Cassandra, Apache Hbase or Oracle BerkeleyDB

Graph style databases are gaining in popularity as they are capable of dealing with very complex queries that span multiple nodes

Titan, Janus Graph and Neo4j are examples of this style of database.

There are some definitions that refer to multi model databases in this category of NoSQL, these are amalgams of the basic 4 models described above.

So how can these databases all extend the power of analytics?

Using NoSQL database for analytics

Mobile app

Consider a mobile device application. Where traditional relational databases had maybe at most tens of thousands of users, a critical or well-promoted mobile app could easily have many million. Such solutions require characteristics that can go beyond what relational databases were envisaged to do. Internet-scale web and mobile applications have requirements that traditional relational databases were not planned for such as nonstop availability, huge scalability, high volume writes, schema less models to enable change rapidly and continuous cycle of delivery to make changes to app on a daily (or less) basis. These requirements are a fit for the NoSQL databases we have discussed. Many of today’s mobile applications have a NoSQL engine behind them to store customer data and enable massive scale (billions of customers), high availability (always on) and high transactional performance.

The cloud plays an important role here. The service levels required (availability, performance and capacity for instance) all point towards using the NoSQL database as a service on a cloud platform. This not only meets the non-functional requirements as described but also ensures that as capacity or workloads vary over time a billing model that supports this can be put in place. So, if transactions rapidly increase, new capacity can easily be added and billed just for the period required. In addition, having resources on the cloud allows for a rapid deployment model where changes can be rolled out in an agile manner, constantly bringing new features and services to clients as demanded.

Once the data has been captured by a NoSQL data store what are the options for analysing the data? Beyond the obvious operational uses the app was first designed for there are a massive number of data points we can make use of. Location services, orientation of device, data held within the app itself, time when actions taken, contact lists, phone calls, SMS messages, social data and so on. With proper access, all these could potentially be used with other traditional data sources and open data sources, such as that from weather channels or government services, to build up a much more comprehensive picture of an individual’s requirements at any given point in time and location. Again, the use of these stores for analytical purposes can also suit a cloud model. For example, burst capacity when intensive analytical workloads are required for short periods of time can more easily be catered for than in traditional environments. The cloud is also now the host for many of the other data sets that are mentioned above. It’s often easier to merge data in the cloud rather than try and move all the required data in house.

This can form a complete view of a customer’s profile at any given point, which can help with identifying their particular preferences and offer new, useful services as they require them.

There are now several options for analyzing the data:

  1. Build code directly to access the NoSQL stores to generate reports
  2. Tightly integrate with another database engine that is designed for analytical use
  3. Use an analytics engine in-memory processing, such as IBM Spark

Currently the two most realistic solutions are:

  • to replicate the database to a massively parallel processing (MPP) style database engine for subsequent processing and BI style analytics, or
  • use an in-memory analytical engine (Spark) to run a variety of sophisticated analytical techniques on the data. Both solutions require that data is moved from the NoSQL store to the engine in use.

Running BI-style tools against NoSQL engines isn’t commonly done (as of writing this blog), primarily due to the complexity of building the same sort of queries that SQL and BI tools such as IBM Cognos can readily and easily build today. However, examples of solutions using Hadoop with Hive and tools such as Apache Hawq and Apache MADlib have started to surface which allows BI tools and Machine Learning against file systems such as HDFS or object storage to become more common. For example, IBM Cognos can access HDFS data using Hive and can potentially access Spark data using Hive extensions for Spark. The lines are becoming blurred and a variety of tools are available.

Conclusions

NoSQL plays a very important role in any new analytics landscape. It should be considered for a multiplicity of use cases and plays a role in any Big Data/Data Lake reference architecture. It often requires a RDBMS style engine to drive the actual analytics engine, but the NoSQL engine can be used for rapid and simple capture of high velocity data that is subject to changes over time.

IBM’s Bluemix Platform as a Service (PaaS) plays a critical role here, having access to services such as mongodb, Cloudant, db2 and dashDB and much more. This allows developers to quickly develop and build solutions that can be stress tested for production. Analytics using real time engines and reporting capabilities in Bluemix could allow swift development of analytical insights. Using technologies such as Spark we can access and shift data, deeply analyse the data sets made available and utilize a variety of pre-supplied libraries with Python or Scala. Rich visualization of the data is available that helps to build new hypotheses and discover new relationships in the data that is very compelling.

Finally, the rapid acceptance of graph databases also allows the captured data to be visualized in innovative ways and explore non-obvious relationships between entities that are highly valued when it comes to use cases such as fraud, cyber, network analysis, logistics and so forth.

To learn more about how IBM can help you derive value from your data, visit our Analytics Platform page.

Follow @IBMBigData

This entry was posted in Big Data. Bookmark the permalink.