By David Jonker, Sr Director, SAP Big Data Product Marketing, Technology & Innovation Platform

Big Data offers analysts and data scientists the opportunity to build more sophisticated and more accurate predictive models than before, but without the right data environment, it’s not easy. It requires an in-memory architecture that supports thousands of columns and billions of rows and a predictive analytics tool that can harness that architecture, such as SAP BusinessObjects Predictive Analytics.

Twentieth-century technology is insufficient. Blame it on the disk. Back in the 1980s, database engineers saw a world where memory was extremely expensive: just one terabyte of RAM cost over $100 million. So vendors built database architectures centered on the disk. Today, that same terabyte costs less than $5,000.

In a Big Data world, the disk is simply too slow. Consider this: reading 1 petabyte of data off a disk sequentially – i.e. no seeking, just end-to-end straight off the disk – takes 58 days using the fastest hard disk available today (according to the Tom’s Hardware website). SSD definitely speeds things up: two days with the fastest SSD RAID. It’ll cost millions to buy, though.
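
To make these figures concrete, the short sketch below works backward from the read times quoted above to the sustained throughput they imply. It assumes a decimal petabyte (10^15 bytes); the throughput values are derived from the 58-day and two-day figures here, not taken from any benchmark, and the function names are purely illustrative.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes, assuming a decimal petabyte


def implied_throughput(total_bytes, days):
    """Sustained bytes per second needed to read total_bytes in the given days."""
    return total_bytes / (days * SECONDS_PER_DAY)


def days_to_read(total_bytes, bytes_per_second):
    """Days needed to read total_bytes sequentially at a fixed throughput."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# Throughput implied by the two read times quoted in the text.
hdd_rate = implied_throughput(PETABYTE, 58)  # single hard disk
ssd_rate = implied_throughput(PETABYTE, 2)   # SSD RAID

print(f"Hard disk: about {hdd_rate / 1e6:.0f} MB/s")
print(f"SSD RAID:  about {ssd_rate / 1e9:.1f} GB/s")
```

Under these assumptions, the hard disk figure corresponds to roughly 200 MB/s and the SSD RAID figure to roughly 5.8 GB/s. Even a thirtyfold speedup still leaves a full scan measured in days.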

In many ways, Big Data is a real-time data access problem. That’s precisely why innovators are developing new ways to store and process data, all in an effort to get around the hard disk bottleneck. All of the approaches, in essence, minimize the bottleneck in order to improve response time.

Distributed Computing

Distributed computing spreads a lot of data across many disks that can all be read simultaneously. Hadoop builds on this concept, but opens up the platform to handle any data set with any arbitrarily designed algorithm. To overcome the disk, the open-source community built Apache Spark, a distributed data processing engine that, unlike disk-based Hadoop MapReduce, works in memory across commodity hardware.
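
The idea can be illustrated with a minimal sketch: split the data into partitions, let each worker scan its own partition at the same time, and combine the small partial results. This is not Hadoop or Spark code. Threads in a single process stand in for separate machines and disks, and the function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Deal the values out round-robin, one slice per (hypothetical) node."""
    return [values[i::n_partitions] for i in range(n_partitions)]


def local_sum(partition):
    """Work done next to the data: each node scans only its own partition."""
    return sum(partition)


def distributed_sum(values, n_partitions=4):
    """Scan all partitions concurrently, then combine the partial results."""
    partitions = split_into_partitions(values, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(local_sum, partitions))
    # Only the small partial sums travel back, not the raw data.
    return sum(partials)


total = distributed_sum(list(range(1_000_000)))
print(total)
```

The point of the design is that each worker reads only a fraction of the data, and only the partial results need to be moved and merged.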

Columnar Databases

Like distributed databases and Hadoop, columnar databases optimize the data storage architecture to reduce the amount of data read off any one disk. They do this by storing the values of each attribute, or column, together. The assumption is that most analytical queries use only a subset of columns, so the database should read only the data for those columns. Columnar databases also compress the data heavily, further reducing the number of bits read off disk.
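
A small sketch can show both ideas: reading only the column a query needs, and compressing a column whose values repeat. The sample records and function names are invented for illustration, and run-length encoding is just one simple compression scheme, not necessarily the one any particular database uses.

```python
# Row-oriented layout: every record carries every attribute.
rows = [
    {"customer": "A", "region": "EU", "revenue": 120.0},
    {"customer": "B", "region": "EU", "revenue": 80.0},
    {"customer": "C", "region": "US", "revenue": 200.0},
    {"customer": "D", "region": "US", "revenue": 50.0},
]


def to_columns(records):
    """Pivot the records so that each attribute's values are stored together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An analytical query such as "total revenue" touches one column only.
total_revenue = sum(columns["revenue"])

# Repeated values within a column compress well.
compressed_region = run_length_encode(columns["region"])

print(total_revenue)
print(compressed_region)
```

In the row layout, the same query would have to read every customer and region value along the way, even though it needs neither.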

In-Memory Databases

In-memory databases take it to a whole new level by removing the disk from the equation altogether. They leverage the power of today’s processors to read and analyze data at a raw speed that’s 1,000 to 10,000 times faster than reading data off the disk. In some cases, customers have experienced performance gains of 100,000 times. How?

–   Compress the data with in-memory columnar data stores

–   Move the data accessed most often into L1 caches on the chip
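
As a rough illustration of the first point, the sketch below dictionary-encodes a column: each distinct value is stored once, and the column itself becomes a compact array of small integer codes that can be scanned without touching the original strings. This is a generic technique shown under simplifying assumptions, not a description of how SAP HANA is implemented, and the names are hypothetical.

```python
from array import array


def dictionary_encode(values):
    """Store each distinct value once and replace the column with integer codes."""
    dictionary = []      # code -> original value
    index = {}           # original value -> code
    codes = array("I")   # compact, contiguous array of unsigned integers
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def count_equal(dictionary, codes, target):
    """Count matching rows by comparing integer codes, not strings."""
    if target not in dictionary:
        return 0
    code = dictionary.index(target)
    return sum(1 for c in codes if c == code)


regions = ["EU", "US", "EU", "APAC", "EU", "US"]
dictionary, codes = dictionary_encode(regions)

print(dictionary)
print(list(codes))
print(count_equal(dictionary, codes, "EU"))
```

Small, contiguous codes like these take up far less space than the raw values, which is what makes it plausible to keep frequently scanned columns in memory and, for the hottest data, in processor caches.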

That’s why we are so bullish about in-memory and the SAP HANA platform for Big Data. That’s not to say disk solutions don’t have a role to play. But at the core, you want an in-memory system that can run algorithms where your data is. Moving the data to the algorithms doesn’t work in a Big Data world. Instead, move the core algorithms into the data system.

SAP BusinessObjects Predictive Analytics

SAP BusinessObjects Predictive Analytics is the right tool for business analysts and data scientists to build predictive models from Big Data. First and foremost, it can analyze data inside SAP HANA and Apache Spark. There’s no need to transfer data out of these environments for processing. Rather, the SAP BusinessObjects Predictive Analytics processing engine can run inside these environments, dramatically improving performance.

SAP BusinessObjects Predictive Analytics is also able to analyze exceptionally wide datasets. In fact, you can have up to 15,000 columns in a dataset, while other tools support only a few hundred to 1,000 columns at most. This ensures that your predictive models provide the greatest level of accuracy possible.

Big Data is radically altering our world. It’s a game changer. If you grab hold of it, you have an opportunity to propel your business forward, and the surest way forward is with SAP BusinessObjects Predictive Analytics running on SAP HANA or Apache Spark. It is the best combination for building predictive models on Big Data, whether you’re a business analyst or a data scientist.