Big Data Now summary

Introduction - Big Data's big ideas

Companies that use data and analytics to drive decision making continue to outperform their peers

Cognitive Augmentation

Three personas that are customers of a predictive API from a big data analytics company

Graph databases and graph analysis were initially seen as a new way of thinking about connected devices and social media. However, their applicability to other areas such as network impact analysis, route finding, recommendations, logistics, and fraud investigation is now being realized. GraphX is a popular software package, and the GraphLab conference in SF showcases the best tools in this field.
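
As a hedged illustration of why graph analysis generalizes beyond social media (my own sketch, not from the book, using the networkx library rather than GraphX), route finding and a naive recommendation both reduce to short graph queries:

```python
# Hedged sketch (not from the book; uses networkx rather than GraphX):
# route finding and a naive recommendation expressed as graph queries.
import networkx as nx

# Route finding: nodes are intersections, edge weights are travel times (minutes).
roads = nx.Graph()
roads.add_weighted_edges_from([
    ("A", "B", 4), ("B", "C", 3), ("A", "D", 7), ("D", "C", 2), ("C", "E", 5),
])
print(nx.shortest_path(roads, "A", "E", weight="weight"))        # ['A', 'B', 'C', 'E']
print(nx.shortest_path_length(roads, "A", "E", weight="weight")) # 12

# Recommendation: a bipartite user-product graph; suggest products bought by
# people who share a purchase with the user, but that the user doesn't own yet.
purchases = nx.Graph()
purchases.add_edges_from([
    ("alice", "book"), ("alice", "lamp"),
    ("bob", "book"), ("bob", "desk"),
])
owned = set(purchases.neighbors("alice"))
suggestions = {
    product
    for item in owned
    for buyer in purchases.neighbors(item)
    for product in purchases.neighbors(buyer)
    if product not in owned
}
print(suggestions)  # {'desk'}
```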

Intelligence Matters

Revelations from early machine learning studies indicate that machines excel at tasks that were considered hard for humans to solve - those that are highly abstract and well defined. However, machines struggle at tasks we take for granted, like driving or navigating smoothly in the physical world. Further, current practitioners consider the best applications of AI to be in the prediction and recommendation business of placing ads.

Some questions to ponder on machine learning

How Google changed the paradigm of machine learning

Google's astonishing success in machine learning and artificial intelligence can be attributed to this simple logic:

simple models fed with very large datasets outperform the sophisticated, theoretical models that were all the rage before the big-data era.

Thus Google's approach is:

  1. First collect huge amounts of training data - probably more than anyone thought sensible or even possible a decade ago
  2. Massage and preprocess that data so that the key relationships it contains become apparent (feature engineering)
  3. Feed the result into very high-performance, massively parallelized implementations of standard machine learning methods such as logistic regression, k-means, and deep neural networks (a rough sketch of this recipe follows the list)
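
Below is a minimal, hedged sketch of the three-step recipe using scikit-learn on simulated data; the dataset, features, and parameters are invented for illustration and this is not Google's actual pipeline.

```python
# Hedged sketch of the three-step recipe above (invented data, not Google's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 1. "Collect" a very large amount of training data (simulated here).
n = 1_000_000
raw = rng.normal(size=(n, 3))                       # three raw measurements
y = (raw[:, 0] * raw[:, 1] + raw[:, 2] > 0).astype(int)

# 2. Feature engineering: make the key relationship (a product term) explicit.
features = np.column_stack([raw, raw[:, 0] * raw[:, 1]])
features = StandardScaler().fit_transform(features)

# 3. Feed the result into a standard, scalable learner (logistic regression),
#    letting scikit-learn use all cores where it can parallelize.
X_train, X_test, y_train, y_test = train_test_split(features, y, random_state=0)
model = LogisticRegression(solver="saga", n_jobs=-1, max_iter=200)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```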

However, there are situations where Google's approach has shortcomings:

  1. Where data is inherently small - or where massive data collection is illegal or expensive
  2. The data collected cannot be interpreted without a sophisticated model
  3. The data cannot be pooled across users or organizations - whether for political or privacy reasons.

One immediate example is the self-driving car problem. Current generations require highly accurate, pre-built 3D maps. It is very simple for these cars to interpret stop signs and traffic-light colors when they know where to look for them from pre-observed street-view images. However, they need extremely sophisticated algorithms for unmapped roads, or to navigate unknown terrain such as Mars.

Deep Learning

Deep learning is one approach to building and training neural networks. One of the long-standing problems with neural networks has been training: although feasible on a small scale, it becomes challenging as the number of weights and the size of the network grow. The real advantage of deep learning is that it no longer requires the arduous manual step of sifting through information-rich input datasets to hand-craft features.
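
As a hedged illustration of "learning features from raw data" (my own sketch; the book does not include code), a small multi-layer network can be trained directly on raw pixel values with no hand-crafted features, here using scikit-learn's MLPClassifier as a convenient stand-in for a full deep learning framework:

```python
# Hedged sketch of "learning features from raw data": a small multi-layer
# network trained directly on raw pixel values (scikit-learn's MLPClassifier
# standing in for a full deep learning framework).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()                               # 8x8 grayscale digit images
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# No manual feature engineering: the hidden layers learn their own
# intermediate representations of the raw pixel intensities.
net = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```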

Keeping sane

Machine learning's advances in certain human-like capabilities (voice recognition and spoken responses) do not mean it is equally advanced and efficient in every other applicable field. Thus, at this point in time, we cannot expect AI to surpass or replace human workers. Systems like Watson are meant to be part of the conversation when they analyze medical records, not to replace the doctor.

The convergence of cheap sensors, fast networks and distributed computing

The rage in data collection right now is stream processing, and many new tools for processing live data have become popular. There is even FeatureStream, a service that extracts features from live data and feeds them to machine learning algorithms.
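
FeatureStream's actual API isn't covered in these notes, so the following is only a hedged sketch of the general idea: a generator that computes rolling-window features from a live stream of readings, which could then be fed to an online learner.

```python
# Hedged sketch of stream-based feature extraction (not FeatureStream's API):
# compute rolling-window features over a live stream of readings.
import random
from collections import deque
from statistics import mean, stdev


def rolling_features(stream, window=10):
    """Yield (mean, stdev, latest) over a sliding window of readings."""
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        if len(buf) == window:
            yield mean(buf), stdev(buf), reading


# Simulated live sensor stream; in practice this might be a socket or queue.
live = (random.gauss(20.0, 2.0) for _ in range(100))

for i, feature_vector in enumerate(rolling_features(live)):
    if i < 3:
        print(feature_vector)   # in practice, feed this to an online learner
```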

Embracing hardware data

The theme of SolidCon was that the cost of bringing a hardware start-up to market would soon reach that of a software start-up. Why is this?

With all this growth in hardware, data collection is bound to explode, giving scientists more data - and more real-time data - to play with.

Extracting value from IoT

IDC predicts there will be 212 billion things connected to the internet by 2020. GE is working on an industrial internet that consists of three things - intelligent machines, advanced analytics, and empowered users. One of the biggest applications of IoT in industry is predictive maintenance. It is valuable for companies to know which vendors' parts fail, how often they fail, and the conditions in which they fail, so a machine can be safely taken offline and repaired before it causes a catastrophe.

Union Pacific uses IR and audio sensors on its railroads to gauge the state of wheels and bearings, and ultrasound to spot flaws or damage in critical components. It collects 20,000,000 sensor readings from 3,350 trains and 32,000 miles of track. It uses pattern-matching algorithms to detect potential issues and flag them for action. UP has reduced bearing-related derailments by 75%.
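
As a hedged sketch of what such pattern matching might look like (this is not Union Pacific's actual system; the window size, threshold, and data are invented), a rolling z-score check can flag readings that suggest a bearing is running abnormally hot:

```python
# Hedged sketch (not Union Pacific's actual system): flag readings whose
# rolling z-score suggests a bearing is running abnormally hot.
import numpy as np


def flag_anomalies(readings, window=50, threshold=3.0):
    """Return indices of readings more than `threshold` standard deviations
    above the mean of the previous `window` readings."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = history.mean(), history.std()
        if sigma > 0 and (readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


rng = np.random.default_rng(1)
temps = rng.normal(70.0, 2.0, size=500)   # simulated bearing temperatures
temps[480:] += 18.0                       # the last readings: a bearing running hot
print(flag_anomalies(temps))              # indices to schedule for inspection
```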

Pendulum of distributed computing - clouds, edges, fogs

One perspective on the history of computing is that it has been a pendulum. Oscillation 1: the first computers were as large as rooms. Then came the mainframe era, with powerful central machines and dumb terminals. As the cost of computing dropped, the importance of the UI rose, giving rise to personal computers that could exist and be of value even without a network.

Oscillation 2: with the rise of the WWW, servers and data centers got the glory. The client was pushed back to being a mere device that rendered HTML. Later, when client browsers became capable of running web applications (AJAX, Flex, and JavaScript-style technologies), and with the rise of mobile phones and tablets, the pendulum completed its second oscillation.

Oscillation 3: we are now seeing the rise of dumb edges in the form of IoT. However, interest is quickly growing in fog computing, a term promoted by Cisco. It consists of a local cloud of tiny sensors plus powerful localized computing, which then connects to a bigger network.

Data Pipelines

Researchers are now building tools that aid in feature discovery. Data scientists spend so much time on data wrangling and preparation in search of variables to build their models. These variables are called features in ML parlance.

Good features allow a simple model to beat a complex model

Feature selection techniques

Data scientists want to select features rather than use all variables when they want quick results or when they have to quickly explain correlation and causation to non-technical folks. There are three common methods of selecting features.

We are poised to see many products that aid feature selection; the above three methods do not go far in the big data space, where the number of features is just as enormous as the volume of data itself.
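
For concreteness, here is a minimal, hedged sketch of one common feature selection approach - univariate scoring with scikit-learn's SelectKBest - on a synthetic dataset where only a handful of the candidate features are informative:

```python
# Hedged sketch of one common feature selection approach: univariate scoring
# with scikit-learn's SelectKBest on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 100 candidate features, only 5 of which are actually informative.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))

X_small = selector.transform(X)           # reduced matrix for a simpler model
print("reduced shape:", X_small.shape)    # (2000, 5)
```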

Evolving, maturing marketplace of big data components

Database storage redesigned for Solid State Memories

The benefits of flash storage are manifold when data I/O is redesigned to tap its true potential, rather than simply swapping magnetic disks for flash. This is important because the raw speed of flash has tapered off, so to yield the next big breakthrough, companies are optimizing database design for it.

Characteristics of flash that influence database design

Apache Hadoop 2.0 is a significant upgrade from 1.0 due to the introduction of YARN. HDFS2 (Hadoop Distributed File System) is the file system and YARN acts as the operating system.

As the technology for storing and retrieving big data reaches its first round of maturity, companies are focusing on solving data problems in specific industries instead of building generic tools from scratch. A majority of these solutions are built on top of open source technologies, and the companies contribute back to those projects as well.

An OS for the data center

Current data center software treats the machine as the level of abstraction. Thus we have one application per machine - one for analytics (gp), one for databases (sde), one for the web server (portal), one for message queues, and so on. These highly static, inflexible partitions are bound to multiply as companies move away from monolithic architectures toward Service Oriented Architecture (SOA). Depending on load, a typical data center runs at 8-15% utilization.

We need an OS layer in the data center that abstracts the hardware resources, just like an OS does on a personal computer. This data center OS layer would allow any machine to run any application and would sit on top of a Linux host OS. In effect, we need an API for the data center.

IaaS and PaaS don't solve this API need. IaaS (Infrastructure as a Service) still delivers access to machines (primarily VMs); we still need to configure applications on those machines. PaaS (Platform as a Service), on the other hand, abstracts machines away and delivers applications. A data center API would thus be a middle ground between the two. Apache Mesos is a product in this direction.

Building a data culture

Despite the hype about big data, the challenge organizations face today is that the ability to collect, store, and process large volumes of data doesn't confer an advantage by default. One popular myth is that our ability to predict reliably improves with the volume of data available; another is that correlation is as good as causation. The paradox of data is that the more data we have, the more spurious correlations will show up.
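
A small numeric illustration of that paradox (my own example, not from the book): with enough unrelated variables, some will correlate with a random target purely by chance.

```python
# Illustration of the paradox: with enough unrelated variables, some will
# correlate with a random target purely by chance.
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols = 200, 5000
target = rng.normal(size=n_rows)
noise_vars = rng.normal(size=(n_rows, n_cols))       # all pure noise

# Correlation of each noise column with the target.
corrs = np.array([np.corrcoef(noise_vars[:, j], target)[0, 1]
                  for j in range(n_cols)])
print("strongest 'relationship' found:", np.abs(corrs).max())
# Typically around 0.25-0.30 here, even though every column is random noise.
```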

Perils of big data

Importance of anonymizing algorithms

Data analysis without personal revelations is the ultimate goal, and the core techniques involve algorithms that can perform queries and summaries on encrypted data. Other approaches avoid classifying all information - when looking for a wanted felon, don't try to identify every face in a public surveillance video; only find matches against the required face. A similar case is trying to detect financial turmoil from the portfolios of stock traders without identifying which trader holds which portfolio.

Approaches

CryptDB, used by Google, allows the same value to have the same encrypted value everywhere it appears in a dataset, which enables aggregation. Another popular technique is the creation of synthetic datasets from real datasets by scrambling them and introducing random noise. Such datasets allow computation but do not correspond to any real person.
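
As a hedged sketch of these two ideas (this is not CryptDB's actual implementation, and a salted hash is pseudonymization rather than true encryption; the salt, records, and noise scale are invented), deterministic tokenization keeps equal values equal so aggregation still works, and added noise yields a synthetic numeric column:

```python
# Hedged sketch (not CryptDB's implementation; a salted hash is only
# pseudonymization, not true encryption): deterministic tokenization keeps
# equal values equal so aggregation still works, and added noise yields a
# synthetic numeric column.
import hashlib
import random
from collections import Counter

SECRET_SALT = b"replace-with-real-secret-material"   # hypothetical key material


def tokenize(value: str) -> str:
    """Deterministically pseudonymize a value: same input -> same token."""
    return hashlib.sha256(SECRET_SALT + value.encode()).hexdigest()[:12]


records = [("alice", 52000), ("bob", 61000), ("alice", 49000)]

# Aggregation still works on tokens because equal names map to equal tokens.
print(Counter(tokenize(name) for name, _ in records))

# Synthetic salaries: real values plus random noise, so aggregates are roughly
# preserved but no row corresponds exactly to a real person.
synthetic = [(tokenize(name), salary + random.gauss(0, 2000))
             for name, salary in records]
print(synthetic)
```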