Big Data Now summary


Introduction - Big Data's big ideas

Companies that use data and analytics to drive decision making continue to outperform their peers.

Cognitive Augmentation

Three personas that are customers of a predictive API from a big data analytics company:

  • Data scientists - familiar with statistical modeling and machine learning methods, and drawn to high-quality open-source libraries like scikit-learn.
  • Data analysts - skilled in data transformation and manipulation but with limited coding ability. They prefer a workbench approach such as that offered by SAS or SPSS.
  • Application developers - they just want an API to develop against and are not interested in the inner workings.

Graph databases and analysis began as a new way of thinking about connected devices and social media. However, their applicability to other areas such as network impact analysis, route finding, recommendations, logistics, and fraud investigation is now being realized. GraphX is a popular tool, and the GraphLab conference in SF showcases the best tools in this field.
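
As a concrete taste of graph analysis applied to route finding, here is a minimal sketch using networkx (a general-purpose Python graph library, not GraphX or GraphLab); the road network and weights are made up for illustration.

```python
# Minimal illustration of graph-based route finding with networkx.
import networkx as nx

# Hypothetical road network: nodes are locations, edge weights are travel minutes.
G = nx.Graph()
G.add_edge("depot", "A", weight=4)
G.add_edge("depot", "B", weight=2)
G.add_edge("A", "customer", weight=5)
G.add_edge("B", "A", weight=1)
G.add_edge("B", "customer", weight=8)

# Dijkstra's algorithm answers a typical logistics query: cheapest delivery route.
route = nx.shortest_path(G, "depot", "customer", weight="weight")
cost = nx.shortest_path_length(G, "depot", "customer", weight="weight")
print(route, cost)   # ['depot', 'B', 'A', 'customer'] 8
```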

Intelligence Matters

Revelations from early machine learning studies indicate that machines excel at tasks that were considered hard for humans to solve - highly abstract, well-defined jobs. However, machines struggle at tasks we take for granted, like driving or navigating smoothly in the physical world. Further, current practitioners consider the best applications of AI to be in the prediction and recommendation business of placing ads.

Some questions to ponder on machine learning

  • How should we measure intelligence?
  • We place a lot of importance on the end result, such as playing chess, image classification, or medical diagnosis, but fail to peer into how this intelligence is created.
  • What are probabilistic programming and representational networks? In what way are they important to the future of machine learning?

How Google changed the paradigm of machine learning

Google's astonishing success in machine learning and artificial intelligence can be attributed to this simple logic:

Simple models fed with very large datasets outperformed the sophisticated theoretical models that were all the rage before the big-data era.

Thus Google's approach is

  1. First collect huge amounts of training data - probably more than anyone thought sensible or even possible a decade ago.
  2. Massage and preprocess that data so the key relationships it contains are apparent (feature engineering).
  3. Feed the result into very high-performance, massively parallelized implementations of standard machine learning methods like logistic regression, k-means, and deep neural networks (a toy sketch follows this list).
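
A toy sketch of steps 2 and 3 using scikit-learn (mentioned earlier in these notes); the synthetic dataset and feature-engineering choices are stand-ins for illustration, not Google's actual pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# 1. "Collect" a large training set (a synthetic stand-in here).
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# 2. Feature engineering: scale inputs and add interaction terms.
# 3. Feed the result into a simple, scalable model (logistic regression trained by SGD).
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    SGDClassifier(loss="log_loss", max_iter=10),
)
model.fit(X, y)
print(model.score(X, y))
```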

However, there are situations where Google's approach has shortcomings

  1. Where data is inherently small - or where massive data collection is illegal or expensive.
  2. Where the data collected cannot be interpreted without a sophisticated model.
  3. Where the data cannot be pooled across users or organizations - whether for political or privacy reasons.

One immediate example is the self-driving car problem. The current generation requires highly accurate, pre-built 3D maps. It is very simple for these cars to interpret stop signs and traffic light colors when they know where to look for them from pre-observed street view images. However, they need extremely sophisticated algorithms for unmapped roads, or to navigate unknown terrain such as Mars.

Deep Learning

Deep learning is one approach to building and training neural networks. One of the historical problems with neural networks was training: although feasible at a small scale, training becomes challenging as the number of weights and the size of the data grow. The real advantage of deep learning is that it no longer requires the arduous manual step of sorting through information-rich input datasets to craft features by hand.
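
As a minimal illustration of a network learning directly from raw inputs with no hand-crafted features, here is a small scikit-learn MLP on the digits dataset; real deep learning uses far larger networks and datasets, and specialized frameworks.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)          # raw 8x8 pixel intensities, flattened
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; the network learns its own internal representation of the pixels.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))
```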

Keeping sane

Machine learning's advancements in certain human-like responses (voice recognition and answering back) do not mean machines are equally advanced and efficient in all other applicable fields. Thus, at this point in time, we cannot expect AI to surpass or replace human workers. Systems like Watson are meant to be part of the conversation when they analyze medical records, not to replace a doctor.

The convergence of cheap sensors, fast networks and distributed computing

The rage in data collection right now is stream processing, and a lot of new tools for processing live data have become popular. There are even tools like FeatureStream, a service that extracts features from live data and feeds them to machine learning algorithms.
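
FeatureStream's actual API isn't covered here, so the following is only a plain-Python sketch of the general idea: turning a live stream of readings into rolling features that a machine learning model could consume.

```python
from collections import deque
from statistics import mean, stdev
import random

def rolling_features(stream, window=20):
    """Yield rolling-window features (mean, std, latest) from a live stream."""
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        if len(buf) == window:
            yield {"mean": mean(buf), "std": stdev(buf), "latest": reading}

# A fake sensor stream stands in for a real live source (socket, Kafka consumer, ...).
fake_stream = (random.gauss(50, 5) for _ in range(100))
print(next(rolling_features(fake_stream)))  # first feature dict; in practice these
                                            # would be fed to a machine learning model
```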

Embracing hardware data

The theme of SolidCon was that the cost of bringing a hardware start-up to market would soon reach that of a software start-up. Why is this?

  • Embedded computing used to mean writing in special dialects of C. With increasing compute power and substantial amounts of memory, it is now possible with subsets of C++ and even with interpreted languages such as Python and JavaScript (see the sketch after this list).
  • Many aspects of hardware design can be liberally prototyped in software. Next, these prototypes can be built using cheap Raspberry Pis or Arduinos before real circuit boards are put into place.
  • Most hardware these days is split into a simple hardware unit with I/O and an accompanying smartphone app. This enables hardware with little or no UI, buttons, or display.
  • With 3D printing advancements, prototyping hardware has gotten easier.
  • Mentors and accelerators for hardware start-ups are increasing, providing seed capital and networks.
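
To illustrate the first point about interpreted languages on embedded hardware, here is a minimal Python sketch using the RPi.GPIO library on a Raspberry Pi; the pin number and sensor wiring are hypothetical.

```python
import time
import RPi.GPIO as GPIO

SENSOR_PIN = 17                      # assumed BCM pin wired to a digital sensor

GPIO.setmode(GPIO.BCM)
GPIO.setup(SENSOR_PIN, GPIO.IN)

try:
    for _ in range(10):              # poll the sensor once per second, 10 times
        if GPIO.input(SENSOR_PIN):
            print("sensor triggered")
        time.sleep(1)
finally:
    GPIO.cleanup()                   # release the pins on exit
```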

With all this increase in hardware, data collection is bound to explode, giving scientists more data, and more real-time data, to play with.

Extracting value from IoT

IDC predicts there will be 212 billion things connected to the internet by 2020. GE is working on an industrial internet that consists of 3 things - intelligent machines, advanced analytics and empowered users. One of the biggest applications of IoT in industry is predictive maintenance. It's good for companies to know which vendors' parts fail, how often they fail and the conditions in which they fail. A machine can then be safely taken offline and repaired before it causes a catastrophe.

Union Pacific uses IR and audio sensors along its railroads to gauge the state of wheels and bearings, and ultrasound to spot flaws or damage in critical components. It collects 20,000,000 sensor readings from 3,350 trains and 32,000 miles of track. It uses pattern-matching algorithms to detect potential issues and flag them for action. UP has reduced bearing-related derailments by 75%.
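
Union Pacific's actual pattern-matching algorithms aren't described in the book, so the sketch below is only a simple stand-in for the idea: flag a bearing whose readings drift far from their recent history so the machine can be inspected before it fails.

```python
import numpy as np

def flag_anomalies(readings, window=50, threshold=3.0):
    """Return indices of readings that deviate strongly from the trailing window."""
    readings = np.asarray(readings, dtype=float)
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        z = (readings[i] - history.mean()) / (history.std() + 1e-9)
        if abs(z) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical bearing-temperature stream with an injected fault near the end.
temps = np.concatenate([np.random.normal(70, 2, 500), np.random.normal(95, 2, 10)])
print(flag_anomalies(temps))   # indices around 500 get flagged for inspection
```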

Pendulum of distributed computing - clouds, edges, fogs

One perspective on the history of computing is that it has been a pendulum. Oscillation 1: The first computers were as large as rooms. Then came the mainframe era with powerful central machines and dumb terminals. As the cost of computing dropped, the importance of the UI rose and gave rise to personal computers that could exist and be of value even without a network.

Oscillation 2: With the rise of the WWW, servers and datacenters got the glory. The client was pushed down to mere devices that rendered HTML. Later, when client browsers became capable of running web applications (AJAX, Flex, and JS-like technologies), and with the rise of mobile phones and tablets, the pendulum completed its 2nd oscillation.

Oscillation 3: We are now seeing the rise of dumb edges in the form of IoT. However, quickly enough, there is interest in fog computing, championed by Cisco. It consists of a local cloud of tiny sensors and localized, powerful computing, which then connects to a bigger network.

Data Pipelines

Researchers are now building tools that aid in feature discovery. Data scientists spend so much time on data wrangling and preparation in search of variables to build their models. These variables are called features in ML parlance.

Good features allow a simple model to beat a complex model

Feature selection techniques

Data scientists want to select features, rather than use all variables, in cases where they want quick results or when they have to quickly explain correlation and causation to non-technical folks. Three common methods of selecting features (a minimal sketch follows the list):

  • Having a domain expert pick them out. Some studies are using crowdsourcing as well.
  • Variable-ranking procedures (e.g., AHP).
  • Dimensionality reduction - clustering, PCA, matrix factorization, etc.
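
A minimal scikit-learn sketch of the last two methods (variable ranking and dimensionality reduction); the synthetic dataset and parameter choices are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1_000, n_features=50, n_informative=5,
                           random_state=0)

# Variable ranking: keep the 5 features that score highest against the target.
ranked = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Dimensionality reduction: project all 50 features onto 5 principal components.
reduced = PCA(n_components=5).fit_transform(X)

print(ranked.shape, reduced.shape)   # (1000, 5) (1000, 5)
```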

We are poised to see many products that aid feature selection. The above three methods do not go far in the big data space, where the number of features is just as enormous as the volume of data itself.

Evolving, maturing marketplace of big data components

Database storage redesigned for Solid State Memories

The benefits of flash storage are multifold when data IO is redesigned to tap its true potential, rather than simply swapping magnetic disks for flash. This is important since the raw speed gains of flash are tapering off; to yield the next big breakthrough, companies are optimizing DB design instead.

Characteristics of flash that influence database design

  • Random reads - data locality in the physical address space is no longer a speed boost with flash, compared to spinning disks.

  • Throughput - 2 orders of magnitude faster than spin disk

  • Latency - about 5 - 50 times faster. However these speeds are tapering off.

  • Parallelism - multiple controllers. This is a significant design influence. In traditional DB design, low-level locks prevent multiple concurrent write operations. This was fine on a spinning disk since there was only 1 write head. With flash, DBs that allow multiple concurrent writes benefit even more (a toy sketch follows this list).

  • Aerospike stores indexes in RAM while data is in flash. Cassandra tries to provide locality of reference and defragments data during off-peak load times. FoundationDB, on the other hand, lets the controllers in the latest flash devices take care of preventing fragmented writes.
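
A toy illustration of the parallelism point above: a single global lock serializes all writers (matching one write head), while per-shard locks let independent writes proceed concurrently (matching flash's multiple controllers). This is not how Aerospike, Cassandra, or FoundationDB are actually implemented.

```python
import threading

NUM_SHARDS = 8
shards = [{} for _ in range(NUM_SHARDS)]
shard_locks = [threading.Lock() for _ in range(NUM_SHARDS)]

def write(key, value):
    """Write to the shard that owns `key`; only writers to the same shard contend."""
    idx = hash(key) % NUM_SHARDS
    with shard_locks[idx]:
        shards[idx][key] = value

# Several writers can run in parallel as long as their keys land on different shards.
threads = [threading.Thread(target=write, args=(f"key-{i}", i)) for i in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(len(s) for s in shards))   # 100
```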

Apache Hadoop 2.0 is a significant upgrade from 1.0 due to the introduction of YARN. HDFS2 (Hadoop Distributed File System) is the file system and YARN is the OS.

As the technology for storing and retrieving big data reaches its first round of maturity, companies are focusing on solving data problems in specific industries instead of building generic tools from scratch. A majority of these solutions are built on top of open-source technologies, and they contribute back to those products as well.

An OS for the data center

Current data center software has machines as the level of abstraction. Thus we have 1 application per machine - one for analytics (gp), one for databases (sde), one for web servers (portal), one for message queues, etc. These highly static, inflexible partitions are bound to multiply as companies move away from monolithic architectures to Service Oriented Architecture (SOA). Depending on load, a typical datacenter runs at 8-15% efficiency.

We need an OS layer in the data center that abstracts the hardware resources, just like an OS does on a personal computer. This data center OS layer would allow any machine to run any application, and it would sit on top of a Linux host OS. Thus we need an API for the data center.

IaaS or PaaS don't solve the API need. IaaS (Infrastructure as a Service) still delivers access to machines (primarily VMs); we still need to configure applications on those machines. PaaS (Platform as a Service), on the other hand, abstracts machines away and delivers applications. Thus a data center API would be a middle ground between the two. Apache Mesos is a product in this direction.

Building a data culture

Despite the hype about big data, the challenge organizations face today is that the ability to collect, store and process large volumes of data doesn't confer an advantage by default. A popular myth is that our ability to predict reliably improves with the volume of data available. Another myth is that correlation is as good as causation. The paradox of data is that the more data we have, the more spurious correlations will show up.
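
A quick numeric illustration of that paradox: with enough unrelated variables, some pair will look strongly correlated purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 500))        # 100 samples of 500 independent variables

corr = np.corrcoef(data, rowvar=False)    # 500 x 500 correlation matrix
np.fill_diagonal(corr, 0)                 # ignore each variable's self-correlation
print(np.abs(corr).max())                 # ~0.4+ despite zero true relationship
```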

Perils of big data

Importance of anonymizing algorithms

Data analysis without personal revelations is the ultimate goal, and the core techniques involve algorithms that can perform queries and summaries on encrypted data. Other approaches include not requiring all information to be classified - when finding a match for a wanted felon, don't try to identify all faces in a public surveillance video; only find matches against the required face. A similar case is attempting to detect financial turmoil from the portfolios of stock traders without identifying which trader is holding which portfolio.

Approaches

CryptDB, used by Google, allows the same value to have the same encrypted value everywhere it appears in a dataset. This allows aggregation. Another popular technique is the creation of synthetic datasets from real datasets by scrambling and introducing random noise. Such datasets allow computation but do not correspond to a real person.
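
The sketch below is not CryptDB itself, just an illustration of the property described: if the same value always maps to the same opaque token, grouping and aggregation still work without revealing identities. The HMAC key and records are made up for this sketch.

```python
import hmac, hashlib
from collections import Counter

SECRET_KEY = b"held-by-the-data-owner"    # never shared with the analyst

def pseudonymize(value: str) -> str:
    """Deterministically map a value to an opaque token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

visits = ["alice", "bob", "alice", "carol", "alice", "bob"]
tokens = [pseudonymize(name) for name in visits]

# The analyst sees only tokens, but aggregation still works: same person, same token.
print(Counter(tokens).most_common(1))     # the most frequent visitor, unnamed
```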