Follow a 3-paragraph format; Define, explain in detail, then present an actual example via research. Your paper must provide in-depth analysis of all the topics presented:

> Read cases and white papers that talk about Big Data analytics. Present the common theme in those case studies.

> Review the following Big Data Tutorial (attached).

> Choose one of the three applications for big data presented (Recommendation, Social Network Analytics, and Media Monitoring)

> Provide a case study of how a company has implemented the big data application and from your research suggest areas of improvement or expansion.

Need 8-10 pages in APA format with introduction and conclusion. Must include minimum of 9 peer-reviewed citations.

Marko Grobelnik, Blaz Fortuna, Dunja Mladenic

Jozef Stefan Institute, Slovenia

Sydney, Oct 22nd 2013

 Big-Data in numbers

 Big-Data Definitions

 Motivation

 State of Market

 Techniques

 Tools

 Data Science

 Applications ◦ Recommendation, Social networks, Media Monitoring

 Concluding remarks

 “Big-data” is similar to “Small-data”, but bigger

 …but having bigger data requires different approaches: ◦ techniques, tools, architectures

 …with an aim to solve new problems ◦ …or old problems in a better way.

 Volume – challenging to load and process (how to index, retrieve)

 Variety – different data types and degree of structure (how to query semi-structured data)

 Velocity – real-time processing influenced by rate of data arrival

From “Understanding Big Data” by IBM

 1. Volume (lots of data = “Tonnabytes”)

 2. Variety (complexity, curse of dimensionality)

 3. Velocity (rate of data and information flow)

 4. Veracity (verifying inference-based models from comprehensive data collections)

 5. Variability

 6. Venue (location)

 7. Vocabulary (semantics)

Comparing volume of “big data” and “data mining” queries

…adding “web 2.0” to “big data” and “data mining” queries volume


 Key enablers for the appearance and growth of “Big Data” are:

◦ Increase of storage capacities

◦ Increase of processing power

◦ Availability of data

Source: WikiBon report on “Big Data Vendor Revenue and Market Forecast 2012-2017”, 2013

 …when the operations on data are complex: ◦ …e.g. simple counting is not a complex problem

◦ Modeling and reasoning with data of different kinds can get extremely complex

 Good news about big-data: ◦ Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)…

◦ …as long as we deal with the scale

 Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub-cubes within the data cube






 A risk with “Big-Data mining” is that an analyst can “discover” patterns that are meaningless

 Statisticians call it Bonferroni’s principle: ◦ Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap


 We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day

◦ 10^9 people being tracked.

◦ 1000 days.

◦ Each person stays in a hotel 1% of the time (1 day out of 100).

◦ Hotels hold 100 people (so 10^5 hotels).

◦ If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?

 Expected number of “suspicious” pairs of people:

◦ 250,000

◦ …too many combinations to check; we need some additional evidence to find “suspicious” pairs of people in a more efficient way

Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
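The figure of 250,000 can be checked with a few lines of arithmetic. The sketch below assumes, as the example does, that people choose days and hotels independently and uniformly at random; the variable names are illustrative only and do not come from the slides.

```python
from math import comb

people = 10**9      # people being tracked
days = 1000         # observation period in days
hotels = 10**5      # enough hotels to hold 1% of the people, 100 per hotel
p_in_hotel = 0.01   # chance that a given person is in a hotel on a given day

# Two given people are both in a hotel on a given day with probability
# 0.01 * 0.01, and then pick the same hotel with probability 1 / hotels.
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels

# The same coincidence has to happen on two different days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

pairs_of_people = comb(people, 2)   # roughly 5 * 10^17
pairs_of_days = comb(days, 2)       # roughly 5 * 10^5

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print(round(expected_suspicious_pairs))
```

With exact binomial coefficients the result lands slightly below 250,000; the slide's round number comes from approximating the number of pairs of n items as n^2/2.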

 Smart sampling of data ◦ …reducing the original data while not losing the statistical properties of data

 Finding similar items ◦ …efficient multidimensional indexing

 Incremental updating of the models ◦ (vs. building models from scratch)

◦ …crucial for streaming data

 Distributed linear algebra ◦ …dealing with large sparse matrices
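As one small illustration of incremental updating, the sketch below maintains a running mean and variance over a stream without storing the data or recomputing from scratch. It uses Welford's online algorithm, which is a choice made here for illustration and is not named in the slides; the class name and sample values are hypothetical.

```python
class RunningStats:
    """Running mean and sample variance, updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the model in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until two observations have arrived."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical stream of values arriving one by one.
stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Each update touches only three numbers, which is why this style of model suits streaming data better than rebuilding the statistics on every arrival.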

 On top of the previous operations we apply the usual data mining / machine learning / statistics operators: ◦ Supervised learning (classification, regression, …)

◦ Unsupervised learning (clustering, different types of decompositions, …)

◦ …

 …we are just more careful about which algorithms we choose ◦ typically linear or sub-linear versions of the algorithms

 An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Leskovec, Ullman: Mining of Massive Datasets”

 Downloadable from:

 Where is processing hosted? ◦ Distributed Servers / Cloud (e.g. Amazon EC2)

 Where is data stored? ◦ Distributed Storage (e.g. Amazon S3)

 What is the programming model? ◦ Distributed Processing (e.g. MapReduce)

 How is data stored & indexed? ◦ High-performance schema-free databases (e.g.


 What operations are performed on data? ◦ Analytic / Semantic Processing

 Computing and storage are typically hosted transparently on cloud infrastructures ◦ …providing scale, flexibility and high fail-safety

 Distributed Servers ◦ Amazon EC2, Google App Engine, Elastic Beanstalk, Heroku

 Distributed Storage ◦ Amazon-S3, Hadoop Distributed File System

 Distributed processing of Big-Data requires non-standard programming models ◦ …beyond single machines or traditional parallel programming models (like MPI)

◦ …the aim is to simplify complex programming tasks

 The most popular programming model is the MapReduce approach ◦ …suitable for commodity hardware to reduce costs

 The key idea of the MapReduce approach:

◦ A target problem needs to be parallelizable

◦ First, the problem gets split into a set of smaller problems (Map step)

◦ Next, the smaller problems are solved in parallel

◦ Finally, the set of solutions to the smaller problems is synthesized into a solution of the original problem (Reduce step)

 Word-count example: four news headlines are split between two Map tasks

◦ Task 1: “Google Maps charts new territory into businesses”; “Google selling new tools for businesses to build their own maps”

◦ Task 2: “Google promises consumer experience for businesses with Maps Engine Pro”; “Google is trying to get its Maps service used by more businesses”

 Map output (counts per task, selected words shown)

◦ Task 1: Businesses 2, Charts 1, Maps 2, Territory 1

◦ Task 2: Businesses 2, Engine 1, Maps 2, Service 1

 Split according to the hash of a key

 In our case: key = word, hash = first character

◦ One Reduce task receives: Businesses 2, Charts 1 (from Task 1) and Businesses 2, Engine 1 (from Task 2)

◦ The other Reduce task receives: Maps 2, Territory 1 (from Task 1) and Maps 2, Service 1 (from Task 2)

 We concatenate the outputs into final result

◦ Businesses 4, Charts 1, Engine 1, Maps 4, Territory 1, Service 1
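The word-count walk-through can be sketched in plain Python. This is a single-process illustration of the Map, split-by-hash, and Reduce steps under simplifying assumptions: the tasks run one after another rather than in parallel on a cluster, words are lower-cased, and the function names are hypothetical rather than taken from Hadoop or any other framework.

```python
from collections import Counter, defaultdict

HEADLINES = [
    "Google Maps charts new territory into businesses",
    "Google selling new tools for businesses to build their own maps",
    "Google promises consumer experience for businesses with Maps Engine Pro",
    "Google is trying to get its Maps service used by more businesses",
]


def map_task(lines):
    """Map step: count the words of one chunk of the input locally."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def partition(word, num_reducers):
    """Toy hash from the example: route a word by its first character."""
    return ord(word[0]) % num_reducers


def reduce_task(pairs):
    """Reduce step: sum the partial counts received for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines, num_mappers=2, num_reducers=2):
    # Split the input into contiguous chunks, one per Map task.
    size = max(1, -(-len(lines) // num_mappers))
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    map_outputs = [map_task(chunk) for chunk in chunks]

    # Shuffle: send every (word, count) pair to the reducer chosen by the hash.
    buckets = defaultdict(list)
    for output in map_outputs:
        for word, count in output.items():
            buckets[partition(word, num_reducers)].append((word, count))

    # Each word lands on exactly one reducer, so the outputs can be concatenated.
    result = {}
    for reducer in range(num_reducers):
        result.update(reduce_task(buckets[reducer]))
    return result


counts = word_count(HEADLINES)
print(counts["businesses"], counts["maps"], counts["charts"])
```

Because the partition depends only on the word, all partial counts for a given word meet at the same Reduce task, which is what makes simple concatenation of the reducer outputs correct.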

 Apache Hadoop ◦ Open-source MapReduce implementation

 Tools using Hadoop:

◦ Hive: data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL)

◦ Pig: high-level data-flow language and execution framework for parallel computation (Pig Latin)

◦ Mahout: scalable machine learning and data mining library

◦ Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data

◦ Many more: Cascading, Cascalog, mrjob, MapR, Azkaban, Oozie, …

 “[…] need to solve a problem that relational databases are a bad fit for”, Eric Evans

 Motives:

◦ Avoidance of Unneeded Complexity – many use cases require only a subset of the functionality of RDBMSs (e.g. not the full ACID properties)

◦ High Throughput – some NoSQL databases offer significantly higher throughput than RDBMSs

◦ Horizontal Scalability, Running on commodity hardware

◦ Avoidance of Expensive Object-Relational Mapping – most NoSQL databases store simple data structures

◦ Compromising Reliability for Better Performance

Based on “NoSQL Databases”, Christof Strauch

 BASE approach ◦ Availability, graceful degradation, performance

◦ Stands for “Basically available, soft state, eventual consistency”

 Continuum of tradeoffs: ◦ Strict – All reads must return data from the latest completed write

◦ Eventual – The system eventually returns the last written value

◦ Read Your Own Writes – see your updates immediately

◦ Session – RYOW only within same session

◦ Monotonic – only more recent data in future requests

 Consistent hashing ◦ Use same function for hashing objects and nodes

◦ Assign objects to nearest nodes on the circle

◦ Reassign object when nodes added or removed

◦ Replicate nodes to r nearest nodes

White, Tom: Consistent Hashing. November 2007. – Blog post of 2007-11-27.
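A minimal sketch of the consistent hashing idea follows. It makes several assumptions not stated in the slides: MD5 is used as the shared hash function, "nearest" is taken to mean the next node clockwise on the circle, and the class, method, and node names are hypothetical.

```python
import bisect
import hashlib


def ring_position(name):
    """Hash a node name or an object key onto the same circle of integers."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted node positions on the circle
        self._node_at = {}    # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        position = ring_position(node)
        if position not in self._node_at:
            bisect.insort(self._positions, position)
            self._node_at[position] = node

    def remove_node(self, node):
        position = ring_position(node)
        if self._node_at.get(position) == node:
            self._positions.remove(position)
            del self._node_at[position]

    def nodes_for(self, key, replicas=1):
        """Return up to `replicas` nodes found clockwise from the key's position."""
        if not self._positions:
            raise LookupError("ring has no nodes")
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(replicas, len(self._positions))
        return [
            self._node_at[self._positions[(start + i) % len(self._positions)]]
            for i in range(count)
        ]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"object-{i}" for i in range(100)]
before = {key: ring.nodes_for(key)[0] for key in keys}

ring.remove_node("node-b")
after = {key: ring.nodes_for(key)[0] for key in keys}

# Only objects that lived on the removed node should have been reassigned.
moved = [key for key in keys if before[key] != after[key]]
print(len(moved), "of", len(keys), "objects moved")
```

The point of the scheme shows up in `moved`: removing a node reassigns only the objects that node was responsible for, while every other object keeps its placement. Production systems usually also place several virtual positions per physical node to even out the load, which this sketch omits.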

 Storage Layout ◦ Row-based

◦ Columnar

◦ Columnar with Locality Groups

 Query Models ◦ Lookup in key-value stores

 Distributed Data Processing via MapReduce

Lipcon, Todd: Design Patterns for Distributed Non-Relational Databases. June 2009. – Presentation of 2009-06-11.
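The three storage layouts can be contrasted on a tiny in-memory example. The records, field names, and group names below are made up for illustration, and the sketch assumes every record has the same fields; real systems apply the same idea to how bytes are arranged on disk.

```python
# Row-based: each record is stored together.
rows = [
    {"id": 1, "title": "CouchDB", "category": "Document Database", "views": 120},
    {"id": 2, "title": "Redis", "category": "Key-Value Store", "views": 45},
    {"id": 3, "title": "HBase", "category": "Column Store", "views": 300},
]


def to_columnar(records):
    """Columnar: all values of one field are stored together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_locality_groups(columns, groups):
    """Columnar with locality groups: columns read together are stored together."""
    return {
        group: {name: columns[name] for name in names}
        for group, names in groups.items()
    }


columnar = to_columnar(rows)
grouped = to_locality_groups(
    columnar,
    {"meta": ["id", "title", "category"], "stats": ["views"]},
)

# An aggregate over one field only has to touch that field's values.
total_views = sum(grouped["stats"]["views"])
print(total_views)
```

Fetching one whole record is cheapest in the row-based form, while scanning a single field across all records is cheapest in the columnar forms; locality groups sit in between by co-locating fields that tend to be queried together.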

 A map or dictionary that allows values to be added and retrieved by key

 Favor scalability over consistency ◦ Run on clusters of commodity hardware ◦ Component failure is “standard mode of operation”

 Examples: ◦ Amazon Dynamo ◦ Project Voldemort (developed by LinkedIn) ◦ Redis ◦ Memcached (not persistent)

 Combine several key-value pairs into documents

 Documents represented as JSON

 Examples: ◦ Apache CouchDB

◦ MongoDB

{
  "Title": "CouchDB",
  "Last editor": "",
  "Last modified": "9/23/2010",
  "Categories": ["Database", "NoSQL", "Document Database"],
  "Body": "CouchDB is a …",
  "Reviewed": false
}

 Using columnar storage layout with locality groups (column families)

 Examples: ◦ Google Bigtable

◦ Hypertable, HBase: open-source implementations of Google Bigtable

◦ Cassandra: a combination of Google Bigtable and Amazon Dynamo, designed for high write throughput

Infrastructure:

 Kafka ◦ A high-throughput distributed messaging system

 Hadoop ◦ Open-source map-reduce implementation

 Storm ◦ Real-time distributed computation system

 Cassandra

◦ Hybrid between Key-Value and Row-Oriented DB

◦ Distributed, decentralized, no single point of failure

◦ Optimized for fast writes