Tuesday, June 16, 2015
Saturday, February 7, 2015
Data Blending
Data analysts traditionally use spreadsheets to answer straightforward questions, working with simple algorithms and formulas. They rely on readily available data from only a few sources.
When business questions become more complex, data analysts need more sophisticated algorithms or large amounts of data from different sources.
Data exists in many formats: structured (relational databases, spreadsheets), semi-structured (social media posts, blog posts or comments) and unstructured (machine logs, tweets).
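To make the three formats concrete, here is a minimal sketch in Python. The sample records, field names and log layout are all made up for illustration; the point is only how much structure each format gives you for free.

```python
import csv
import io
import json
import re

# Structured: fixed columns, e.g. a spreadsheet export (hypothetical sample).
structured_text = "customer_id,amount\n101,25.50\n102,40.00\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing fields that can vary from record to record.
post_text = '{"user": "ann", "text": "Love the new store", "tags": ["retail"]}'
post = json.loads(post_text)

# Unstructured: free text; any structure has to be extracted with patterns.
# The "date time LEVEL message" layout is an assumption for this example.
log_line = "2015-02-07 10:15:32 ERROR payment gateway timeout for order 7781"
match = re.match(r"^(\S+) (\S+) (?P<level>[A-Z]+) (?P<message>.*)$", log_line)

print(rows[0]["amount"])
print(post["tags"])
print(match.group("level"), "-", match.group("message"))
```

The CSV and JSON parsers recover fields directly, while the log line needs a hand-written pattern that breaks as soon as the layout changes.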
In addition to characterizing data by its format, it is useful to view data by its nature. Data naturally fits into three categories:
Traditional - Data that comes from relational databases, spreadsheets or mainframe systems.
Enrichment - Industry-specific or special-purpose data used to supplement (or enrich) existing data. For example, spatial grid coordinates identifying where customers like to shop would enrich sales information, and demographic information about customers' backgrounds could help a retailer looking at traditional sales data.
Emerging - Data that is often related to big data, along with other sources; social media and marketing automation data are common examples. This data is newer, more valuable and often the most difficult to identify and leverage.
Categories such as Enrichment and Emerging data are the most likely to include Big Data sources; in some cases structured data is included as well.
Once the right data sources are identified and access to them is established, the next step is merging, sorting, joining and otherwise combining all useful data into a functional data set while discarding the noise of unnecessary data. This process is called Data Blending.
Data Blending is a repeatable process: it can be rerun as necessary to add or remove data sources.
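The blending step described above can be sketched in plain Python. This is only an illustration under assumptions: the two sources, their field names and the business question (sales by region) are hypothetical, and a real analyst would more likely use a blending or BI tool than hand-written code.

```python
# Hypothetical traditional source: sales rows from a spreadsheet or database.
sales = [
    {"customer_id": 101, "amount": 25.5, "cashier": "K3"},
    {"customer_id": 102, "amount": 40.0, "cashier": "K1"},
    {"customer_id": 101, "amount": 10.0, "cashier": "K2"},
    {"customer_id": 103, "amount": 5.0, "cashier": "K1"},
]

# Hypothetical enrichment source: demographic data keyed by customer.
demographics = {
    101: {"region": "North", "age_band": "25-34", "newsletter": True},
    102: {"region": "South", "age_band": "35-44", "newsletter": False},
}


def blend(sales_rows, demo_by_customer):
    """Join sales to demographics and keep only the fields the question needs."""
    blended = []
    for row in sales_rows:
        demo = demo_by_customer.get(row["customer_id"])
        if demo is None:
            continue  # no enrichment data for this customer: dropped as noise
        blended.append({"region": demo["region"], "amount": row["amount"]})
    return blended


def sales_by_region(blended_rows):
    """Analytic step: total sales per region on the purpose-built data set."""
    totals = {}
    for row in blended_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


data_set = blend(sales, demographics)
print(sales_by_region(data_set))
```

Adding or removing a source means changing only `blend`, which is what makes the process easy to repeat.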
Data Integration vs Data Blending
Data Integration is not Data Blending. In Data Integration, multiple data sources are combined to create a single unified version of the data in a database, data warehouse or data mart.
Data Blending is a process conducted by a business or data analyst to build a data set for use in analytic processing to answer a specific business question.
The data set is drawn from one or more data sources; the blending occurs as the data set is built from those sources to capture only the relevant data. Analytic processing then runs on that purpose-built data set to answer the question being posed.
The key difference is:
Integration - results in a permanent database intended to store a single copy of the data, managed by DBAs and BI experts.
Blending - results in a data set built to support analysis of a specific business question, created by business and data analysts.
Thursday, October 16, 2014
Understanding Big Data
What is Big Data
Every company now focuses on two things in running its business:
1) Selling its product or service to potential customers and providing customer service.
2) Collecting and analyzing the data generated in that business process.
Storing
Data is typically stored in various forms and on various media, depending on its size, accessibility, availability and security.
Analysis
Beyond storage, this data needs to be processed and analyzed, and used for prediction, to enhance further business activities. It is convenient for a company to combine storage and analysis; this is the point at which it should look at Big Data.
Big Data Ecosystem
Let's take a look at a company's IT infrastructure at different points in time.
[Images: LinkedIn @ 2003 and LinkedIn @ 2014]
Traditional IT applications once stored only events such as account registrations, deposits, sales and purchases.
Today's IT applications are smart enough to recommend products to buy, music or movies we might like, and stocks to invest in. Examples include Facebook friend recommendations, Netflix movie recommendations, Spotify music recommendations and Amazon product recommendations.
To achieve all this, a company should invest in Big Data.
Thus Big Data is a collection of technologies that integrate with each other to deliver real-time solutions as business takes place.
Big Data is not one size fits all; we add or remove tools and technologies to suit our use cases.
Big Data Technologies Landscape
The following technologies integrate with each other; in simple terms, together they form a big data solution for a company.
Cloud Storage (AWS / Windows Azure)
+
HDFS file system (Hadoop clusters / MapReduce)
+
NoSQL databases (Cassandra / HBase)
+
DW Infrastructure (Hive + Spark = Shark)
+
ETL (Talend / Informatica)
+
Analytics / Visualization (R / Python / QlikView / Birst)
+
Business Intelligence (Tableau / MicroStrategy / Cognos)
Monday, March 3, 2014
Big Data is a Technology & Data Warehouse is an Architecture
Big Data (aka Hadoop) has been gaining popularity in recent years. I often hear people say that we don't need a data warehouse if we have Big Data.
I do agree that there are some similarities between a data warehouse and a big data solution:
Both can be used for reporting.
Both are managed by electronic storage devices.
Both can hold a lot of data.
So if a company starts to build a Big Data solution, doesn't that obviate the need for a data warehouse?
What Big Data offers to an organization
- Technology capable of holding very large amounts of data.
- Technology that can hold the data in inexpensive storage devices.
- Technology where processing is done by the "Roman Census" method.
- Technology where the data is stored in an unstructured format.
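The "Roman Census" method in the list above is usually explained as sending the processing to the data instead of hauling all the data to one central place. Here is a minimal Python sketch of that idea; the partitions are hypothetical in-memory lists standing in for data held on different nodes, and this is not how Hadoop itself is programmed.

```python
from collections import Counter

# Hypothetical partitions: each list stands for data held on a different node.
partitions = [
    ["error", "ok", "ok"],
    ["ok", "error", "error", "ok"],
    ["ok"],
]


def count_locally(records):
    """Runs where the data lives and returns only a small summary."""
    return Counter(records)


def combine(summaries):
    """Central step: merge the small per-node summaries into one total."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


local_summaries = [count_locally(p) for p in partitions]
totals = combine(local_summaries)
print(dict(totals))
```

Only the small summaries travel to the central step, which is why the approach copes with very large volumes on inexpensive storage.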
What Data Warehouse offers to an organization
In principle, there are two approaches to a data warehouse: the Kimball approach and the Inmon approach.
The Inmon approach defines a data warehouse as a subject-oriented, non-volatile, integrated, time-variant collection of data created for the purpose of management decision making.
In simple terms, a data warehouse provides a single version of the truth for decision making in the corporation.
Companies need a data warehouse in order to make informed decisions from data that is reliable, believable, readily available and accessible to everyone.
So what does Big Data offer in addition to a data warehouse?
In large corporations there is a lot of data that is not transported into the data warehouse.
There are numerous reasons for not exporting this data to the data warehouse.
[This data cannot be de-normalized, or requires additional data to be imported into the data warehouse.]
For example, tweets and Facebook posts in which consumers discuss a product or service help companies understand consumers' opinions of it.
By understanding this feedback, companies can make changes accordingly.
If a company can turn this valuable unstructured data from various sources into meaningful information and then combine it with the reports from its data warehouse, it can more accurately predict what its customers want and how that is reflected in its sales and revenue.
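As a rough illustration of combining the two worlds, the sketch below scores some made-up social media posts with a naive keyword count and places the result next to a made-up warehouse report. The product names, word lists and numbers are all assumptions, and real sentiment analysis is far more involved than this.

```python
# Hypothetical warehouse report: units sold per product.
warehouse_sales = {"phone_x": 1200, "tablet_y": 300}

# Hypothetical social media posts mentioning the products.
posts = [
    "phone_x battery is great",
    "tablet_y screen is bad and slow",
    "great camera on phone_x",
    "tablet_y is bad value",
]

POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "slow", "poor"}


def sentiment_score(text):
    """Naive keyword score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def feedback_by_product(all_posts, products):
    """Sum the sentiment of every post that mentions each product."""
    scores = {product: 0 for product in products}
    for text in all_posts:
        for product in products:
            if product in text:
                scores[product] += sentiment_score(text)
    return scores


feedback = feedback_by_product(posts, warehouse_sales)
combined = {
    product: {"units_sold": warehouse_sales[product], "sentiment": feedback[product]}
    for product in warehouse_sales
}
print(combined)
```

The warehouse figures stay the trusted record of what sold; the unstructured feedback adds a hint about why.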
Big Data is a technology and a data warehouse is an architecture. A technology is just a means to store and manage large amounts of data. The difference between Big Data and a data warehouse is the difference between a hammer and a nail.
A data warehouse is a way of organizing data so that it has credibility and integrity. For compliance reporting such as Sarbanes-Oxley, Basel II or other styles of compliance reporting, we can depend on the data warehouse.
For all practical purposes, a data warehouse and Big Data have little or no relationship. To conclude: the data warehouse is an architecture and Big Data is a technology.
Monday, February 17, 2014
High Paying Analytic Skills
In the 2014 Dice Tech Salary Survey of over 17,000 technology professionals, the highest-paid IT skill was R programming.
While big data skills in general featured strongly in the top tier, having R at the top of the list reflects the strong demand for skills to make sense of and extract value from big data.
Similarly, the recent O'Reilly Data Scientist survey also found R skills among those that pay in the $111,000 - $125,000 range.
Sunday, February 16, 2014
Digital Intelligence with Splunk
Splunk is a flexible and powerful platform for machine data. It provides an effective way to analyze customer behavior and product usage from websites, mobile apps and social media streams.
Splunk helps us to:
1. Reliably collect data from various user interactions - web, mobile, social and offline
2. Get meaningful insights and powerful visualization with unlimited segmentation and full drill-down on real-time and historical data
3. Correlate data across various digital channels
4. Create reports, dashboards and alerts for meaningful actions based on trends
5. Use Splunk DB Connect and Hadoop Connect to enrich streaming unstructured data with structured data from relational databases, or to move data to Hadoop for complex batch analysis
6. Understand web behavior in real time
7. Improve the mobile app user experience
In general, Splunk's features can be summarized as follows.
Index all types of Data Formats - Splunk indexes virtually any data and data format across your infrastructure in real time.
Ad hoc Search - Search terabytes of historical data and live streaming data using the powerful Splunk search language.
Monitor and Alert - Monitor your data for patterns, breakout trends or specific events and turn these into proactive alerts.
Report and Analysis - Build powerful reports in minutes, visualize your data, perform statistical analysis, spot trends and share your reports.
Custom Dashboards - Create custom dashboards in a few clicks, integrating multiple charts and views of your data for the needs of different users.
Advanced Visualization - Integrate maps and more complex visualizations within Splunk dashboards.
Role based Access - Provide secure, role-based access control to anyone in your organization.
DB Connect - Enrich unstructured data with structured data from relational databases.
Massive Linear Scalability - Scale Splunk linearly across commodity servers to support the largest of data volumes.
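To give a feel for what "Monitor and Alert" means, here is a generic Python sketch that scans machine log lines for a pattern and raises an alert when a threshold is crossed. It does not use Splunk or its search language; the log layout, the source names and the threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical log lines in an assumed "time source LEVEL message" layout.
log_lines = [
    "10:00:01 web INFO page loaded",
    "10:00:02 mobile ERROR login failed",
    "10:00:03 mobile ERROR login failed",
    "10:00:04 web ERROR checkout timeout",
    "10:00:05 mobile INFO session started",
]

ERROR_THRESHOLD = 2  # assumed alerting rule: two or more errors per source

LINE_PATTERN = re.compile(r"^(?P<time>\S+) (?P<source>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def error_counts_by_source(lines):
    """Count ERROR events per source, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        parsed = LINE_PATTERN.match(line)
        if parsed and parsed.group("level") == "ERROR":
            counts[parsed.group("source")] += 1
    return counts


def sources_to_alert(counts, threshold):
    """Return the sources whose error count reached the threshold."""
    return sorted(source for source, n in counts.items() if n >= threshold)


counts = error_counts_by_source(log_lines)
for source in sources_to_alert(counts, ERROR_THRESHOLD):
    print(f"ALERT: {source} logged {counts[source]} errors")
```

A platform like Splunk does this continuously over indexed data; the sketch only shows the shape of the rule.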
Tuesday, February 11, 2014
Business Intelligence with Big Data
The Business Intelligence environment is the top layer of any data warehouse platform; querying and analysis, alerts and report publishing all take place inside this environment.
A handful of BI tools have been on the market for decades to address these reporting requirements.
With the virtually unlimited storage and data exploration offered by Big Data technologies like Hadoop, companies can extend their BI capabilities beyond querying relational data sets to unstructured and schema-less data sets.
Data Science is now taking Business Intelligence to the next level, opening up the opportunity to visualize large-scale structured and unstructured data.