Reading Pentaho Kettle Solutions

On a rainy day, there's nothing better than to be sitting by the stove, stirring a big kettle with a finely turned spoon. I might be cooking up a nice meal of Abruzzo Maccheroni alla Chitarra con Polpettine, but actually, I'm reading the ebook edition of Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration on my iPhone.

Some of my notes made while reading Pentaho Kettle Solutions:

…45% of all ETL is still done by hand-coded programs/scripts… made sense when… tools have 6-figure price tags… Actually, some extractions and many transformations can't be done natively in high-priced tools like Informatica and Ab Initio.

Jobs, transformations, steps, and hops are the basic building blocks of KETTLE processes.

It's great to see the Agile Manifesto quoted at the beginning of the discussion of Agile BI.

BayAreaUseR October Special Event

Zhou Yu organized a great special event for the San Francisco Bay Area useR group, and has asked me to post the slide decks for download. Here they are:

No longer missing is the very interesting presentation by Yasemin Atalay, showing analyses of river water environmental factors using the Windermere Humic Aqueous Model, first plotted without R, and then the increase in variety and accuracy of analysis and plotting gained by using R.

Search Terms for Data Management & Analytics

Recently, for a prospective customer, I created a list of search terms to provide them with some "late night" reading on data management & analytics. I've tried these terms out on Google, and as suspected, for most of them the first hit is Wikipedia. While most articles in Wikipedia need to be taken with a grain of salt, they will give you a good overview. [By the way, I use the "Talk" page on an article to see the discussions and arguments about its content as an indicator of how big a grain of salt is needed for that article] ;) So plug these into your favorite search engine, and happy reading.

  • Reporting - top two hits on Google are Wikipedia, and, interestingly, Pentaho
  • Ad-hoc reporting
  • OLAP - one of the first-page hits is the blog of Julian Hyde, creator of the open source OLAP engine Mondrian, as well as of the real-time analytics engine SQLstream
  • Enterprise dashboard - interestingly, Wikipedia doesn't come up in the top hits for this term on Google, so here's a link for Wikipedia: http://en.wikipedia.org/wiki/Dashboards_(management_information_systems)
  • Analytics - isn't very useful as a search term, but the product page from SAS gives a nice overview
  • Advanced Analytics - is mostly marketing buzz, so be wary of anything that you find using this as a search term

Often, Data Mining, Machine Learning and Predictives are used interchangeably. This isn't really correct, as you can see from the following five search terms…

  • Data Mining
  • Machine Learning
  • Predictive Analytics
  • Predictive Intelligence - is an earlier term for Predictives that has mostly been supplanted by Predictive Analytics. I actually prefer just "Predictives".
  • PMML - Predictive Model Markup Language - is a way of transporting predictive models from one software package to another. Few packages will both export and import PMML, and the lack of that capability can lock you into a solution, making it expensive to change vendors. The first hit for PMML on Google today is the Data Mining Group, which is a great resource. One company listed there, Zementis, is a start-up that is becoming a leader in running data mining and predictive models that have been created anywhere. (A small sketch of PMML export from R appears after this list.)
  • R - the R statistical language is difficult to search for on Google. Go to http://www.r-project.org/ and http://www.rseek.org/ … instead. R is useful for writing applications for any type of statistical analysis, and is invaluable for creating new algorithms and predictive models
  • ETL - Extract, Transform & Load - is the most common way of getting information from source systems to analytic systems
  • ReSTful Web Services - Representational State Transfer - can expose data as a web service using the four verbs of the web (GET, POST, PUT, and DELETE)
  • SOA
  • ADBMS - Analytic Database Management Systems doesn't work well as a search term. Start with the Eigenbase.org site and follow the links from the Eigenbase subproject LucidDB. Also, check out Aster Data
  • Bayes - The Reverend Thomas Bayes came up with this interesting approach to statistical analysis in the 1700s. I first started creating Bayesian statistical methods and algorithms for predicting the reliability and risk associated with solid propellant rockets. You'll find good articles using Bayes as a search term in Google. A somewhat denser article can be found at http://www.scholarpedia.org/article/Bayesian_statistics and some interesting research using Bayes can be found at Andrew Gelman's blog. You're likely familiar with one common Bayesian algorithm, naïve Bayes, which is used by most anti-spam email programs. Other forms are objective Bayes, with non-informative priors, and the original subjective Bayes. I have an old aerospace joke about the Rand Corporation's Delphi method, based on subjective Bayes :-) I created my own methodology, and don't really care for naïve Bayes nor non-informative priors.
  • Sentiment Analysis - which is one of Seth Grimes' current areas of research
  • Decision Support Systems - in addition to searching on Google, you might find my recent OSS DSS Study Guide of interest
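
To make the naïve Bayes and PMML items above concrete, here is a minimal R sketch, assuming the add-on packages e1071 and pmml, with the stock iris data standing in for real data: train a naïve Bayes classifier, then export a simple model as PMML so another tool could import and score it.

    # Minimal sketch, assuming install.packages(c("e1071", "pmml"))
    library(e1071)  # provides naiveBayes()
    library(pmml)   # provides pmml() exporters
    library(XML)    # provides saveXML() for writing the PMML document

    # Naive Bayes, the same family of algorithm most anti-spam filters use,
    # here trained on the stock iris data as a stand-in for real data
    nb <- naiveBayes(Species ~ ., data = iris)
    predict(nb, head(iris))  # class predictions for the first six rows

    # PMML: export a simple linear model so another package can import it
    fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
    saveXML(pmml(fit), file = "iris_lm.pmml")

Whether the receiving package can actually import that file is exactly the lock-in question raised above: export without import only gets you halfway.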

Let me know if I missed your favorite search term for data management & analytics.

Data Artisan Smith or Scientist

Over the past few months, a debate has been proceeding on whether or not a new discipline, a new career path, is emerging from the tsunami of data bearing down on us: the need for a new type of Renaissance [Wo]Man to deal with the Big Data onslaught. To wit, Data Science.

I'm writing about this now because last night, at an every-three-week get-together devoted to cask beer and data analysis, the topic came up. [Yes, every THREE weeks - a month is too long to go without cask-beer-fueled discussions of Rstats, BigData, Streaming SQL, BI and more.] The statisticians in the group, including myself, strongly disagreed with the way the term is being used; the software/database types were either in favor or ambivalent. We all agreed that a new, interdisciplinary approach to Big Data is needed. Oh, and I'll stay on topic here, and not get into another debate as to the definition of "Big Data". ;)

This lively conversation reinforced the desire to write about Data Science that had welled up in me after reading "What is Data Science?" by Mike Loukides, published on O'Reilly Radar, and after a subsequent Twitter discussion of data analytics held the following weekend.

The term "Data Science" isn't new, but it is taking on new meanings. The Journal of Data Science published JDS volume 1, issue 1 in January of 2003. The Scope of the JDS is very clearly related to applied statistics

By "Data Science", we mean almost everything that has something to do with data: Collecting, analyzing, modeling...... yet the most important part is its applications --- all sorts of applications. This journal is devoted to applications of statistical methods at large.
-- About JDS, Scope, First Paragraph

There is also the CODATA Data Science Journal, which appears to have last been updated in August of 2007, and currently has no content, other than its self-description as

The Data Science Journal is a peer-reviewed electronic journal publishing papers on the management of data and databases in Science and Technology.

I think that two definitions can be derived from these two journals.

  1. Data Science is systematic study, through observation and experiment, of the collection, modeling, analysis, visualization, dissemination, and application of data.
  2. Data Science is the use of data and database technology within physical and natural sciences and engineering.

I can agree with the first, especially with the JDS Scope clearly stating that Data Science is applied statistics.

The New Oxford American Dictionary, on which the Apple Dictionary program is based, defines science as a noun:

the intellectual and practical activity encompassing the systematic study of the structure and behaviour of the physical and natural world through observations and experiments.

And a similar definition of science can be found on Dictionary.com.

In many ways, I like Mike Loukides' article "What is Data Science?" for how it highlights the need for this new discipline. I just don't like what he describes as the new definition of "data science". Indeed, I very much disagree with this statement from the article.

Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

A statistician is not an actuary. They're very different roles. I know this because I worked for over a decade applying statistics to determine the reliability and risk associated with very large, complex systems such as rockets and space-borne astrophysics observatories. I once hired a Cal student as an intern because she feared that the only career open to her as a math major was to be an actuary. I showed her a different path. So, yes, I know, from experience, that a statistician is not an actuary. Actually, the definition given for a data scientist, that is, "gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others", is exactly what a statistician does.

I do however see the need for a new discipline, separate from applied statistics, or data science. The massive amount of data to come from an instrumented world with strongly interconnected people and machines, and real-time analysis, inference and prediction from those data, will require inter-disciplinary skills. But I see those skills coming together in a person who is more of a smith, or, as Julian Hyde put it last night, an artisan. Falling back on the old dictionary again, a smith is someone who is skilled in creating something with a specific material; an artisan is someone who is skilled in a craft, making things by hand.

Another reason that I don't like the term "data science" for this interdisciplinary role stems from what Mike Loukides describes in his article "What is Data Science?" as the definition of this new discipline: "Data science requires skills ranging from traditional computer science to mathematics to art". I agree that the new discipline requires these three things, and more, including softer skills. I disagree that these add up to data science.

I even prefer "data geek", as defined by Michael E. Driscoll in "The Three Sexy Skills of Data Geeks". Michael Driscoll's post of 2009 May 27 certainly agrees skill-wise with Mike Loukides post of 2010 June 02.

  1. Skill #1: Statistics (Studying)
  2. Skill #2: Data Munging (Suffering)
  3. Skill #3: Visualization (Storytelling)

And I very much prefer "Data Munging" to "Computer Science" as one of the three skills.

I'll stick to the definition that I gave above for data science as "systematic study, through observation and experiment, of the collection, modeling, analysis, visualization, dissemination, and application of data". This is also applied statistics. So, what else is needed for this new discipline? Mike and Michael are correct: computer skills, especially data munging, and art. But any statistician today has computer skills, generally in one or more of SAS, SPSS, R, S-PLUS, Python, SQL, Stata, MATLAB and other software packages, as well as familiarity with various data storage & management methods. Some statisticians are even artists: perhaps as storytellers, as evidenced by that rare great teacher or convincing expert witness; perhaps as visualizers, creating statistically accurate animations to clearly describe the analysis, as evidenced by the career of that intern I hired so many years ago.

The data smith, the data artisan, must be comfortable with all forms of data:

  • structured,
  • unstructured and
  • semi-structured

Like any other smith, someone following this new discipline might serve an apprenticeship, creating new things from these forms of data: a data warehouse or an OLAP cube, a sentiment analysis or a streaming SQL sensor web, a recommendation engine or complex system predictives. The data smith must become very comfortable with putting all forms of data together in new ways, to come to new conclusions.

Just as a goldsmith will never make a piece of jewelry identical to the one finished days before, just as art can be forged but not duplicated, the data smith, the data artisan will glean new inferences every time they look at the data, will make new predictions with every new datum, and the story they tell, the picture they paint, will be different each time.

And perhaps then, the data smith becomes a master, an artisan.

PS: Here's a list of links to that Twitter conversation on data analytics, among some of the most respected people in the biz:

  1. https://twitter.com/NeilRaden/status/15512935981
  2. https://twitter.com/NeilRaden/status/15513225191
  3. https://twitter.com/NeilRaden/status/15513275261
  4. https://twitter.com/NeilRaden/status/15513453916
  5. https://twitter.com/datachick/status/15513460384
  6. https://twitter.com/NeilRaden/status/15513488053
  7. https://twitter.com/datachick/status/15513677836
  8. https://twitter.com/CMastication/status/15513772446
  9. https://twitter.com/NeilRaden/status/15513821393
  10. https://twitter.com/NeilRaden/status/15513854916
  11. https://twitter.com/NeilRaden/status/15513915694
  12. https://twitter.com/alecsharp/status/15513980301
  13. https://twitter.com/NeilRaden/status/15514104372
  14. https://twitter.com/alecsharp/status/15514097194
  15. https://twitter.com/CMastication/status/15514374095
  16. https://twitter.com/estrenuo/status/15514634644
  17. https://twitter.com/NeilRaden/status/15515243453
  18. https://twitter.com/CMastication/status/15516185085
  19. https://twitter.com/annmariastat/status/15516321715
  20. https://twitter.com/NeilRaden/status/15519544709
  21. https://twitter.com/NeilRaden/status/15519597061
  22. https://twitter.com/NeilRaden/status/15519621974
  23. https://twitter.com/skemsley/status/15519932631
  24. https://twitter.com/aristippus303/status/15520146540
  25. https://twitter.com/NeilRaden/status/15520478566
  26. https://twitter.com/SethGrimes/status/15520765766
  27. https://twitter.com/SethGrimes/status/15520851678
  28. https://twitter.com/NeilRaden/status/15521050387
  29. https://twitter.com/NeilRaden/status/15521106901
  30. https://twitter.com/NeilRaden/status/15521133647
  31. https://twitter.com/NeilRaden/status/15521192977
  32. https://twitter.com/SethGrimes/status/15521579977
  33. https://twitter.com/ryanprociuk/status/15521637974

Technology for the OSS DSS Study Guide

'Tis been longer than intended, but we finally have the technology, time and resources to continue with our Open Source Solutions Decision Support System Study Guide (OSS DSS SG).

First, I want to thank SQLstream for allowing us to use SQLstream as a part of our solution. As mentioned in our "First DSS Study Guide" post, we were hoping to add a real-time component to our DSS. SQLstream is not open source, and not readily available for download. It is, however, a co-founder of and core contributor to the open source Eigenbase Project, and has incorporated Eigenbase technology into its product. So, what is SQLstream? To quote their web site, "SQLstream enables executives to make strategic decisions based on current data, in flight, from multiple, diverse sources". And that is why we are so interested in having SQLstream as a part of our DSS technology stack: to have the capability to capture and manipulate data as it is being generated.
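
SQLstream itself is queried with streaming SQL, but the core idea - computing continuously over a moving window of the most recent data, rather than over data at rest - can be sketched in a few lines of plain R. This is only a conceptual stand-in with hypothetical readings, not SQLstream's actual interface.

    # Conceptual stand-in for streaming analytics: a rolling aggregate that
    # would be re-emitted as each new value "arrives". Hypothetical data.
    readings <- c(5, 7, 6, 9, 12, 11, 8, 14)  # values in arrival order
    window <- 3                                # look at the last 3 values
    rolling_mean <- sapply(window:length(readings), function(i) {
      mean(readings[(i - window + 1):i])       # aggregate over the window
    })
    rolling_mean  # one output per arriving row once the window fills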

Today, there are two very important classes of technologies that should belong to any DSS: data warehousing (DW) and business intelligence (BI). What actually comprises these technologies is still a matter of debate. To me, they are quite interrelated and provide the following capabilities.

  • The means of getting data from one or more sources to one or more target storage & analysis systems. Regardless of the details of the source(s) and the target(s), the traditional means in data warehousing is to Extract from the source(s), Transform for consistency & correctness, and Load into the target(s): ETL. Other means, such as using data services within a service-oriented architecture (SOA), either with provider-consumer contracts & the Web Services Description Language (WSDL) or with representational state transfer (ReST), are also possible. (A toy ETL sketch in R follows this list.)
  • Active storage over the long term of historic and near-current data. Active storage as opposed to static storage, such as a tape archive. This storage should be optimized for reporting and analysis through both its logical and physical data models, and through the database architecture and technologies implemented. Today we're seeing an amazing surge of data storage and management innovation, with column-store relational database management systems (RDBMS), map-reduce (M-R), key-value stores (KVS) and more, especially hybrids of several old and new technologies. The innovation is coming so thick and fast that the terminology is even more confused than in the rest of the BI world. NoSQL has become a popular term for all non-RDBMS data stores, and even for some RDBMS such as column-stores. But even here, what once meant "No Structured Query Language" is now often defined as "Not only Structured Query Language", as if SQL were the only way to create an RDBMS (can someone say Progress and its proprietary 4GL?).
  • Tools for reporting, including gathering the data, performing calculations, graphing (or perhaps more accurately, charting), formatting and disseminating.
  • Online Analytical Processing (OLAP), also known as "slice and dice", generally allowing forms of multi-dimensional or pivot analysis. Simply put, there are three underlying concepts for OLAP: the cube (a.k.a. hypercube, multi-dimensional database [MDDB] or OLAP engine), the measures (facts) & dimensions, and aggregation. OLAP provides much more flexibility than reporting, though the two often work hand-in-hand, especially for ad-hoc reporting and analysis.
  • Data Mining, including machine learning and the ability to discover correlations among disparate data sets.
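
As a toy illustration of the ETL pattern described above, here is a short R sketch, assuming the DBI and RSQLite packages and a hypothetical source file sales.csv with region and amount columns:

    # Extract: read the hypothetical source file
    library(DBI)
    sales <- read.csv("sales.csv", stringsAsFactors = FALSE)

    # Transform: enforce consistency & correctness
    sales$region <- toupper(trimws(sales$region))  # normalize region codes
    sales <- sales[!is.na(sales$amount), ]         # drop incomplete rows

    # Load: write to the target store (an in-memory SQLite database here)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "sales_fact", sales)

    # A first "report" against the target: aggregate by region
    dbGetQuery(con, "SELECT region, SUM(amount) AS total
                     FROM sales_fact GROUP BY region")
    dbDisconnect(con)

A tool like Kettle applies the same pattern at scale, with each stage becoming a step in a transformation, connected by hops.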

For our purposes, an important question is whether or not there are open source, or at least open source based, solutions for all of these capabilities. The answer is yes. As a matter of fact, there are three complete open source BI suites [there were four, but the first, the Bee Project from the Czech Republic, written in Perl, is no longer being updated]. Here's a brief overview of SpagoBI, JasperSoft, and Pentaho:

Capability    | SpagoBI             | JasperSoft                  | Pentaho
ETL           | Talend              | Talend (JasperETL)          | KETTLE (PDI)
Included DBMS | HSQLDB              | MySQL                       |
Reporting     | BIRT, JasperReports | JasperReports, iReports     | jFreeReports
Analyzer      | jPivot, PaloPivot   | JasperServer JasperAnalysis | jPivot, PAT
OLAP          | Mondrian            | Mondrian                    | Mondrian
Data Mining   | Weka                | None                        | Weka

We'll be using Pentaho, but you can use any of these, or any combination of the OSS projects that are used by these BI suites, or pick and choose from the more than 60 projects in our OSS Linkblog, as shown in the sidebar to this blog. All of the OSS BI suites have many more features than shown in the simple table above. For example, SpagoBI has good tools for geographic & location services. Also, the JasperSoft Professional and Enterprise Editions have many more features than their Community Edition, such as ad hoc reporting and dashboards. Pentaho's Enterprise Edition has a different analyzer than either jPivot or PAT (Pentaho Analyzer, based upon ClearView from the now-defunct SaaS vendor LucidEra), as well as ease-of-use tools such as an OLAP schema designer, and enterprise-class security and administration tools.

Data warehousing using general purpose RDBMSs such as Oracle, EnterpriseDB, PostgreSQL or MySQL is gradually giving way to analytic database management systems (ADBMS), or, as we mentioned above, the catch-all NoSQL data storage systems, or even hybrid systems. For example, Oracle recently introduced hybrid column-row store features, and Aster Data has a column-store Massively Parallel Processing (MPP) DBMS / map-reduce hybrid [updated 20100616 per comment from Seth Grimes]. Pentaho supports Hadoop, as well as traditional general purpose RDBMSs and column-store ADBMSs. In the open source world, there are two columnar storage engines for MySQL, Infobright and Calpont InfiniDB, as well as one column-store ADBMS purpose-built for BI, LucidDB. We'll be using LucidDB, and just for fun, may throw some data into Hadoop.

In addition, a modern DSS needs two more primary capabilities. The first is predictives, sometimes called predictive intelligence or predictive analytics (PA): the ability to go beyond inference and trend analysis, assigning a probability, with associated confidence, to the likelihood of an event occurring in the future. The second is full statistical analysis, which includes determining the probability density or distribution function that best describes the data. Of course, there are OSS projects for these as well, such as The R Project, the Apache Commons Math library, and other GNU projects that can be found in our Linkblog.
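
As a small example of that last capability, here is a hedged R sketch of distribution fitting, assuming the MASS package and simulated stand-in data: fit two candidate distributions by maximum likelihood, then compare them with AIC.

    # Which distribution best describes the data? (assumes the MASS package)
    library(MASS)
    set.seed(42)
    x <- rgamma(500, shape = 2, rate = 0.5)  # simulated stand-in data
    fit_gamma <- fitdistr(x, "gamma")        # maximum-likelihood fit
    fit_lnorm <- fitdistr(x, "lognormal")    # a competing candidate
    AIC(fit_gamma, fit_lnorm)                # lower AIC = better candidate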

For statistical analysis and predictives, we'll be using the open source R statistical language and the open standard Predictive Model Markup Language (PMML), both of which are also supported by Pentaho.

We have all of these OSS projects installed on a Red Hat Enterprise Linux machine. The trick will be to get them all working together. The magic will be in modeling and analyzing the data to support good decisions. There are several areas of decision making that we're considering as examples. One is fairly prosaic, one is very interesting and far-reaching, and the others are somewhat in between.

  1. A fairly simple example would be to take our blog statistics, plus a real-time stream using SQLstream's Twitter API, and run experiments to determine whether or not, and possibly how, Twitter affects traffic to and interaction with our blogs. Possibly, we could get to the point where we can predict how our use of Twitter will affect our blog. (A sketch of such an analysis appears after this list.)
  2. A much more far-reaching idea was presented to me by Ken Winnick, via Twitter, and has created an ongoing Twitter conversation and hashtag, #BPgulfDB. Let's take crowd-sourced, government, and other publicly available data about the recent oil spill in the Gulf of Mexico, and analyze it.
  3. Another idea is to take historical home utility usage plus current smart meter usage data, and create a real-time dashboard, and even predictives, for reducing and managing energy usage.
  4. We also have the opportunity to use public data to enhance reporting and analytics for small, rural and research hospitals.
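
For the first of those examples, here is a minimal sketch of what such an experiment might look like in R, with hypothetical daily counts standing in for the real blog and Twitter data:

    # Hypothetical daily counts: tweets about the blog vs. blog visits
    traffic <- data.frame(
      tweets = c(0, 2, 5, 1, 0, 8, 3),
      visits = c(120, 150, 210, 140, 115, 260, 170)
    )
    # Poisson regression: model visit counts as a function of tweet activity
    fit <- glm(visits ~ tweets, family = poisson, data = traffic)
    summary(fit)  # does tweeting have a statistically discernible effect?
    # Predicted visits on a day with four tweets
    predict(fit, newdata = data.frame(tweets = 4), type = "response")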
