
Are approximate answers the best way to analyze big data?


Lots of candy (image courtesy of pixabay.com)

In my previous post I reviewed some reasons why people seem reluctant to accept approximate results as being correct and useful. The general consensus is that approximate results are wrong, which is very strange when you consider how often we interact with approximations as part of our everyday lives.

Most of the use cases in my first post on this topic covered situations where distinct counts were the primary goal - how many click-throughs did an advert generate, how many unique sessions were recorded for a web site, etc. The use cases that I outlined provided some very good reasons for using approximations of distinct counts. As we move forward into the era of Analytics-of-Things the use of approximations in queries will expand and this approach to processing data will become an accepted part of our analytical workflows.

To support Analytics-of-Things, Database 12c Release 2 (12.2) includes even more approximate functions. In this release we have added approximations for median and percentile computations along with support for aggregating approximate results (counts, medians and percentiles).

What is a median and percentile?

A quick refresher course… according to Wikipedia, a percentile is:

a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

Percentiles are perfect for locating outliers in your data set. In the vast majority of cases you can start with the assumption that a data set exhibits a normal distribution. Therefore if you take the data around the 0.13th and 99.87th percentiles (i.e. outside 3 standard deviations from the mean) then you get the anomalies. Percentiles are also great for allowing you to quickly eyeball the distribution of a data set so that you can check for skew or bimodality. Probably the most common use case is around monitoring service levels, where these anomalies are the values of most interest.
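
As a quick illustration, here is a minimal sketch against the sample SH schema (using the APPROX_PERCENTILE function covered later in this post) that estimates those 3-standard-deviation outlier bounds for the amount sold:

-- Anything below lower_bound or above upper_bound is a candidate anomaly
SELECT
  APPROX_PERCENTILE(0.0013) WITHIN GROUP (ORDER BY amount_sold) AS lower_bound,
  APPROX_PERCENTILE(0.9987) WITHIN GROUP (ORDER BY amount_sold) AS upper_bound
FROM sales;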

On the other hand, a median is:

the number separating the higher half of a data sample, a population, or a probability distribution, from the lower half. 

Why would you use the median rather than the mean? In other words, what are the use cases that require the median? The median is great at removing the impact of outliers because the data is sorted and then the middle value is extracted, whereas the average is susceptible to being skewed by outliers. A great use case for the median is in resource planning. If you want to know how many staff you should assign to manage your web-store application you might create a metric based on the number of sessions during the year. With a web-store the number of sessions will peak around key dates such as July 4th and Thanksgiving. Calculating the average number of sessions over the year will be skewed by these two dates and you will probably end up with too many staff looking after your application. Using the median removes these two spikes and will return a more realistic figure for the number of sessions per day during the year.
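
A minimal sketch of that comparison, assuming a hypothetical WEB_SESSIONS table with one row per session:

-- Hypothetical table: WEB_SESSIONS(session_id, session_date)
SELECT
  AVG(daily_sessions)    AS avg_sessions_per_day,
  MEDIAN(daily_sessions) AS median_sessions_per_day
FROM (
  SELECT TRUNC(session_date) AS session_day, COUNT(*) AS daily_sessions
  FROM   web_sessions
  GROUP  BY TRUNC(session_date)
);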

But before you start to consider where, when, how or even if you want to use approximate calculations, you need to step back for a moment and think about the accuracy of your existing calculations, which I am guessing you think are 100% accurate!

Is your data accurate anyway?

Most business users work on the assumption that the data set they are using is actually 100% accurate and for the vast majority of operational sources flowing into the data warehouse this is probably true, although there will always be parentless dimension values and, in some cases, “Other” bucket dimension members used to create some semblance of accuracy.

As we start to explore big data related sources pulled from untrusted external sources and IoT sensor streams, which typically are inherently “noisy”, then the level of “accuracy” within the data warehouse starts to become a range rather than a single specific value.

Let’s quickly explore the three key ways that noise gets incorporated into data sets:

1) Human errors

Human input errors: probably the most obvious. They affect both internal and external sources that rely on human input or interpretation of manually prepared data. Free format fields on forms create all sorts of problems because the answers need to be interpreted. Good examples are insurance claim forms, satisfaction surveys, crime reports, sales returns forms, etc.

2) Coding errors

ETL errors: Just about every data source feeding a data warehouse goes through some sort of ETL process. Whilst this is loosely linked to the first group of errors it does fall into this group simply because of the number of steps involved in most ETL jobs. There are so many places where errors can be introduced.

Rounding and conversion errors: When an ETL job takes source data, converts it and then aggregates it before pushing it into the warehouse, it will always be difficult to trace the aggregated numbers back to the source data because of inherent rounding errors. When dealing with currency exchange rates it can be a little difficult to tie back source data in one currency to the aggregated data in the common currency due to tiny rounding errors.

3) Data Errors

Missing data points:  Data always gets lost in translation somewhere down the line or is simply out of date. In many cases this is the biggest source of errors. For example, one bank recently put together a marketing campaign to stop customer churn. Before they launched the campaign one of their data scientists did some deeper analysis and discovered that the training data for the model included customers who were getting divorced and this was being flagged as a lost customer. Including this group ended up skewing the results. The data about changes to marital status was not being pushed through fast enough to the data warehouse.

Meaningless or distracting data points: with the growth in interest in the area of IoT it is likely that this type of “noise” will become more prevalent in data sets. Sensor data is rarely 100% accurate mainly because in many cases it does not need to deliver that level of accuracy. The volume of data being sent from the sensor will allow you to easily remove or flag meaningless or distracting data. With weblogs it is relatively easy to ignore click-events where a user clicks on an incorrect link and immediately clicks the back-button.

In other words, in many situations getting precise answers is nothing but an illusion: even when you process your entire data set, the answer is still an approximate one. So why not use approximation to your computational advantage, and in a way where the trade-off between accuracy and efficiency is controlled by you?

Use cases for these new features

There are a lot of really good use cases for these types of approximations but here are my two personal favorites:

Hypothesis testing - a good example of this is A/B testing, which is most commonly used in conjunction with website design and ad design to select the page design or ad that generates the best response. With this type of analysis it is not vital that you have accurate, precise values. What is needed is the ability to reliably compare results, and approximations are normally good enough.

Ranking - How does your ISP calculate your monthly usage so they can bill you fairly? They use a percentile calculation where they remove the top 2% - 5% of your bandwidth peaks and then use that information to calculate your bill. By using data below the 95th-98th percentile they can ignore the infrequent peaks when, say, you are downloading the latest update to your Android or iOS device. Again, having precise numbers for this percentile cut-off is not really necessary. A good enough approximation of the 95th percentile is usually sufficient because it implies that approximately 95% of the time your usage is below the data volume identified around that percentile. And conversely, the remaining 5% of the time your usage creeps above that amount.
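
A rough sketch of that calculation, assuming a hypothetical BANDWIDTH_SAMPLES table holding regular usage readings per customer:

-- Hypothetical table: BANDWIDTH_SAMPLES(customer_id, sample_time, mbps)
SELECT
  customer_id,
  APPROX_PERCENTILE(0.95) WITHIN GROUP (ORDER BY mbps) AS billable_rate_mbps
FROM   bandwidth_samples
WHERE  sample_time >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -1)
AND    sample_time <  TRUNC(SYSDATE, 'MM')
GROUP  BY customer_id;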

Of course all the use cases that we considered for distinct counts in the first post are also valid:

Discovery analytics: data analysts often slice and dice their dataset in their quest for interesting trends, correlations or outliers. If your application falls into this type of explorative analytics, getting an approximate answer within a second is much better than waiting twenty minutes for an exact answer. In fact, research on human-computer interaction has shown that, to keep business users engaged and productive, the response times for queries must be below 10 seconds. In particular, if the user has to wait for the answer to their query for more than a couple of seconds then their level of analytical thinking can be seriously impaired.

Market testing: the most common use case for market testing is serving ads on websites. This is where two variants of a specific ad (each with a group of slightly different attributes such as animations or colour schemes) are served up to visitors during a session. The objective is to measure which version generates a higher conversion rate (i.e. more click-throughs). The analytics requires counting the number of clicks per ad with respect to the number of times each ad was displayed. Using an approximation of the number of click-throughs is perfectly acceptable. This is similar to the crowd-counting problem where it is not really necessary to report exactly how many people joined a rally or turned up to an event.

Root cause analysis: contrary to perceived wisdom, this can in fact be accomplished using approximations. Typically RCA follows a workflow model where results from one query trigger another query, which in turn triggers another related query. Approximations are used to speed up the decision as to whether or not to continue with a specific line of analysis. Of course you need to incorporate the likelihood of edge cases within your thinking process because there is the danger that the edge values will get lost within the general hashing process.

However, in these examples we usually end up merging or blending the first two use cases with the three above to gain a deeper level of insight, so now let’s look at the new approximate statistical functions introduced in Database 12c Release 2.

Approximate median and percentile

With Database 12c Release 2 we have added two new approximate functions:

APPROX_PERCENTILE(%_number [DETERMINISTIC], [ERROR_RATE|CONFIDENCE]) WITHIN GROUP (ORDER BY expr [ DESC | ASC ])


This function takes three input arguments. The first argument is a numeric value ranging from 0 to 1 (i.e. 0% to 100%). The second argument is optional: if the DETERMINISTIC keyword is provided it means the user requires deterministic results; if it is not provided, deterministic results are not mandatory. The input expression for the function is derived from the expr in the ORDER BY clause.

 

The approx_median function has the following syntax:

APPROX_MEDIAN(expr [DETERMINISTIC], [ERROR_RATE|CONFIDENCE])

We can use these functions separately or together as shown here using the SH schema:

SELECT
  calendar_year,
  APPROX_PERCENTILE(0.25) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.25",
  TRUNC(APPROX_PERCENTILE(0.25, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-er",
  TRUNC(APPROX_PERCENTILE(0.25, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-ci",
  APPROX_MEDIAN(amount_sold deterministic) as "p-0.50",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'ERROR_RATE'),2) as "p-0.50-er",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'CONFIDENCE'),2) as "p-0.50-ci",
  APPROX_PERCENTILE(0.75 deterministic) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.75",
  TRUNC(APPROX_PERCENTILE(0.75, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-er",
  TRUNC(APPROX_PERCENTILE(0.75, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-ci"
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY calendar_year
ORDER BY calendar_year

The results from the above query are shown below

Resultset showing approx median and percentile calculations

Note that for the APPROX_MEDIAN function I have included the keyword “DETERMINISTIC”.  What does this actually mean? 

Due to the nature of computing approximate percentiles and medians it is not possible to provide a specific and constant value for the error rate or the confidence interval. However, when we used a large-scale real-world customer data set (a manufacturing use case) we saw an error range of around 0.1 - 1.0%. Therefore, in broad general terms, accuracy will not be a major concern.

Error rates and confidence intervals

How closely an approximate answer matches the precise answer is gauged by two important statistics:

  • margin of error
  • confidence level.

These two pieces of information tell us how well the approximation represents the precise value. For example, a result may have a margin of error of plus or minus 3 percent at a 95 percent level of confidence. These terms simply mean that if the analysis were conducted 100 times, the data would be within a certain number of percentage points above or below the percentage reported in 95 of the 100 runs.

In other words, Company X surveys customers and finds that 50 percent of the respondents say its customer service is “very good.” The confidence level is cited as 95 percent plus or minus 3 percent. This information means that if the survey were conducted 100 times, the percentage who say service is “very good” will range between 47% and 53% most (95%) of the time (for more information see here: https://www.isixsigma.com/tools-templates/sampling-data/margin-error-and-confidence-levels-made-simple/).

Please note that if you search for more information about error rates and confidence levels then a lot of results will talk about sample size and working back from typical or expected error rates and confidence levels to determine the sample size needed. With approximate query processing we do not sample the source data. We always read all the source values - there is no sampling!
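
If you want to gauge the accuracy on your own data you can simply run the approximate and exact calculations side by side - for example, against the SH sample schema:

-- Compare the exact and approximate number of distinct customers
SELECT
  COUNT(DISTINCT cust_id)        AS exact_distinct_custs,
  APPROX_COUNT_DISTINCT(cust_id) AS approx_distinct_custs
FROM sales;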

 

Performance - how much faster is an approximate result?

 As a test against a real world schema we took a simple query from the customer that computed a number of different median calculations:

 

SELECT count(*) FROM (SELECT /*+ NO_GBY_PUSHDOWN */ b15, median(b4000), median(b776), median(e), median(f), median(n), median(z) FROM mdv group by b15);

 

Performance Graph for non-approximate query

As you can see from the real-time monitoring page, the query accessed 105 million rows and the calculations generated 11GB of temp. That’s a lot of data for one query to spill to disk!

Resource usage for non-approximate query

 

Now if we convert the above query to use the approx_median function and rerun the query we can see below that we get a very different level of resource usage:

Performance graph for approximate query

 

Looking closely at the resource usage you can see that the query is 13x faster and uses considerably less memory (830KB vs 1GB) but, most importantly, there is no usage of temp:

Resource graph for approximate query

 

 Summary

One of the most important take-aways from this post relates to the fact that we always read all the source data. The approximate functions in Database 12c Release 2 do not use sampling as a way to increase performance. These new features are significantly faster and use fewer resources, which means more resources are available for other queries - allowing you to do more with the same level of resources.



MATCH_RECOGNIZE - What should I include in the MEASURE clause?

Image courtesy of wikipedia

This post is the result of reviewing a post on stackoverflow.com: http://stackoverflow.com/questions/41649178/getting-error-ora-00918-when-using-match-recognize. Here is my version of the code which includes the same issues/errors as the original, however, I am using the TICKER schema table that I always use for the tutorials that I post on liveSQL :

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY symbol, tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

The above example will not run because of the following error:

ORA-00918: column ambiguously defined
00918. 00000 - "column ambiguously defined"
*Cause:
*Action:
Error at Line: 1 Column: 8

So what is wrong with our code? As MATHGUY pointed out in his reply on stackoverflow.com - quite a lot actually! Let’s start by differentiating between “won’t run” and “wrong”. The ORA-918 error is easy to resolve if you stare long enough at the ORDER BY clause! It’s pointless to include the SYMBOL column as both the partition by key and the order by key. If we change the ORDER BY clause as shown here then the code will run:

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which returns the following resultset (all 60 rows from our source ticker table):

All rows from the ticker table

 

No MEASURE clause

Okay, so our code is running - now what? If you look at the output you will notice that it contains the same rows and columns as the source table. What happens if we omit the MEASURE clause? Well, it’s optional so the code should still run…

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

and sure enough we get the same resultset (all 60 rows from our source ticker table):

All rows from the ticker table

 

ALL ROWS PER MATCH vs. ONE ROW

From the above we can assume that there is no need to list the source columns from your input table in the MEASURE clause because they are automatically included in the output. BUT this is ONLY true when you use ALL ROWS PER MATCH. If we change the output control to ONE ROW PER MATCH:

SELECT symbol, tstamp, price
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

you will now get an error:

ORA-00904: "PRICE": invalid identifier

00904. 00000 - "%s: invalid identifier"
*Cause:
*Action:
Error at Line: 19 Column: 24

 

because when using ONE ROW PER MATCH the only columns that are automatically returned are those listed in the PARTITION BY clause. Therefore, we need to use either “SELECT * FROM …..” or “SELECT symbol FROM…” to get a working version of our code. Using “SELECT * FROM…” as follows:

SELECT * 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

 

actually returns only one column (symbol) from the ticker table:

 

Only partition by column returned

So what should we include in the MEASURE clause?

Based on the query that was in the original post I think the following syntax would make it easier to understand what is happening within the pattern matching process and provide useful information about the data that matches the pattern:

SELECT symbol, tstamp, price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(b.tstamp) AS first_date,
LAST(b.tstamp) AS last_date,
FIRST(b.price) AS first_price,
LAST(b.price) AS last_price,
match_number() AS m_n,
classifier() AS classi
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which gives us the following output based on ALL ROWS PER MATCH:

Output from Amended Measure Clause

 

and if we want to switch to using ONE ROW PER MATCH then we need to remove references to the columns tstamp and price and replace them with references to the pattern variable specific versions, or we can just remove the references altogether. In this case, as we only have two pattern variables, we can NVL the references to return the required data:

SELECT symbol, o_tstamp, o_price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
nvl(a.tstamp, b.tstamp) as o_tstamp,
nvl(a.price, b.price) as o_price,
FIRST(b.tstamp) as first_date,
LAST(b.tstamp) as last_date,
FIRST(b.price) as first_price,
LAST(b.price) as last_price,
match_number() as m_n,
classifier() as classi
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

 

which generates roughly the same output as the previous statement except that this time we are referencing specific instances of tstamp and price.

 

New Measures For TSTAMP and PRICE

 

Summary

What have we learned:

Point 1: Check your PARTITION BY and ORDER BY clauses to ensure they make sense!

Point 2: there is no need to list the source columns from your input table in the MEASURE clause because they are automatically included BUT ONLY when you use ALL ROWS PER MATCH.

Point 3: Decide on your output method and match the columns listed in the SELECT clause with those returned by either ALL ROWS PER MATCH or ONE ROW PER MATCH.

Point 4: It is always a good idea to check that your pattern is being applied correctly by using the built-in MATCH_NUMBER() and CLASSIFIER() measures.

 

For more general information about how to get started with MATCH_RECOGNIZE follow these links to previous blog posts:

and check out the growing library of tutorials on liveSQL.oracle.com. Hope this helps to throw some light on the workings of the MEASURE clause within MATCH_RECOGNIZE.


Dealing with very very long string lists using Database 12.2


Lots Of Pieces String

 

Oracle RDBMS 11gR2 introduced the LISTAGG function for working with string values. It can be used to aggregate values from groups of rows and return a concatenated string where the values are typically separated by a comma or semi-colon - you can determine this yourself within the code by supplying your own separator symbol.

Based on the number of posts across various forums and blogs, it is widely used by developers. However, there is one key issue that has been highlighted by many people: when using LISTAGG on data sets that contain very large strings it is possible to create a list that is too long. This causes the following overflow error to be generated:

ORA-01489: result of string concatenation is too long.

Rather annoyingly for developers and DBAs, it is very difficult to determine ahead of time if the concatenation of the values within the specified LISTAGG measure_expr will cause an ORA-01489 error. Many people have posted workarounds to resolve this problem - including myself. Probably the most elegant and simple solution has been to use the 12c MATCH_RECOGNIZE feature, however, this required use of 12c Release 1 which was not always available to all DBAs and/or developers.

If you want to replicate the problem and you have access to the sample SH schema then try executing this query:

SELECT
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

All the samples in this post use our sample SH schema. Once we release the on-premise version of 12.2 you will be able to download the Examples file for your platform from the database home page on OTN. I do have a tutorial ready-to-go on LiveSQL, however, there is currently a technical issue with using very very very long strings - essentially, running my LISTAGG workshop generates a ‘Data value out of range’ error so as soon as it’s fixed I will update this blog post with a link to the tutorial.

 

What have we changed in 12.2?

One way of resolving ORA-01489 errors is to simply increase the size of VARCHAR2 objects.

Larger object sizes

The size limit for VARCHAR2 objects is determined by the database parameter MAX_STRING_SIZE. You can check the setting in your database using the following command:

show parameter MAX_STRING_SIZE

in my demo environment this returns the following:

NAME            TYPE   VALUE
--------------- ------ --------
max_string_size string STANDARD

Prior to Oracle RDBMS 12.1.0.2 the upper limit for VARCHAR2 was 4K. With Oracle RDBMS 12.1.0.2 this limit has been raised to 32K. This increase may solve a lot of issues but it does require a change to the database parameter MAX_STRING_SIZE. Setting MAX_STRING_SIZE = EXTENDED enables the new 32767 byte limit.

ALTER SYSTEM SET max_string_size=extended SCOPE=SPFILE;
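
Note that this is a one-way change and, for an on-premise non-CDB database, the documented procedure involves restarting the instance in UPGRADE mode and running the utl32k.sql script - roughly the following sequence (this is only a sketch, so check the documentation for your exact configuration, e.g. multitenant or RAC, before trying it):

-- Sketch only: take a backup first and review the 12.2 Reference guide for MAX_STRING_SIZE
SHUTDOWN IMMEDIATE
STARTUP UPGRADE
ALTER SYSTEM SET max_string_size=extended SCOPE=SPFILE;
@?/rdbms/admin/utl32k.sql
SHUTDOWN IMMEDIATE
STARTUP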

However, with the increasing interest in big data sources it is clear that there is still considerable potential for ORA-01489 errors as you use the LISTAGG feature within queries against extremely large data sets.

What is needed is a richer syntax within the LISTAGG function and this has now been implemented as part of Database 12c Release 2.

Better list management

With 12.2 we have made it easier to manage lists that are likely to generate an error because they are too long. There is a whole series of new keywords that can be used:

  • ON OVERFLOW ERROR
  • ON OVERFLOW TRUNCATE
  • WITH COUNT vs. WITHOUT COUNT

Let’s look a little closer at each of these features….. 

1. Keeping Pre-12.2 functionality

If you want your existing code to continue to return an error if the string is too long then the great news is that this is the default behaviour. When the length of the LISTAGG string exceeds the VARCHAR2 limit then the standard error will be returned:

ERROR at line xxx:
ORA-01489: result of string concatenation is too long

 

However, where possible I would recommend adding “ON OVERFLOW ERROR” to your LISTAGG code to make it completely clear that you are expecting an error when an overflow happens:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

So it’s important to note that by default the truncation features are disabled and you will need to change any existing code if you don’t want an error to be raised.

2. New ON OVERFLOW TRUNCATE… keywords

If you want to truncate the list of values at the 4K or 32K boundary then you need to use the newly added keywords ON OVERFLOW TRUNCATE as shown here:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

 

When truncation occurs we truncate back to the last complete value, at which point you can control how you tell the user that the list has been truncated. By default we append three dots ‘…’ to the string as an indicator that truncation has occurred, but you can override this as follows:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

 

If you want to keep the existing pre-12.2 behaviour where we return an error if the string is too long then you can either rely on the default behaviour or explicitly state that an error should be returned (it is always a good idea to avoid relying on default behaviour, in my opinion) by using the keywords:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

which will now generate the normal error message - i.e. replicates the pre-12.2 behaviour:

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

Of course you can simply omit the new keywords and get the same behaviour:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

which, as before, generates the normal error message - i.e. replicates the pre-12.2 behaviour:

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

 

3. How many values are missing?

If you need to know how many values were removed from the list to make it fit into the available space then you can use the keywords ‘WITH COUNT’ - this is the default behaviour. Alternatively, if you don’t want a count at the end of the truncated string you can use the keywords ‘WITHOUT COUNT’.

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITH COUNT) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

 

4. Do we split values when truncation occurs?

No. When determining where to force the truncation we take into account the full length of each value. Therefore, if you consider the example that we have been using, which creates a list of customer names within each country, we will always include the customer's full name “Keith Laker” (i.e. first name + last name). There has to be enough space to add the complete string (first + last name) to the list, otherwise the whole string “Keith Laker” is removed and the truncation indicator is inserted. It is not possible for the last value in the string to be only the first name with the last name truncated/removed.

 

5. How do we calculate the overall length of the string values?

The characters indicating that an overflow has occurred are appended at the end of the list of values - in this case the default value of three dots “…”. The overflow functionality traverses backwards from the maximum possible length to the end of the last complete value in the LISTAGG clause, then it adds the user-defined separator followed by the user-defined overflow indicator, followed by the output from the WITH COUNT clause, which adds a counter at the end of the truncated string to indicate the number of values that have been removed/truncated from the list.

Summary

With Database 12c Release 2 we have tackled the ORA-01489 error in two ways: 1) increased the size of VARCHAR2 objects to 32K and 2) extended the functionality of LISTAGG to allow greater control over the management of extremely long lists. Specifically, there are several new keywords:

  • ON OVERFLOW TRUNCATE
  • ON OVERFLOW ERROR (default behaviour)
  • WITH COUNT (default behaviour)
  • WITHOUT COUNT

Hopefully this new functionality will mean that all those wonderful workarounds for dealing with “ORA-01489: result of string concatenation is too long“ errors that have been created over the years can now be replaced by standard SQL functionality. 

 


How to intelligently aggregate approximations


Simple abacus with coloured beads

 

The growth of low-cost storage platforms has allowed many companies to actively seek out new external data sets and combine them with internal historical data that goes back over a very long time frame. Therefore, as both the variety and the volume of data continue to grow, the challenge for many businesses is how to process this ever expanding pool of data and, at the same time, make timely decisions based on all the available data.

(Image above courtesy of http://taneszkozok.hu/)

In previous posts I have discussed whether an approximate answer is just plain wrong and whether approximate answers are the best way to analyze big data. As with the vast majority of data analysis, at some point there is going to be a need to aggregate a data set to get a higher level view across various dimensions. When working with approximate result sets, dealing with aggregations can get a little complicated because it is not possible to “reuse” an approximate result set to aggregate data to higher levels across the various dimensions of the original query. To obtain a valid approximate result set requires a query to rescan the source data and compute the required analysis for the given combination of levels. Just because I have a result set that contains a count of the number of unique products sold this week at the county level does not mean that I can simply reuse that result set to determine the number of distinct products sold this week at the state level. In many cases you cannot just roll up aggregations of aggregations.

 

Cannot aggregate data from existing approximate result sets

Until now….

With Database 12c Release 2 we have introduced a series of new functions to deal with this specific issue - the need to create reusable aggregated results that can be “rolled up” to higher aggregate levels. So at long last you can now intelligently aggregate approximations! Here is how we do it… Essentially there are three parts that provide the “intelligence”:

  • APPROX_xxxxxx_DETAIL
  • APPROX_xxxxxx_AGG
  • TO_APPROX_xxxxxx

Here is a quick overview of each function:

APPROX_xxxxxx_DETAIL

This function takes a numeric expression and builds a summary result set containing results for all dimensions in the GROUP BY clause. The output from this function is a column containing BLOB data. As with the other approximate functions, the results can be deterministic or non-deterministic depending on your requirements.

APPROX_xxxxxx_AGG 

This function builds a higher level summary based on the results from the _DETAIL function. This means that it is not necessary to re-query the base fact table in order to derive new aggregate results. As with the _DETAIL function, the results are returned as a BLOB.

TO_APPROX_xxxxxx

Returns the results from the _AGG and _DETAIL functions in a user readable format.

 

Three new functions for managing approximate aggregations

 

A worked example

Let’s build a working example using the sample SH schema. Our product marketing team wants to know within each year the approximate number of unique customers within each product category. Thinking ahead we know that once they have this result set we expect them to do further analysis such as drilling on the time and product dimension levels to get deeper insight. The best solution is to build a reusable aggregate approximate result set using the new functions in 12.2.

SELECT
 t.calendar_year,
 t.calendar_quarter_number AS qtr,
 p.prod_category_desc AS category_desc,
 p.prod_subcategory_desc AS subcategory_desc,
 APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

 

This returns my result set as a BLOB, as shown below, and this BLOB contains the various tuple combinations from my GROUP BY clause. As a result I can reuse this result set to answer new questions based around higher levels of aggregation.

 

Aggregated results from query contained within BLOB column

 

The explain plan for this query shows the new sort keywords (GROUP BY APPROX) which tell us that approximate processing has been used as part of this query.

Explain plan showing new sort keywords

 

If we want to convert the BLOB data into a readable format we can transform it by using the TO_APPROX_xxx function as follows:

 

SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_DETAIL(s.cust_id)) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

 

This creates the following results:

shows use of TO_APPROX function to transform blob data

Alternatively, we could create a table using the above query and then simply pass the BLOB column directly into the TO_APPROX function as follows:

CREATE TABLE agg_cd AS
SELECT
 t.calendar_year,
 t.calendar_quarter_number AS qtr,
 p.prod_category_desc AS category_desc,
 p.prod_subcategory_desc AS subcategory_desc,
 APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

Using this table we can simplify our query to return the approximate number of distinct customers directly from the above table:

SELECT
 calendar_year,
 qtr,
 category_desc,
 subcategory_desc,
 TO_APPROX_COUNT_DISTINCT(acd_agg)
FROM agg_cd
ORDER BY calendar_year, qtr, category_desc, subcategory_desc;

which returns the same results as before - as you would expect!

 

Using TO_APPROX function directly against BLOB column

 

Using the aggregated table as our source we can now change the levels that we wish to calculate without having to go back to the original source table and again scan all the rows. However, to extract the new aggregations we need to introduce the third function, APPROX_COUNT_DISTINCT_AGG, to our query and wrap it within the TO_APPROX_COUNT_DISTINCT function to see the results:

SELECT
 calendar_year,
 subcategory_desc,
 TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(acd_agg))
FROM agg_cd
GROUP BY calendar_year, subcategory_desc
ORDER BY calendar_year, subcategory_desc;

which will return the following results based only on the new combination of levels included in the GROUP BY clause:

 

Using higher level aggregations

 

 

Summary

This post has reviewed the three new functions that we have introduced in Database 12c Release 2 that allow you to reuse aggregated approximate result sets:

  • APPROX_xxxxxx_DETAIL
  • APPROX_xxxxxx_AGG
  • TO_APPROX_xxxxxx

 Database 12c Release 2 makes it possible to intelligently aggregate approximations. In the next post I will explore how you can combine approximate processing with existing query rewrite functionality so you can have intelligent approximate query rewrite. 

 


It's out now - Database 12c Release 2 available for download


Database 12c Release 2 available for download 

Yes, it’s the moment the world has been waiting for: the latest generation of the world’s most popular database, Oracle Database 12c Release 2 (12.2) is now available everywhere - in the Cloud and on-premises. You can download this latest version from the database home page on OTN - click on the Downloads tab.

So What’s New in 12.2 for Data Warehousing?

This latest release provides some incredible new features for data warehousing and big data. If you attended last year’s OpenWorld event in San Francisco then you probably already know all about the new features that we have added to 12.2 - check out my blog post from last year for a comprehensive review of #oow16:

Blog: The complete review of data warehousing and big data content from Oracle OpenWorld 2016

If you missed OpenWorld, and if you are a data warehouse architect, developer or DBA, then here are the main feature highlights of 12.2 with links to additional content from OpenWorld and my data warehouse blog:

 

General Database Enhancements

1) Partitioning

Partitioning: External Tables
Partitioned external tables provide both the functionality to map partitioned Hive tables into the Oracle Database ecosystem and declarative partitioning on top of any Hadoop Distributed File System (HDFS) based data store.

Partitioning: Auto-List Partitioning
The database automatically creates a separate (new) partition for every distinct partition key value of the table.

Auto-list partitioning removes the management burden from DBAs of manually maintaining partitions for a large number of distinct key values that each require an individual partition. It also automatically copes with unplanned partition key values without the need for a DEFAULT partition.

Partitioning: Read-Only Partitions
Partitions and sub-partitions can be individually set to a read-only state. This then disables DML operations on these read-only partitions and sub-partitions. This is an extension to the existing read-only table functionality. Read-only partitions and subpartitions enable fine-grained control over DML activity. 

Partitioning: Multi-Column List Partitioning
List partitioning functionality is expanded to enable multiple partition key columns. Using multiple columns to define the partitioning criteria for list partitioned tables enables new classes of applications to benefit from partitioning.
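
As a rough illustration of the auto-list, read-only partition and multi-column list partitioning features described above (using hypothetical tables rather than the SH schema):

-- Auto-list partitioning: new partitions are created automatically for new key values
CREATE TABLE orders_by_region (
  order_id NUMBER,
  region   VARCHAR2(30),
  amount   NUMBER
)
PARTITION BY LIST (region) AUTOMATIC
(PARTITION p_emea VALUES ('EMEA'));

-- Read-only partitions: disable DML against historical data
ALTER TABLE orders_by_region MODIFY PARTITION p_emea READ ONLY;

-- Multi-column list partitioning: the partition key is a combination of columns
CREATE TABLE orders_by_region_channel (
  order_id NUMBER,
  region   VARCHAR2(30),
  channel  VARCHAR2(30)
)
PARTITION BY LIST (region, channel)
(PARTITION p_emea_web    VALUES (('EMEA','WEB')),
 PARTITION p_emea_direct VALUES (('EMEA','DIRECT')));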

For more information about partitioning see:

#OOW16 - Oracle Partitioning: Hidden Old Gems and Great New Tricks, by Hermann Baer, Senior Director Product Management

Partitioning home page on OTN

2) Parallel Execution

Parallel Query Services on RAC Read-Only Nodes
Oracle parallel query services on Oracle RAC read-only nodes represents a scalable parallel data processing architecture. The architecture allows for the distribution of a high number of processing engines dedicated to parallel execution of queries.

For more information about parallel execution see: 

#OOW16 - The Best Way to Tune Your Parallel Statements: Real-Time SQL Monitoring by Yasin Baskan, Senior Principal Product Manager

Parallel Execution home page on OTN

 

Schema Enhancements

Dimensional In-Database analysis with Analytic Views

Analytic views provide a business intelligence layer over a star schema, making it easy to extend the data set with hierarchies, levels, aggregate data, and calculated measures. Analytic views promote consistency across applications. By defining aggregation and calculation rules centrally in the database, the risk of inconsistent results in different reporting tools is reduced or eliminated.

The analytic view feature includes new DDL statements, such as CREATE ATTRIBUTE DIMENSION, CREATE HIERARCHY and CREATE ANALYTIC VIEW, new calculated measure expression syntax, and new data dictionary views. These analytic views allow data warehouse and BI developers to extend the star schema with time series and other calculations, eliminating the need to define calculations within the application. Calculations can be defined in the analytic view and can be selected by including the measure name in the SQL select list.

For more information about Analytic Views see:

#OOW16 - Analytic Views: A New Type of Database View for Simple, Powerful Analytics by Bud Endress, Director, Product Management



SQL Enhancements

Cursor-Duration Temporary Tables Cached in Memory
Complex queries often process the same SQL fragment (query block) multiple times to answer a question. The results of these queries are stored internally, as cursor-duration temporary tables, to avoid the multiple processing of the same query fragment. With this new functionality, these temporary tables can reside completely in memory avoiding the need to write them to disk. Performance gains are the result of the reduction in I/O resource consumption.

Enhancing CAST Function With Error Handling
The existing CAST function is enhanced to return a user-specified value in the case of a conversion error instead of raising an error. This new functionality provides more robust and simplified code development.
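
For example, a simple sketch using literals (the format mask and default values here are just illustrations):

SELECT
  CAST('31-02-2017' AS DATE DEFAULT NULL ON CONVERSION ERROR, 'DD-MM-YYYY') AS cast_date,
  CAST('n/a' AS NUMBER DEFAULT 0 ON CONVERSION ERROR)                       AS cast_number
FROM dual;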

New SQL and PL/SQL Function VALIDATE_CONVERSION
The new function, VALIDATE_CONVERSION, determines whether a given input value can be converted to the requested data type. The VALIDATE_CONVERSION function provides more robust and simplified code development.
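
A quick sketch: the function returns 1 if the value can be converted and 0 if it cannot:

SELECT
  VALIDATE_CONVERSION('1234.56' AS NUMBER)                AS is_valid_number,
  VALIDATE_CONVERSION('29-02-2017' AS DATE, 'DD-MM-YYYY') AS is_valid_date
FROM dual;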

Enhancing LISTAGG Functionality
LISTAGG aggregates the values of a column by concatenating them into a single string. New functionality is added for managing situations where the length of the concatenated string is too long. Developers can now control the process for managing overflowing LISTAGG aggregates. This increases the productivity and flexibility of this aggregation function.

Approximate Query Processing
This release extends the area of approximate query processing by adding approximate percentile aggregation. With this feature, the processing of large volumes of data is significantly faster than the exact aggregation. This is especially true for data sets that have a large number of distinct values with a negligible deviation from the exact result.

Approximate query aggregation is a common requirement in today's data analysis. It optimizes the processing time and resource consumption by orders of magnitude while providing almost exact results. Approximate query aggregation can be used to speed up existing processing.

Parallel Recursive WITH Enhancements
Oracle Database supports recursive queries through the use of the proprietary CONNECT BY clause and an ANSI compliant recursive WITH clause. The parallel recursive WITH clause enables this type of query to run in parallel mode. These types of queries are typical with graph data found in social graphs, such as Twitter graphs or call records, and are commonly used in transportation networks (for example, for flight paths, roadways, and so on).

Recursive WITH ensures the efficient computation of the shortest path from a single source node to single or multiple destination nodes in a graph. Bi-directional searching is used to make this computation efficient: a bi-directional search starts from both the source and destination nodes and then advances the search in both directions.
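
A minimal sketch of an ANSI recursive WITH query over a hypothetical ROUTES(origin, destination) table, finding the smallest number of hops from a single source node:

WITH paths (origin, destination, hops) AS (
  SELECT origin, destination, 1
  FROM   routes
  WHERE  origin = 'SFO'
  UNION ALL
  SELECT p.origin, r.destination, p.hops + 1
  FROM   paths p
  JOIN   routes r ON r.origin = p.destination
  WHERE  p.hops < 5  -- cap the recursion depth to cope with cycles
)
SELECT destination, MIN(hops) AS shortest_path
FROM   paths
GROUP BY destination;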

 

For more information about the new data warehouse SQL enhancements see:

 

In addition to the above features, we have made a lot of enhancements and added new features to the Optimizer and there is a comprehensive review by Nigel Bayliss, senior principal product manager, available on the optimizer blog. Obviously, the above is my take on what you need to know about 12.2 and it’s not meant to be an exhaustive list of all the data warehouse and big data features. For the complete list of all the new features in 12.2 please refer to the New Features Guide in the database documentation set.

I would really like to thank my amazing development team for all their hard work on the above list of data warehouse features and all the time they have spent proof-reading and fact-checking my blog posts on these new features.

Enjoy using this great new release and check out all the 12.2 tutorials and scripts on LiveSQL!

 

 


MATCH_RECOGNIZE: Can I use MATCH_NUMBER() as a filter?

Espresso Machine

Recently I spotted a post on OTN that asked the question: Can MATCH_RECOGNIZE skip out of partition? This requires a bit more detail because it raises all sorts of additional questions. Fortunately the post included more information which went something like this:
after a match is found I would like match_recognize to stop searching - I want at most one match per partition. I don’t want to filter by MATCH_NUMBER() in an outer query - that is too wasteful (or, in some cases, I may know in advance that there is at most one match per partition, and I don’t want match_recognize to waste time searching for more matches which I know don't exist).
Can MATCH_RECOGNIZE do this? Short answer is: NO.
Long answer is: Still NO.

Going back to the original question… you could interpret it as asking “is it possible to only return the first match”? The answer to this question is YES, it is possible.

There are a couple of different ways of doing it. Let’s use our good old “TICKER”  data set. The point of this exercise is to simply show that there are different ways to achieve the same result but in the end you need to look beyond what is passed back from MATCH_RECOGNIZE to understand what is going on as we process the rows from our TICKER table…

In simple terms, I only want my query to return the first match. Here is my starting query:
SELECT 
symbol,
tstamp,
price,
match_number,
classifier,
first_x,
last_y
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(x.tstamp) AS first_x,
LAST(y.tstamp) AS last_y,
MATCH_NUMBER() AS match_number,
CLASSIFIER() AS classifier
ALL ROWS PER MATCH WITH UNMATCHED ROWS
AFTER MATCH SKIP PAST LAST ROW
PATTERN (strt X+ Y)
DEFINE
X AS (price <= PREV(price)),
Y AS (price >= PREV(price))
)
ORDER BY symbol, match_number, tstamp asc;
This is the output from the above query which shows that for each symbol we have multiple matches of our pattern:

Starting Query


Returning only the 1st match

If we want to return just the first match it is simply a matter of applying a filter on MATCH_NUMBER() as follows:

SELECT 
symbol,
tstamp,
price,
match_number,
classifier,
first_x,
last_y
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(x.tstamp) AS first_x,
LAST(y.tstamp) AS last_y,
MATCH_NUMBER() AS match_number,
CLASSIFIER() AS classifier
ALL ROWS PER MATCH WITH UNMATCHED ROWS
AFTER MATCH SKIP PAST LAST ROW
PATTERN (strt X+ Y)
DEFINE
X AS (price <= PREV(price)),
Y AS (price >= PREV(price))
)
WHERE match_number = 1
ORDER BY symbol, match_number, tstamp asc;
which returns the desired results:
Filter Query

BUT have we saved any processing? That is to say: did MATCH_RECOGNIZE stop searching for matches after the first match was found? NO! Checking the explain plan we can see that all 60 rows from our table were processed:
Explain Plan

Anyway, the original post pointed out that simply filtering was not what they wanted, so we can discount using MATCH_NUMBER within the WHERE clause - although it does sort of achieve the result we wanted.

Let’s try an alternative approach. Can we limit the number of rows that are processed by using the exclude syntax within the PATTERN clause?

SELECT 
symbol,
tstamp,
price,
match_number,
classifier,
first_x,
last_y
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(x.tstamp) AS first_x,
LAST(y.tstamp) AS last_y,
MATCH_NUMBER() AS match_number,
CLASSIFIER() AS classifier
ALL ROWS PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (strt X+ Y c*)
DEFINE
X AS (price <= PREV(price)),
Y AS (price >= PREV(price))
)
ORDER BY symbol, match_number, tstamp asc;
I have added another pattern variable “c” but made it always true by not providing a definition within the DEFINE clause.

Single Match for each symbol
This is getting close to what we might need because now we have only one match for each symbol. Therefore, if we now use the exclude syntax around the pattern variable c we should be able to remove all matches except the first!
SELECT 
symbol,
tstamp,
price,
match_number,
classifier,
first_x,
last_y
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(x.tstamp) AS first_x,
LAST(y.tstamp) AS last_y,
MATCH_NUMBER() AS match_number,
CLASSIFIER() AS classifier
ALL ROWS PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (strt X+ Y {-c*-})
DEFINE
X AS (price <= PREV(price)),
Y AS (price >= PREV(price))
)
ORDER BY symbol, match_number, tstamp asc;

which does in fact return exactly the same result as our second query where we applied a filter on the column MATCH_NUMBER:

Filter Query

but if we check the explain plan we can see that yet again all 60 rows were processed.

Explain Plan

Therefore, we have got the right result but we have not been able to actually halt the MATCH_RECOGNIZE processing after the first match has been found.

Returning only the 2nd match

What if we wanted to return only the 2nd match? Well, for this use case the exclude syntax is not going to work. The only viable solution in this situation would be to use the match_number column and apply a filter to find the required match. However, all rows from the input table will be processed!

…and the final answer is: Enhancement Request

Let’s start with the simple answer to our original problem: after a match is found I would like match_recognize to stop searching.
Alas, there is definitely no way to stop MATCH_RECOGNIZE processing all the rows passed to it. To make this happen we would need to extend the AFTER MATCH SKIP TO syntax to include phrases that let us call a halt to the pattern matching process. What we need is something like “AFTER MATCH SKIP TO END”; however, this assumes that only the first match is important.

What if you wanted the first and the second match, or maybe it’s the second match that’s of most interest? What we really need then is something like the following: “AFTER MATCH 'N' SKIP TO END”, where 'N' indicates the maximum number of matches that you want to process before jumping to the end of the partition.

Assuming I can find enough valid use cases I will put this on the “enhancement” list for MATCH_RECOGNIZE. If you have some great use cases for this scenario then please send me the details (keith.laker@oracle.com).


Sneak preview of demo for Oracle Code events

I will be presenting at a number of the Oracle Code events over the coming months on the subject of…..(drum roll please) SQL pattern matching. Oracle Code is a great series of conferences dedicated to developers who want to get the absolute maximum benefit from using today's cutting edge technologies. If you want to register for any of the dates listed below then follow this link to the registration page.

North and Latin America

  • San Francisco - March 1, 2017
  • Austin - March 8, 2017
  • New York City - March 21, 2017
  • Washington DC - March 27, 2017
  • Toronto - April 18, 2017
  • Atlanta - June 22, 2017
  • Sao Paulo - June 27, 2017
  • Mexico City - June 29, 2017

Europe and Middle East

  • London - April 20, 2017
  • Berlin - April 24, 2017
  • Prague - April 28, 2017
  • Moscow - May 22, 2017
  • Brussels - June 6, 2017
  • Tel Aviv - July 11, 2017

Asia

  • New Delhi - May 10, 2017
  • Tokyo - May 18, 2017
  • Beijing - July 14, 2017
  • Sydney - July 18, 2017
  • Seoul - August 30, 2017
  • Bangalore - August 4, 2017

Back to my session... the actual topic of my session is: Simplified and fast fraud detection. The overall aim of this session is to demonstrate the key benefits of using SQL row pattern matching techniques compared to using other programming languages. As part of the presentation I will be using live demos to review a specific use case related to fraud detection where I will walk through the MATCH_RECOGNIZE clause and explain the concepts and keywords. The demo will use a simple five step framework to construct pattern matching queries.

I will aim to show how easy it is to write and amend SQL-based pattern matching queries as requirements change and the slide deck includes a link to the video from OpenWorld 2015 where we showed a sophisticated fraud detection application that used pattern matching and spatial to create an analytical mash-up that processed a data stream in real-time.
If you would like a sneak preview of the presentation and the demo then follow these links:
If you are planning to attend any of the above events then please let me know as it would be great to meet up and talk about your experiences with analytic SQL and, especially, pattern matching. Email is the usual address keith.laker@oracle.com, or you can ping me on Twitter - @ASQLBarista.


My query just got faster - brief introduction to 12.2 in-memory cursor duration temp tables


This post covers one of the new SQL performance enhancements that we incorporated into Database 12c Release 2. All of these enhancements are completely automatic, i.e. transparent to the calling app/developer code/query. These features are enabled by default because who doesn’t want their queries running faster with zero code changes?

So in this post I am going to focus on the new In-Memory “cursor duration” temporary table feature. Let’s start by looking at cursor duration temp tables…

Above image courtesy of wikimedia.org

What is a cursor duration temp table?

This is a feature that has been around for quite a long time. Cursor duration temporary tables (CDTs) are used to materialize intermediate results of a query to improve performance or to support complex multi-step query execution. Typical examples of queries that commonly use cursor duration temp tables are those using GROUPING SETS, star transformations and materialized WITH clause subqueries.
What happens during the query execution process, assuming CDTs are used to materialize intermediate results, is that a temporary table is created during compilation and the query is rewritten to refer to the temporary table(s). Before the actual query executes, the temporary tables are populated using “Insert Direct Load”. The required temporary segments are allocated when the data is loaded and the segment information is managed within the session for the duration of the query. Obviously, the scope of data in the CDT is private to the specific query execution. There is no exposed mechanism available today that allows you to view the data within a CDT except through tracing.
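As a minimal sketch of the WITH clause case (assuming the SH sample schema, and using the undocumented MATERIALIZE hint purely to force the behaviour for illustration), the following query materializes its subquery into a cursor duration temp table because it is referenced twice; the TEMP TABLE TRANSFORMATION and LOAD AS SELECT steps then show up in the execution plan:

WITH channel_totals AS (
  SELECT /*+ MATERIALIZE */ channel_id, SUM(amount_sold) AS total_sold
  FROM   sales
  GROUP  BY channel_id
)
SELECT c.channel_id, c.total_sold
FROM   channel_totals c
WHERE  c.total_sold > (SELECT AVG(total_sold) FROM channel_totals);

-- display the plan of the last statement executed in this session
-- (run with SET SERVEROUTPUT OFF so the last cursor really is the query above)
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'BASIC'));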

So what is an In-Memory cursor duration temp table?

An In-Memory cursor duration temp table (IM CDT) is simply a CDT whose data is stored in memory, which results in a "significant" performance boost for queries that make multiple passes over a data set or need to materialize intermediate results. The switch to using in-memory tables means that your queries should see a reduction in I/O since each pass over the data set does not incur additional I/O operations to access that data.
There may be times when there is insufficient memory available to load all the data, so what happens in these situations? When this happens, local (private) temp segments are allocated to store the excess data, and when an in-memory cursor duration temp table is queried both the in-memory data and the private temp segments are read to return the complete result.

Will all my queries use this new feature?

Our internal algorithms determine when and how this new feature is used. At the moment only serial queries that make multiple passes over a data set or queries that need to materialize intermediate results will use this feature. Don’t be concerned if your queries do not end up using these new in-memory cursor duration temp tables. There is nothing you can do to force their use within a query. The point of this post is to simply make you aware of a new term that could, potentially, appear in your explain plans. Obviously, going forward we will explore the possibility of expanding the scope of this feature to cover other types of queries.

Do I need to license database in-memory option?

No. There is no need to license the costed Database In-Memory option. Of course, if you are using the Exadata and Database Cloud services then really useful analytical options such as advanced analytics (data mining, machine learning), spatial and graph analytics and in-memory are included in most of the service prices. If that isn’t good enough reason to move to the Oracle Cloud then I don’t know what is!

Where does the memory come from?

In broad general terms the memory used by In-Memory "cursor duration" temporary tables comes from the PGA pool. Does this mean you might need to increase the size of your PGA memory area? As usual, "it depends….." on a lot of different factors, including whether you have a lot of queries today that use cursor duration temporary tables and which are likely to switch over to using the new in-memory cursor duration temporary tables. All I can say is: monitor your usage of PGA and determine if you need to increase the size of your PGA because you are running out of resources. Of course, if a query cannot allocate sufficient memory to use in-memory "cursor duration" temporary tables it will simply revert back to using the pre-12.2 "cursor duration" temporary tables.
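If you want a rough, high-level view of PGA consumption before and after making the move, one simple starting point (just a sketch - your own monitoring approach may well differ) is to keep an eye on V$PGASTAT:

-- these particular statistics are reported in bytes
SELECT name, ROUND(value/1024/1024) AS mb
FROM   v$pgastat
WHERE  name IN ('total PGA allocated',
                'maximum PGA allocated',
                'total PGA inuse');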
That’s the background stuff all covered so now we can look at a couple of SQL code examples to see how this all works in practice.

Sample Schema

Let's look at a simple query using the sales history sample schema. To download this sample schema go to the database downloads page on OTN, find your operating system, click on the "See All" link, then scroll down until you find the download link for the "Oracle Database xxxxx Examples". Finally, follow the installation instructions in the Oracle Documentation to install the SH schema into your database instance.
Alternatively you can access the sales history schema using our new web-based LiveSQL. You will need to create an account if you don’t already have one. 

Example: GROUP BY with GROUPING SETS

What we want to find is the total revenue for each product and promotion during the period 01-Jan-1998 to 31-Dec-2000, along with the total sales in each channel. We can do this by using the GROUP BY GROUPING SETS feature (a very big topic for another day). Here's the query we need to run:

SELECT /*+MONITOR */
p.prod_category,
x.promo_id,
c.channel_class,
t.calendar_quarter_desc,
s.time_id,
SUM(s.amount_sold)
FROM sales s, products p, promotions x, channels c, times t
WHERE s.time_id
BETWEEN to_date('1998-01-01', 'YYYY-MM-DD')
AND to_date('2000-12-31', 'YYYY-MM-DD')
AND p.prod_id = s.prod_id
AND x.promo_id = s.promo_id
AND c.channel_id = s.channel_id
AND t.time_id = s.time_id
GROUP BY GROUPING SETS
((p.prod_category, x.promo_id),
(c.channel_class, t.calendar_quarter_desc),
(p.prod_category, t.calendar_quarter_desc),
(s.time_id, t.calendar_quarter_desc))
ORDER BY 1,2,3,4,5;
Just to add some additional functionality to this example I have included a hint that will allow me to access the real-time SQL monitoring report within SQL Developer - for more information about this really great feature read this blog post by ThatJeffSmith: On Real-Time SQL Monitoring and the /*+MONITOR*/ Hint.
The examples below are not going to show huge performance gains simply because I am using a very small data set (the sales history sample schema) and I am running the queries in two separate VirtualBox machines on my laptop; the Big Data Lite image is also running a lot of other features, such as a complete Hadoop environment. Therefore, you just need to stay focused on the changes to I/O in the examples below.
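If you prefer to pull the same monitoring report from SQL*Plus or SQLcl rather than SQL Developer, a quick sketch (substitute the SQL_ID of your own monitored statement for the placeholder) is to use DBMS_SQLTUNE:

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id => '&sql_id',   -- SQL_ID of the monitored statement
         type   => 'TEXT') AS monitor_report
FROM   dual;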

Part A, Pre 12.2: using cursor duration temp tables

Using the GROUPING SETS feature we will create totals for each specific combination of product category, promo_id and channel class, along with quarters and months. The output is quite long because there are quite a few permutations of dimension members even in the sales history schema, so here is the first block of rows returned into SQL Developer so you can get a sense of the structure of the data set that is returned. This query is running in a 12.1 Database (I am actually using the current release of the Big Data Lite VirtualBox image, which includes Database 12c Release 1). I have truncated the output but you can see most of the results for the first grouping set of prod_category and promo_id. In total the query returns 1,205 rows.


Output from Query 1

The monitoring report in SQL Developer for the above query looks like this:



We can see that there are a number of distinct blocks of work but at the top of the plan we can see the TEMP TABLE TRANSFORMATION reference followed by LOAD AS SELECT with no further keywords, which is expected because we are not using 12.2 to run this query. About half-way down the report you can see the additional LOAD AS SELECT statements against the temp table containing the base level information we need to create the various total combinations within the GROUPING SETS clause.
If you want more information about the temp table transformation step then there is an excellent post on the optimizer blog: https://blogs.oracle.com/optimizer/entry/star_transformation. From the plan above you can see that the temp table is then reused during the rest of the query to construct the various totals for each combination of dimension members.
We can see that we are incurring I/O during each phase of the query: we are making 684 I/O requests and 172.6MB of I/O. Given that I am using two VMs running at the same time I don't see much point in focusing on the actual execution time. So that's our plan running in Database 12c Release 1.

Part B, 12.2: using in-memory cursor duration temp tables

Now let's switch over to using our 12.2 database - I am using the latest developer VirtualBox image that is posted on OTN. Using this environment we can see that if we re-run our query the result, in terms of rows returned, is the same - which is always great news.


Output from Query 1

Let’s take a look at the monitoring report for the same query now running in Database 12c Release 2:

SQL Developer Monitoring Report for 12-2 Query


The first thing to notice when comparing the two monitoring reports is that we have significantly reduced the amount of I/O in 12.2. In 12.1 our relatively simple grouping sets query generated 684 I/O requests and 172.6MB of I/O. Compare that with the data in the monitoring report for the same query running in 12.2 - 40 I/O requests and 9.9MB of I/O. This means that we have managed to improve the overall efficiency of our query by simply upgrading to 12.2.

Obviously your mileage will vary according to the details of the query you are executing, but that is a nice resource efficiency and performance boost that has required zero code changes and it's completely FREE!

In summary, with this GROUPING SETS example we have reduced the amount of I/O and the number of I/O requests through the use of in-memory cursor duration temp tables. As with the previous report, you will see continual references to "LOAD AS SELECT", however, in 12.2 there is an additional set of keywords which identify the use of the new in-memory cursor duration temp tables:
LOAD AS SELECT (CURSOR DURATION MEMORY)
In the bottom half of the report you should notice that the above statement covers two additional plan lines, HASH (GROUP BY) and TABLE ACCESS (FULL), which reference the temp table object; however, there are no I/O operations - which confirms the use of in-memory cursor duration temp tables.
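Outside of SQL Developer you can confirm the same thing directly from the cursor's plan in V$SQL_PLAN (again a sketch - replace the SQL_ID placeholder with the id of your own statement); the OPTIONS column should show CURSOR DURATION MEMORY against the load step:

SELECT id, operation, options, object_name
FROM   v$sql_plan
WHERE  sql_id    = '&sql_id'
AND    operation = 'LOAD AS SELECT';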

Summary

This post has covered just one of the many new SQL performance enhancements that we incorporated into Database 12c Release 2. I covered the most important features in my presentation at last year’s OpenWorld. A lot of these enhancements, including  In-Memory “cursor duration” temporary tables, are completely automatic, i.e. transparent to the calling app/query and they are enabled by default because who doesn’t want their queries running faster with zero code changes? Within this release of 12.2 we are limiting their use to just serial queries of the types listed at the start of the post.

Just to be absolutely clear - there are no hints or parameters you can set to force the use of In-Memory “cursor duration” temporary tables. Our internal algorithms will determine if this feature is used within your query. If In-Memory “cursor duration” temporary tables are used then you will see the following lines in your explain plans: LOAD AS SELECT (CURSOR DURATION MEMORY)

As I have outlined above, there are definite efficiency benefits to be gained from using this feature due to the reduction in I/O, which should also improve overall query performance - although your mileage will vary depending on your particular environment! If you would like to share your experiences of using this new feature then please contact me via email (keith.laker@oracle.com).




Using Zeppelin Notebooks with your Oracle Data Warehouse


Over the past couple of weeks I have been looking at one of the Apache open source projects called Zeppelin. It's a new style of application called a "notebook" which typically runs within your browser. The idea behind notebook-style applications like Zeppelin is to deliver an ad hoc data-discovery tool - at least that is how I see it being used. Like most notebook-style applications, Zeppelin provides a number of useful data-discovery features such as:
  • a simple way to ingest data
  • access to languages that help with data discovery and data analytics
  • some basic data visualization tools
  • a set of collaboration services for sharing notebooks (collections of reports)
Zeppelin is essentially a scripting environment for running ordinary SQL statements along with a lot of other languages such as Spark, Python, Hive, R etc. These are controlled by a feature called “interpreters” and there is a list of the latest interpreters available here.

A good example of a notebook-type of application is R Studio which many of you will be familiar with because we typically use it when demonstrating the R capabilities within Oracle Advanced Analytics. However, R Studio is primarily aimed at data scientists whilst Apache Zeppelin is aimed at other types of report developers and business users although it does have a lot of features that data scientists will find useful.

Use Cases

What’s a good use case for Zeppelin? Well, what I like about Zeppelin is that you can quickly and easily create a notebook, or workflow, that downloads a log file from a URL, reformats the data in the file and then displays the resulting data set as a graph/table.

Nothing really earth-shattering in that type of workflow except that Zeppelin is easy to install, it's easy to set up (once you understand its architecture), and it seems to be easy to share your results. Here's the simple workflow described above, which I built to load data from a file, create an external table over the data file and then run a report:



This little example shows how notebooks differ from traditional BI tools. Each of the headings in the above image (Download data from web url, Create directory to data file location, Drop existing staging table etc etc) is a separate paragraph within the “Data Access Tutorial” notebook.

The real power is that each paragraph can use a different language such as SQL, Java, shell scripting or Python. In the notebook shown above I start by running a shell script that pulls a data file from a remote server. Then using a SQL paragraph I create a directory object to access the data file. The next SQL paragraph drops my existing staging table and the subsequent SQL paragraph creates the external table over the data file. The final SQL paragraph looks like this:
%osql
select * from ext_bank_data
 
where %osql tells me the language, or interpreter, I am using - which in this case is SQL connecting to a specific schema in my database.
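To give a flavour of the earlier paragraphs in that notebook, here is a minimal sketch of the kind of external table DDL used in the "create external table" step (the column names, the directory object bank_dir and the file name are all hypothetical placeholders):

%osql
CREATE TABLE ext_bank_data (
  customer_id NUMBER,
  balance     NUMBER,
  region      VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY bank_dir            -- directory object created in an earlier paragraph
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('bank_data.csv')
)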

Building Dashboards 

You can even build relatively simple briefing books containing data from different data sets and even different data sources (Zeppelin supports an ever growing number of data sources) - in this case I connected Zeppelin to two different schemas in two different PDBs:

Screenshot of Zeppelin Dashboard Report

What’s really nice is that I can even view these notebooks on my smartphone (iPhone) as you can see below. The same notebook shown above appears on my iPhone screen in a vertical layout style to make best use of the screen real estate:


Zeppelin Report Running on iPhone7 Plus


I am really liking Apache Zeppelin because it's so simple to set up (I have various versions running on Mac OSX and Oracle Linux) and start. It has just enough features to be very useful and not overwhelming. I like the fact that I can create notebooks, or reports, using a range of different languages and show data from a range of different schemas/PDBs/databases alongside each other. It is also relatively easy to share those results. And I can open my notebooks (reports) on my iPhone.

Visualizations

There is a limited set of available visualizations within the notebook (report) editor when you are using a SQL-based interpreter (connector). Essentially you have a basic, scrollable table and five types of graph to choose from for viewing your data. You can interactively change the layout of a graph by clicking on the "settings" link but there are no formatting controls to alter the x or y labels - if you look carefully at the right-hand area graph in the first screenshot you will probably spot that the time value labels on the x-axis overlap each other.

Quick Summary

Now, this may be obvious but I would not call Zeppelin a data integration tool nor a BI tool for reasons that will become clear during the next set of blog posts.

Having said that, overall, Zeppelin is a very exciting and clever product. It is relatively easy to set up connections to your Oracle Database, the scripting framework is very powerful and the visualization features are good enough. It's a new type of application that is just about flexible enough for data scientists, power users and report writers.

What’s next?

In my next series of blog posts, which I am aiming to write over the next couple of weeks, I will explain how to download and install Apache Zeppelin, how to set up connections to an Oracle Database and how to use some of the scripting features to build reports similar to the ones above. If you are comfortable with writing your own shell scripts, SQL scripts and markup for formatting text then Zeppelin is a very flexible tool.



If you are already using Zeppelin against your Oracle Database and would like to share your experiences that would be great - please use the comments feature below or feel free to send me an email: keith.laker@oracle.com.
(image at top of post is courtesy of wikipedia)

Connecting Apache Zeppelin to your Oracle Data Warehouse


In my last post I provided an overview of the Apache Zeppelin open source project, which is a new style of application called a "notebook". These notebook applications typically run within your browser, so as an end user there is no desktop software to download and install.

Interestingly, I had a very quick response to this article asking about how to setup a connection within Zeppelin to an Oracle Database. Therefore, in this post I am going to look at how you can install the Zeppelin server and create a connection to your Oracle data warehouse.

The aim of this post is to walk you through the following topics:
  • Installing Zeppelin
  • Configuring Zeppelin
  • What is an interpreter
  • Finding and installing the Oracle JDBC drivers
  • Setting up a connection to an Oracle PDB
Firstly a quick warning! There are a couple of different versions of Zeppelin available for download. At the moment I am primarily using version 0.6.2 which works really well. Currently, for some reason, I am seeing performance problems with the latest iterations around version 0.7.x. I have discussed this with a few people here at Oracle and we are all seeing the same behaviour - queries will run, they just take 2-3 minutes longer for some unknown reason compared with earlier, pre-0.7.x, versions of Zeppelin.

In the interests of completeness in this post I will cover setting up a 0.6.2 instance of Zeppelin as well as a 0.7.1 instance.

Installing Zeppelin

The first thing you need to decide is where to install the Zeppelin software. You can run it on your own PC, on a separate server or on the same server that is running your Oracle Database. I run all my Linux-based database environments within VirtualBox images so I always install onto the same virtual machine as my Oracle database - it makes life easier for moving demos around when I am heading off to a user conference.

Step two is to download the software. The download page is here: https://zeppelin.apache.org/download.html.

Simply pick the version you want to run and download the corresponding compressed file - my recommendation, based on my experience, is to stick with version 0.6.2, which was released on Oct 15, 2016. I always download the full application - "Binary package with all interpreters" - just to make life easy; it also gives me access to the full range of connection options which, as you will discover in my next post, is extremely useful.

Installing Zeppelin - Version 0.6.2

After downloading the zeppelin-0.6.2-bin-all.tgz file onto my Linux Virtualbox machine I simply expand the file to create a “zeppelin-0.6.2-bin-all” directory. The resulting directory structure looks like this:



Of course you can rename the folder name to something more meaningful, such as “my-zeppelin” if you wish….obviously, the underlying folder structure remains the same!


Installing Zeppelin - Version 0.7.x

The good news is that if you want to install one of the later versions of Zeppelin then the download and unzip process is exactly the same. At this point in time there are two versions of 0.7, however, both 0.7.0 and 0.7.1 seem to suffer from poor query performance when using the JDBC driver (I have only tested the JDBC driver against Oracle Database but I presume the same performance issues are affecting other types of JDBC-related connections). As with the previous version of Zeppelin you can, if required, change the default directory name to something more suitable.
Now we have our notebook software unpacked and ready to go!

Configuring Zeppelin (0.6.2 and 0.7.x)

This next step is optional. If you have installed the Zeppelin software on the same server or virtual environment that runs your Oracle Database then you will need to tweak the default configuration settings to ensure there are no clashes with the various Oracle Database services. By default, you access the Zeppelin Notebook home page via the port 8080. Depending on your database environment this may or may not cause problems. In my case, this port was already being used by APEX, therefore, it was necessary to change the default port…

Configuring the Zeppelin http port

If you look inside the “conf” directory there will be a file named “zeppelin-site.xml.template”, rename this to “zeppelin-site.xml”. Find the following block of tags:
<property>
<name>zeppelin.server.port</name>
<value>8080</value>
<description>Server port.</description>
</property>
the default port settings in the conf file will probably clash with the APEX environment in your Oracle Database. Therefore, you will need to change the port setting to another value, such as:
<property>
<name>zeppelin.server.port</name>
<value>7081</value>
<description>Server port.</description>
</property>

Save the file and we are ready to go! It is worth spending some time reviewing the other settings within the conf file that let you use cloud storage services, such as the Oracle Bare Metal Cloud Object Storage service. For my purposes I was happy to accept the default storage locations for managing my notebooks and I have not tried to configure the use of an SSL service to manage client authentication. Obviously, there is a lot more work that I need to do around the basic setup and configuration procedures which hopefully I will be able to explore at some point in time - watch this space!

OK, now we have everything in place: software, check…. port configuration, check. It’s time to start your engine!

Starting Zeppelin

This is the easy part. Within the bin directory there is a shell script to run the Zeppelin daemon:
. ../my-zeppelin/bin/zeppelin-daemon.sh start

There is a long list of command line environment settings that you can use, see here: https://zeppelin.apache.org/docs/0.6.2/install/install.html. In my Virtualbox environment I found it useful to configure the following settings:
  • ZEPPELIN_MEM: amount of memory available to Zeppelin. The default setting is -Xmx1024m -XX:MaxPermSize=512m
  • ZEPPELIN_INTP_MEM: amount of memory available to the Zeppelin Interpreter (connection) engine and the default setting is derived from the setting of ZEPPELIN_MEM
  • ZEPPELIN_JAVA_OPTS: simply lists any additional JVM options
Therefore, my startup script looks like this:
export ZEPPELIN_MEM="-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m"
export ZEPPELIN_INTP_MEM="-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m"
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=8g -Dspark.cores.max=16"
. ../my-zeppelin/bin/zeppelin-daemon.sh start
 
 Fingers crossed, once Zeppelin has started the following message should appear on your command line:
 Zeppelin start                                             [  OK  ]

Connecting to Zeppelin

Everything should now be in place to test whether your Zeppelin environment is up and running. Open a browser and type the ip address/host name and port reference which in my case is: http://localhost:7081/#/ then the home page should appear:


The landing pad interface is nice and simple.

In the top right-hand corner you will see a green light which tells me that the Zeppelin service is up and running. “anonymous” is my user id because I have not enabled client side authentication. In the main section of the welcome screen you will see links to the help system and the community pages, which is where you can log any issues that you find.

The Notebook section is where all the work is done and this is where I am going to spend the next post exploring in some detail. If you are used to using a normal BI tool then Zeppelin (along with most other notebook applications) will take some getting used to because creating reports follows more of a scripting-style process rather than the wizard-driven click-click process you get with products like Oracle Business Intelligence. Anyway, more on this later.

What is an Interpreter?

To build notebooks in Zeppelin you need to make connections to your data sources. This is done using something called an “Interpreter”. This is a plug-in which enables Zeppelin to use not only a specific query language but also provides access to backend data-processing capabilities. For example, it is possible to include shell scripting code within a Zeppelin notebook by using the %sh interpreter. To access an Oracle Database we use the JDBC interpreter. Obviously, you might want to have lots of different JDBC-based connections - maybe you have an Oracle 11g instance, a 12cR1 instance and a 12c R2 instance. Zeppelin allows you to create new interpreters and define their connection characteristics.

It’s at this point that version 0.6.2 and versions 0.7.x diverge. Each has its own setup and configuration process for interpreters so I will explain the process for each version separately. Firstly, we need to track down some JDBC files…

Configuring your JDBC files

Finally, we have reached the point of this post - connecting Zeppelin to your Oracle data warehouse. But before we dive into setting up connections we need to track down some Oracle specific jdbc files. You will need to locate one of the following files to use with Zeppelin: ojdbc7.jar (Database 12c Release 1) or ojdbc8.jar (Database 12c Release 2).

You can either copy the relevant file to your Zeppelin server or simply point the Zeppelin interpreter to the relevant directory. My preference is to keep everything contained within the Zeppelin folder structure so I have taken my Oracle JDBC files and moved them to my Zeppelin server. If you want to find the JDBC files that come with your database version then you need to find the jdbc folder within your version-specific folder. In my 12c Release 2 environment this was located in the folder shown below:



alternatively, I could have copied the files from my local SQL Developer installation:


take the jdbc file(s) and copy them to the /interpreter/jdbc directory within your Zeppelin installation directory, as shown below:


Creating an Interpreter for Oracle Database

At last we are finally ready to create a connection to our Oracle Database! Make a note of the directory containing the Oracle JDBC file because you will need that information during the configuration process. There is a difference between the different versions of Zeppelin in terms of creating a connection to an Oracle database/PDB.

Personally, I think the process in version 0.7.x makes more sense but the performance of JDBC connections is truly dreadful for some reason. There has obviously been a major change of approach in terms of how connections are managed within Zeppelin and this seems to be causing a few issues. Digging around in the documentation it would appear that version 0.8.x will be available shortly so I am hoping the version 0.7.x connection issues will be resolved!

Process for creating a connection using version 0.6.2

Starting from the home page (just click on the word “Zeppelin” in the top left corner of your browser or open a new window and connect to http://localhost:7081/#/), then click on the username “anonymous” which will reveal a pulldown menu. Select “Interpreter” as shown below:


this will take you to the home page for managing your connections, or interpreters. Each query language and data processing language has its own interpreter and these are all listed in alphabetical order.



scroll down until you find the entry for jdbc:


Here you will see that the jdbc interpreter is already configured for two separate connections: postgres and hive. By clicking on the "edit" button on the right-hand side we can add new connection attributes. In this case I have removed the hive and postgres attributes and added the following new attributes:
  • osql.driver
  • osql.password
  • osql.url
  • osql.user
The significance of the "osql." prefix will become obvious when we start to build our notebooks - essentially this will be our reference to these specific connection details. I have also added a dependency by including an artefact that points to the location of my JDBC file. In the screenshot below you will see that I am connecting to the example sales history schema owned by user sh, password sh, which I have installed in my pluggable database dw2pdb2. The listener port for my JDBC connection is 1521.
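For reference, the values behind those attributes end up looking something like the following (a sketch based on my environment - the host name and service name will obviously differ in yours):

osql.driver      oracle.jdbc.driver.OracleDriver
osql.url         jdbc:oracle:thin:@//localhost:1521/dw2pdb2
osql.user        sh
osql.password    sh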

If you have access to SQL Developer then an easy solution for testing your connection details is to set up a new connection and run the test connection routine. If SQL Developer connects to your database/PDB using your JDBC connection string then Zeppelin should also be able to connect successfully. FYI…error messages in Zeppelin are usually messy, long listings of a Java stack trace, so it is not easy to work out where the problem actually originates. Therefore, the more you can test outside of Zeppelin the easier life will be - at least that is what I have found!

Below is my enhanced configuration for the jdbc interpreter:


The default.driver is simply the entry point into the Oracle jdbc driver which is oracle.jdbc.driver.OracleDriver. The last task is to add an artifact [sic] that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 1 driver stored in the ../zeppelin/intepreter/jdbc folder.

Process for creating a connection using version 0.7.x

As before, starting from the home page (just click on the word “Zeppelin” in the top left corner of your browser or open a new window and connect to http://localhost:7081/#/), then click on the username “anonymous” which will reveal a pulldown menu shown below:

Now, with versions 0.7.0 and 0.7.1, we need to actually create a new interpreter, so just click on the "+Create" button:


this will bring up the “Create new interpreter” form that will allow you to define the attributes for the new interpreter:



I will name my new interpreter “osql” and assign it to the JDBC group:


this will pre-populate the form with the default attributes needed to define a JDBC-type connection such as:
  • default.driver: driver entry point into the Oracle JDBC driver
  • default.password: Oracle user password
  • default.url: JDBC connection string to access the Oracle database/PDB
  • default.user: Oracle username
the initial form will look like this:


and in my case I need to connect to a PDB called dw2pdb2 on the same server, accessed via the listener port 1521; the username is sh and the password is sh. The only non-obvious entry is the default.driver which is oracle.jdbc.driver.OracleDriver. As before, the last task is to add an artifact [sic] that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 2 driver stored in the ../zeppelin/interpreter/jdbc folder.
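Putting that together, the completed attributes look something like this (again just a sketch from my environment - adjust the host name, service name and driver path to match yours):

default.driver     oracle.jdbc.driver.OracleDriver
default.url        jdbc:oracle:thin:@//localhost:1521/dw2pdb2
default.user       sh
default.password   sh
artifact           ../zeppelin/interpreter/jdbc/ojdbc8.jar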

Once you have entered the configuration settings, hit Save and your form should look like this:



Testing your new interpreter

To test that your interpreter will successfully connect to your database/PDB and run a SQL statement we need to create a new notebook. Go back to the home page and click on the "Create new note" link in the list on the left side of the screen.


Enter a name for your new note:


which will bring you to the notebook screen, which is where you write your scripts - in this case SQL statements. This is similar in layout and approach to many worksheet-based tools (SQL Developer, APEX SQL Worksheet etc). If you are using version 0.6.x of Zeppelin then you can bypass the following…

If you are using version 0.7.x then we have to bind our SQL interpreter (osql) to this new note which will allow us to run SQL commands against the sh schema. To add the osql interpreter simply click on the gear icon in the top right-hand side of the screen:


This will then show you the list of interpreters which are available to this new note. You can switch interpreters on and off by clicking on them, and for this example I have reduced the number of interpreters to just the following: markup (md), shell scripting (sh), file management (file), our Oracle SH PDB connection (osql) and jdbc connections (jdbc). Once you are done, click on the "Save" button to return to the note.

I will explain the layout of the note interface in my next post. For the purposes of testing the connection to my PDB I need to use the "osql" interpreter and give it a SQL statement to run. This is two lines of code, as shown here:


On the right side of the screen there is a triangle icon which will execute or "Run" my SQL statement:
SELECT sysdate FROM dual


Note that I have not included a semi-colon (;) at the end of the SQL statement! In version 0.6.2, if you include the semi-colon you will get a Java error. Version 0.7.x is a little more tolerant and does not object to having or not having a semi-colon.

Using my Virtualbox environment the first time I make a connection to execute a SQL statement the query takes 2-3 minutes to establish the connection to my PDB and then run the query. This is true even for simple queries such as SELECT * FROM dual. Once the first query has completed then all subsequent queries run in the normal expected timeframe (i.e. around the same time as executing the query from within SQL Developer).

Eventually, the result will be displayed. By default, output is shown in a tabular layout and, as you can see from the list of available icons, graph-based layouts are also available:


…and we have now established that the connection to our SH schema is working.

Summary

In this post we have covered the following topics:
  • How to install Zeppelin
  • How to configure and start Zeppelin
  • Finding and installing the correct Oracle JDBC drivers
  • How to set up a connection to an Oracle PDB and test the connection
As we have seen during this post, there are some key differences between the 0.6.x and 0.7.x versions of Zeppelin in terms of the way interpreters (connections) are defined. Now we have a fully working environment (Zeppelin connected to my Oracle 12c Release 2 PDB which includes sales history sample schema).

Therefore, in my next post I am going to look at how you can use the powerful notebook interface to access remote data files, load data into a schema, create both tabular and graph-based reports, briefing books and even design simple dashboards. Stay tuned for more information about how to use Zeppelin with Oracle Database.

If you are already using Zeppelin against your Oracle Database and would like to share your experiences that would be great - please use the comments feature below or feel free to send me an email: keith.laker@oracle.com.

(image at top of post is courtesy of wikipedia)



Big Data Warehousing Must See Guide for Oracle OpenWorld 2017

Front Cover for Must-See Guide

It’s here - at last! I have just pushed my usual must-see guide to the Apple iBooks Store.


The free big data warehousing Must-See guide for OpenWorld 2017 is now available for download from the Apple iBooks Store - click here, and yes it’s completely free. This comprehensive guide covers everything you need to know about this year’s Oracle OpenWorld conference so that when you arrive at Moscone Conference Center you are ready to get the most out of this amazing conference. The guide contains the following information:

  • Page 8 - On-Demand Videos
  • Page 17 - Justify Your Trip
  • Page 19 - Key Presenters
  • Page 41 - Must-See Sessions
  • Page 90 - Useful Maps
Chapter 1 - Introduction to the must-see guide. 
Chapter 2 - A guide to the key highlights from last year's conference so you can relive the experience or see what you missed. Catch the most important highlights from last year's OpenWorld conference with our on demand video service which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to attend this year's conference and how to justify it to your company.
Chapter 3 - Full list of Oracle Product Management and Development presenters who will be at this year’s OpenWorld. Links to all their social media sites are included alongside each profile. Read on to find out about the key people who can help you and your teams build the FUTURE using Oracle’s Data Warehouse and Big Data technologies. 
Chapter 4 - List of the "must-see" sessions at this year's OpenWorld by category. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.
Chapter 5 - Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages. 
Chapter 6 - Details of our exclusive web application for smartphones and tablets, which provides you with a complete guide to everything related to data warehousing and big data at OpenWorld 2017.
Chapter 7 - Information to help you find your way around the area surrounding the Moscone Conference Center; this section includes some helpful maps.
What’s missing? At the moment there is no information about hands-on labs or the demogrounds but as soon as that information is available I will update the contents and push it to the iBooks Store. Stay tuned for update notifications posted on Twitter, Facebook, Google+ and LinkedIn.

Let me know if you have any comments. Enjoy.


PDF version - Big Data Warehousing Must See Guide for Oracle OpenWorld 2017

Front Cover for Must-See Guide

….and now it’s here in PDF format as well!


The free big data warehousing Must-See guide for OpenWorld 2017 is now available for download in PDF format - click here, and yes it’s completely free. This comprehensive guide covers everything you need to know about this year’s Oracle OpenWorld conference so that when you arrive at Moscone Conference Center you are ready to get the most out of this amazing conference. The guide contains the following information:
  • Page 8 - On-Demand Videos
  • Page 17 - Justify Your Trip
  • Page 19 - Key Presenters
  • Page 41 - Must-See Sessions
  • Page 90 - Useful Maps
Chapter 1 - Introduction to the must-see guide.
Chapter 2 - A guide to the key highlights from last year's conference so you can relive the experience or see what you missed. Catch the most important highlights from last year's OpenWorld conference with our on demand video service which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to attend this year's conference and how to justify it to your company.
Chapter 3 - Full list of Oracle Product Management and Development presenters who will be at this year’s OpenWorld. Links to all their social media sites are included alongside each profile. Read on to find out about the key people who can help you and your teams build the FUTURE using Oracle’s Data Warehouse and Big Data technologies.
Chapter 4 - List of the "must-see" sessions at this year's OpenWorld by category. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.
Chapter 5 - Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages.
Chapter 6 - Details of our exclusive web application for smartphones and tablets, which provides you with a complete guide to everything related to data warehousing and big data at OpenWorld 2017.
Chapter 7 - Information to help you find your way around the area surrounding the Moscone Conference Center; this section includes some helpful maps.
What’s missing? At the moment there is no information about hands-on labs or the demogrounds but as soon as that information is available I will update the contents and push it to the iBooks Store. Stay tuned for update notifications posted on Twitter, Facebook, Google+ and LinkedIn.
To download the PDF version of this guide, click here.
Let me know if you have any comments. Enjoy.
 

MATCH_RECOGNIZE and predicates - everything you need to know


MATCH_RECOGNIZE and predicates

At a recent user conference I had a question about when and how predicates are applied when using MATCH_RECOGNIZE, so that's the purpose of this blog post. Will this post cover everything you will ever need to know on this topic? Probably!
Where to start….the first thing to remember is that the table listed in the FROM clause of your SELECT statement acts as the input into the MATCH_RECOGNIZE pattern matching process, and this raises the question of how and where predicates are actually applied. I briefly touched on this topic in part 1 of my deep dive series on MATCH_RECOGNIZE: SQL Pattern Matching Deep Dive - Part 1.
In that first post I looked at the position of predicates within the explain plan and their impact on sorting. In this post I am going to use the built in measures (MATCH_NUMBER and CLASSIFIER) to show the impact of applying predicates to the results that are returned.
First, if you need a quick refresher course in how to use the MATCH_RECOGNIZE built-in measures then see part 2 of the deep dive series: SQL Pattern Matching Deep Dive - Part 2, using MATCH_NUMBER() and CLASSIFIER().
As per usual I am going to use my normal stock ticker schema to illustrate the specific points. You can find this schema listed in most of the pattern matching examples on livesql.oracle.com. There are three key areas within the MATCH_RECOGNIZE clause that impact on predicates…
  1. PARTITION BY column
  2. ORDER BY column
  3. All other columns

1. Predicates on the PARTITION BY column

Let's start with a simple query:
select * from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn
ALL ROWS PER MATCH
PATTERN (strt)
DEFINE strt as 1=1

);

Note that we are using an always-true pattern variable STRT, which is defined as 1=1, to ensure that we process all rows. The pattern has no quantifier so it will be matched once and then reset to find the next match. As our ticker table contains 60 rows, the output also contains 60 rows.

Output with no predicates

Check out the column headed MN, which contains our MATCH_NUMBER() measure. This shows that within the first partition, for ACME, we matched the always-true event 20 times, i.e. all rows were matched. If we check the explain plan for this query we can see that all 60 rows (3 symbols, and 20 rows for each symbol) were processed:
Explain plan for query with no predicates

If we now apply a predicate on the PARTITION BY column, SYMBOL, then we can see that the first “block” of our output looks exactly the same, however, the explain plan shows that we have processed fewer rows - only 20 rows.
Let's modify and rerun our simple query:
select * from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn
ALL ROWS PER MATCH
PATTERN (strt)
DEFINE strt as 1=1

)
WHERE symbol = 'ACME';

 the results look similar but note that the output summary returned by SQL Developer indicates that only 20 rows were fetched:
Query with predicate on partition by column

notice that the match_number() column (mn) is showing 1 - 20 as values returned from the pattern matching process. If we look at the explain plan….
Explain for query with simple WHERE clause
…this also shows that we processed 20 rows - so partition elimination filtered out the other 40 rows before pattern matching started. Therefore, if you apply predicates on the PARTITION BY column then MATCH_RECOGNIZE is smart enough to perform partition elimination to reduce the number of rows that need to be processed.

Conclusion - predicates on the PARTITION BY column.

Predicates on the partition by column reduce the amount of data being passed into MATCH_RECOGNIZE.
Built-in measures such as MATCH_NUMBER work as expected in that a contiguous sequence is returned.
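If you want to verify where a predicate is being applied in your own queries, a quick sketch (the plan format options shown are just one reasonable choice) is to generate the plan with predicate information and check whether the filter appears on the table access feeding the MATCH RECOGNIZE step:

EXPLAIN PLAN FOR
SELECT * FROM ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol ORDER BY tstamp
  MEASURES match_number() AS mn
  ALL ROWS PER MATCH
  PATTERN (strt)
  DEFINE strt AS 1=1
)
WHERE symbol = 'ACME';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(format => 'BASIC +PREDICATE'));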

2. Predicates on the ORDER BY column

What happens if we apply a predicate to the ORDER BY column? Let’s amend the query and add a filter on the tstamp column:
select * from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn
ALL ROWS PER MATCH
PATTERN (strt)
DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND tstamp BETWEEN '01-APR-11' AND '10-APR-11';
 returns a smaller resultset of only 10 rows and match_number is correctly sequenced from 1-10 - as expected:
Results from predicates on PARTITION BY and ORDER BY columns
 however, the explain plan shows that we processed all the rows within the partition (20).
Explain plan for PARTITION BY and ORDER BY predicates

This becomes a little clearer if we remove the predicate on the SYMBOL column:

select * from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn
ALL ROWS PER MATCH
PATTERN (strt)
DEFINE strt as 1=1
)
WHERE tstamp BETWEEN '01-APR-11' AND '10-APR-11';

now we see that 30 rows are returned
Query with predicates only on ORDER BY column

but all 60 rows have actually been processed!
Explain for query with predicates on ORDER BY column

Conclusion

Filters applied to non-partition by columns are applied after the pattern matching process has completed: rows are passed in to MATCH_RECOGNIZE, the pattern is matched and then predicates on the ORDER BY/other columns are applied.
Is there a way to prove that this is actually what is happening?

3. Using other columns

Let's add another column to our ticker table that shows the day name for each trade. Now let's rerun the query with the predicate on the SYMBOL column:
select * from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn
ALL ROWS PER MATCH
PATTERN (strt)
DEFINE strt as 1=1

)
WHERE symbol = 'ACME';

the column to note is MN which contains a contiguous sequence of numbers from 1 to 20.
What happens if we filter on the day_name column and only keep the working-week days (Mon-Fri):

select * from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn
ALL ROWS PER MATCH
PATTERN (strt)
DEFINE strt as 1=1

)
WHERE symbol = 'ACME'
AND day_name in ('MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY');
Now if we look at the match_number column, MN, we can see that the sequence is no longer contiguous: the value in row 2 is now 4 and not 2, and in row 7 the value of MN is 11 even though the previous row showed 8:

It is still possible to “access” the rows that have been removed. Consider the following query with the measure PREV(day_name):
select * from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn,
prev(day_name) as prev_day
ALL ROWS PER MATCH
PATTERN (strt)
DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND day_name in ('MONDAY', 'WEDNESDAY', 'FRIDAY');

this returns the following:
Here you can see that on row 2 the value SUNDAY has been returned for PREV_DAY even though, looking at the filtered results, the previous day should logically be FRIDAY.
This has important implications for numerical calculations such as running totals, final totals, averages, counts, min and max etc. because these will take into account all the matched rows (depending on how your pattern is defined) prior to the final set of predicates (i.e. those on non-PARTITION BY columns) being applied.
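If you actually want the calculations to see only the filtered rows, one option (a sketch using the same ticker schema) is to apply the predicate in an inline view so that it is evaluated before the rows reach MATCH_RECOGNIZE; with this approach PREV(day_name) and COUNT(*) only ever see the kept rows and MATCH_NUMBER() is contiguous again:

SELECT *
FROM ( SELECT symbol, tstamp, price, day_name
       FROM   ticker
       WHERE  symbol = 'ACME'
       AND    day_name IN ('MONDAY', 'WEDNESDAY', 'FRIDAY') )
MATCH_RECOGNIZE(
  PARTITION BY symbol ORDER BY tstamp
  MEASURES match_number() AS mn,
           prev(day_name) AS prev_day,
           count(*)       AS total_rows
  ALL ROWS PER MATCH
  PATTERN (strt)
  DEFINE strt AS 1=1
);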

One last example

Let’s now change the always-true pattern to search for as many rows as possible (turn it into a greedy quantifier)
select symbol, tstamp, mn, price, day_name, prev_day, total_rows from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn,
prev(day_name) as prev_day,
count(*) as total_rows
ALL ROWS PER MATCH
PATTERN (strt+)
DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND day_name in ('MONDAY', 'WEDNESDAY', 'FRIDAY');

Now compare the results from the following two queries:
Query 1:
select symbol, tstamp, mn, price, day_name, prev_day, total_rows, avg_price, max_price 

from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn,
prev(day_name) as prev_day,
count(*) as total_rows,
trunc(avg(price),2) as avg_price,
max(price) as max_price
ALL ROWS PER MATCH
PATTERN (strt+)
DEFINE strt as 1=1
)
WHERE symbol='ACME';
 Query 2:
select symbol, tstamp, mn, price, day_name, prev_day, total_rows, avg_price, max_price from ticker
MATCH_RECOGNIZE(
PARTITION BY symbol ORDER BY tstamp
MEASURES match_number() as mn,
prev(day_name) as prev_day,
count(*) as total_rows,
trunc(avg(price),2) as avg_price,
max(price) as max_price
ALL ROWS PER MATCH
PATTERN (strt+)
DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND day_name in ('MONDAY', 'WEDNESDAY', 'FRIDAY');

The number of rows returned is different but the values for the calculated columns (previous day, count, average price and max price) are exactly the same:
Resultset 1:
Resultset 2:

Conclusion

When I briefly touched on this topic in part 1 of my deep dive series on MATCH_RECOGNIZE, SQL Pattern Matching Deep Dive - Part 1, the focus was on the impact predicates had on sorting - would additional sorting take place if predicates were used.
In this post I have looked at the impact on the data returned. Obviously by removing rows at the end of processing there can be a huge impact on calculated measures such as match_number, counts and averages etc.
Hope this has been helpful. If you have any questions then feel free to send me an email: keith.laker@oracle.com.

Main image courtesy of wikipedia


Sneak preview of BDW OpenWorld smartphone app...

DW and Big Data - OpenWorld 2017

UPDATED: Big Data Warehousing Must See Guide for Oracle OpenWorld 2017

Front Cover for Must-See Guide (** NEW ** Chapter 5)


*** UPDATED *** Must-See Guide now available as PDF and via Apple iBooks Store

This updated version now contains details of all the most important hands-on labs AND a day-by-day calendar. This means that our comprehensive guide now covers absolutely everything you need to know about this year’s Oracle OpenWorld conference. Now, when you arrive at Moscone Conference Center you are ready to get the absolute most out of this amazing conference.

The updated, and still completely free, big data warehousing Must-See guide for OpenWorld 2017 is now available for download from the Apple iBooks Store - click here - and in PDF format - click here.

Just so you know…this guide contains the following information:

  • Page 8 - On-Demand Videos
  • Page 11 - Justify Your Trip
  • Page 18 - Key Presenters
  • Page 39 - Must-See Sessions
  • Page 83 - Must-See Day-by-Day
  • Page 150 - Useful Maps
Chapter 1 - Introduction to the must-see guide. 
Chapter 2 - A guide to the key highlights from last year's conference so you can relive the experience or see what you missed. Catch the most important highlights from last year's OpenWorld conference with our on demand video service which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to attend this year's conference and how to justify it to your company.
Chapter 3 - Full list of Oracle Product Management and Development presenters who will be at this year’s OpenWorld. Links to all their social media sites are included alongside each profile. Read on to find out about the key people who can help you and your teams build the FUTURE using Oracle’s Data Warehouse and Big Data technologies. 
Chapter 4 - List of the “must-see” sessions and hands-on labs at this year’s OpenWorld by category. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017. 
Chapter 5 - Day-by-Day “must-see” guide. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017. 
Chapter 6 - Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages. 
Chapter 7 -  Details of our exclusive web application for smartphones and tablets provides you with a complete guide to everything related to data warehousing and big data at OpenWorld 2017. 
Chapter 8 - Information to help you find your way around the area surrounding the Moscone Conference Center; this section includes some helpful maps.
Let me know if you have any comments. Enjoy and see you in San Francisco.


#oow17 BDW Smartphone App Now Live

It’s only 4 days and counting until OpenWorld 2017 starts.

If you are coming to this year's conference then you will definitely want to use our completely free #oow17 online BDW app for smartphones and tablets, which is now live: https://keithlaker.github.io/Storyboard.html#LandingPad.

The app includes a day-by-day calendar of all the most important sessions covered in the comprehensive Big Data Warehousing Must-See Guide, see here: https://oracle-big-data.blogspot.co.uk/2017/08/updated-big-data-warehousing-must-see.html.

The day-by-day coverage breaks sessions down into 5 categories:
  • Data Warehousing and Cloud
  • Analytics and Machine Learning
  • Unstructured and development
  • Big Data
  • Hands-on labs
Each session is color coded to the above topics, making it easier to focus on the areas that are most important to you. Please send me any feedback. Hope the app is useful and enjoy OpenWorld 2017: Your Data Warehouse Transformation Starts Here.


Landing page which includes video for #oow17; main app page with links to the day-by-day session guides


If you have a QR Code reader then point your smartphone at this QR code to automatically load the smartphone web app:



OpenWorld 2017 - Must-See Sessions for Day 1 - Sunday

It all starts today -  OpenWorld 2017. Each day I will provide you with a list of must-see sessions and hands-on labs. This is going to be one of the most exciting OpenWorlds ever!

Today is Day 1 so here is my definitive list of Must-See sessions for the opening day. The list is packed full of really excellent speakers such as Franck Pachot, Ami Aharonovich, Galo Balda and Rich Niemiec. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts.

Of course you need to end your first day in Moscone North Hall D for Larry Ellison's welcome keynote - it's going to be a great one!

SUNDAY'S MUST-SEE GUIDE


Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday.

The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference.

If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.

Don't forget to make use of our Big DW #oow17 smartphone app which you can access by pointing your phone at this QR code:


OpenWorld 2017: Must-See Sessions for Day 2 - Monday


Day 2, Monday, is here and this is my definitive list of Must-See sessions for today. The list is packed full of sessions and labs that follow on from yesterday's (Sunday) big announcements around Oracle Autonomous Database and Oracle Autonomous Data Warehouse Cloud. These sessions are what Oracle OpenWorld is all about: the chance to learn about the latest technology from the real technical experts.

MONDAY'S MUST-SEE GUIDE


Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday.

The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well.

Have a great conference.

If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.

Don't forget to make use of our Big DW #oow17 smartphone app which you can access by pointing your phone at this QR code:

OpenWorld 2017: Must-See Sessions for Day 3 - Tuesday


Day 3, Tuesday, is here and this is my definitive list of Must-See sessions for today.

Today we are focused on the new features in Oracle Database 18c - multitenant, in-memory, Oracle Text, machine learning, Big Data SQL and more. These sessions are what Oracle OpenWorld is all about: the chance to learn about the latest technology from the real technical experts.

TUESDAY'S MUST-SEE GUIDE



Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday.

The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well.
Have a great conference.

If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.

Don't forget to make use of our Big DW #oow17 smartphone app which you can access by pointing your phone at this QR code:

OpenWorld 2017: Must-See Sessions for Day 4 - Wednesday


Day 4 is here, which makes today #Autonomous Wednesday. Included in my definitive list of Must-See sessions for today are two of THE most important sessions at this year's conference. You will not want to miss these two sessions:

[Image: details of the two must-see sessions]

The rest of the list is, of course, packed full of sessions and labs covering our Big Data Warehouse technologies and features. These sessions are what Oracle OpenWorld is all about: the chance to learn about the latest technology from the real technical experts.

WEDNESDAY'S MUST-SEE GUIDE



Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday.

The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well.
Have a great conference.

If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.

Don't forget to make use of our Big DW #oow17 smartphone app which you can access by pointing your phone at this QR code: