# 13.4: Cardinality

Often we are interested in the number of items in a set or subset. This is called the cardinality of the set.

Cardinality

The number of elements in a set is the cardinality of that set.

The cardinality of the set \(A\) is often notated as \(|A|\) or \(n(A)\).

Example 12

Let \(A=\{1,2,3,4,5,6\}\) and \(B=\{2,4,6,8\}\).

What is the cardinality of \(B\)? Of \(A \cup B\)? Of \(A \cap B\)?

Solution

The cardinality of \(B\) is 4, since there are 4 elements in the set.

The cardinality of \(A \cup B\) is 7, since \(A \cup B=\{1,2,3,4,5,6,8\}\), which contains 7 elements.

The cardinality of \(A \cap B\) is 3, since \(A \cap B=\{2,4,6\}\), which contains 3 elements.

Example 13

What is the cardinality of (P=) the set of English names for the months of the year?

Solution

The cardinality of this set is (12,) since there are 12 months in the year.

Sometimes we may be interested in the cardinality of the union or intersection of sets, but not know the actual elements of each set. This is common in surveying.

Example 14

A survey asks 200 people “What beverage do you drink in the morning”, and offers choices:

• Tea only
• Coffee only
• Both coffee and tea

Suppose 20 report tea only, 80 report coffee only, and 40 report both. How many people drink tea in the morning? How many people drink neither tea nor coffee?

Solution

This question can most easily be answered by creating a Venn diagram. We can see that we can find the people who drink tea by adding those who drink only tea to those who drink both: 60 people.

We can also see that those who drink neither are those not contained in any of the three other groupings, so we can count them by subtracting from the cardinality of the universal set, 200.

\(200-20-80-40=60\) people drink neither.

Example 15

A survey asks: Which online services have you used in the last month:

• Twitter
• Facebook
• Have used both

The results show 40% of those surveyed have used Twitter, 70% have used Facebook, and 20% have used both. What percentage of people have used neither Twitter nor Facebook?

Solution

Let \(T\) be the set of all people who have used Twitter, and \(F\) be the set of all people who have used Facebook. Notice that while the cardinality of \(F\) is \(70\%\) and the cardinality of \(T\) is \(40\%\), the cardinality of \(F \cup T\) is not simply \(70\%+40\%\), since that would count those who use both services twice. To find the cardinality of \(F \cup T\), we can add the cardinality of \(F\) and the cardinality of \(T\), then subtract those in the intersection that we've counted twice. In symbols,

\(\mathrm{n}(F \cup T)=\mathrm{n}(F)+\mathrm{n}(T)-\mathrm{n}(F \cap T)\)

\(\mathrm{n}(F \cup T)=70\%+40\%-20\%=90\%\)

Now, to find how many people have not used either service, we're looking for the cardinality of \((F \cup T)^{c}\). Since the universal set contains \(100\%\) of people and the cardinality of \(F \cup T\) is \(90\%\), the cardinality of \((F \cup T)^{c}\) must be the other \(10\%\).

The previous example illustrated two important properties.

Cardinality properties

\(\mathrm{n}(A \cup B)=\mathrm{n}(A)+\mathrm{n}(B)-\mathrm{n}(A \cap B)\)

\(\mathrm{n}\left(A^{c}\right)=\mathrm{n}(U)-\mathrm{n}(A)\)

Notice that the first property can also be written in an equivalent form by solving for the cardinality of the intersection:

\(\mathrm{n}(A \cap B)=\mathrm{n}(A)+\mathrm{n}(B)-\mathrm{n}(A \cup B)\)
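Both properties can be checked directly on small sets. The sketch below uses Python's built-in `set` type with the sets from Example 12; the universal set `U` (the integers 1 through 10) is an assumption added for illustration and is not part of the original example.

```python
# Sets from Example 12; U is an assumed universal set chosen for illustration.
A = {1, 2, 3, 4, 5, 6}
B = {2, 4, 6, 8}
U = set(range(1, 11))

union_size = len(A | B)          # n(A ∪ B)
intersection_size = len(A & B)   # n(A ∩ B)
complement_size = len(U - A)     # n(A^c)

# First property: n(A ∪ B) = n(A) + n(B) - n(A ∩ B)
assert union_size == len(A) + len(B) - intersection_size
# Second property: n(A^c) = n(U) - n(A)
assert complement_size == len(U) - len(A)

print(union_size, intersection_size, complement_size)  # 7 3 4
```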

Example 16

Fifty students were surveyed, and asked if they were taking a social science (SS), humanities (HM) or a natural science (NS) course the next quarter.

\(\begin{array}{ll} \text{21 were taking a SS course} & \text{26 were taking a HM course} \\ \text{19 were taking a NS course} & \text{9 were taking SS and HM} \\ \text{7 were taking SS and NS} & \text{10 were taking HM and NS} \\ \text{3 were taking all three} & \text{7 were taking none} \end{array}\)

How many students are only taking a SS course?

Solution

It might help to look at a Venn diagram.

From the given data, we know that there are 3 students in region \(e\) and 7 students in region \(h\).

Since 7 students were taking a SS and NS course, we know that \(n(d)+n(e)=7\). Since we know there are 3 students in region \(e\), there must be \(7-3=4\) students in region \(d\).

Similarly, since there are 10 students taking HM and NS, which includes regions \(e\) and \(f\), there must be

\(10-3=7\) students in region \(f\).

Since 9 students were taking SS and HM, there must be \(9-3=6\) students in region \(b\).

Now, we know that 21 students were taking a SS course. This includes students from regions \(a, b, d,\) and \(e\). Since we know the number of students in all but region \(a\), we can determine that \(21-6-4-3=8\) students are in region \(a\).

8 students are taking only a SS course.

Try it Now 4

One hundred fifty people were surveyed and asked if they believed in UFOs, ghosts, and Bigfoot.

\(\begin{array}{ll} \text{43 believed in UFOs} & \text{44 believed in ghosts} \\ \text{25 believed in Bigfoot} & \text{10 believed in UFOs and ghosts} \\ \text{8 believed in ghosts and Bigfoot} & \text{5 believed in UFOs and Bigfoot} \\ \text{2 believed in all three} & \end{array}\)

How many people surveyed believed in at least one of these things?

Starting with the intersection of all three circles, we work our way out. Since 10 people believe in UFOs and ghosts, and 2 believe in all three, that leaves 8 who believe in only UFOs and ghosts. We continue outward, filling in all the regions. Once we have, we can add up all those regions, getting 91 people in the union of all three sets. This leaves \(150-91=59\) who believe in none.
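The region-by-region counting used in Example 16 and Try it Now 4 can be written as a short calculation. This is a minimal sketch; the function name `venn3_counts` and its argument order are hypothetical choices made for illustration.

```python
def venn3_counts(n_a, n_b, n_c, n_ab, n_ac, n_bc, n_abc):
    """Return (members of A only, size of A ∪ B ∪ C) for three overlapping sets."""
    only_ab = n_ab - n_abc            # in A and B but not C
    only_ac = n_ac - n_abc            # in A and C but not B
    only_a = n_a - only_ab - only_ac - n_abc
    # Inclusion-exclusion for three sets.
    union = n_a + n_b + n_c - n_ab - n_ac - n_bc + n_abc
    return only_a, union


# Example 16: A = SS, B = HM, C = NS, 50 students surveyed.
only_ss, taking_any = venn3_counts(21, 26, 19, 9, 7, 10, 3)
print(only_ss, 50 - taking_any)       # 8 7

# Try it Now 4: A = UFOs, B = ghosts, C = Bigfoot, 150 people surveyed.
_, believers = venn3_counts(43, 44, 25, 10, 5, 8, 2)
print(believers, 150 - believers)     # 91 59
```

The second value printed for Example 16 matches the 7 students reported as taking none of the courses, which is a useful consistency check on the survey data.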

All tests are written in so-called feature files. Feature files are plain text files ending with .feature. A feature file can contain only one BDD Feature written in a natural language format called Gherkin. However, radish is able to run one or more feature files. The feature files can be passed to radish as arguments:

A Feature is the main part of a feature file. Each feature file must contain exactly one Feature. This Feature should represent a test for a single feature in your software, similar to a test class in your unit tests. The Feature is composed of a Feature sentence and a Feature description. The Feature sentence is a short, precise explanation of the feature that is tested with this Feature. The Feature description is a more verbose explanation of the feature that is tested. There you can answer the Why and What questions. A Feature has the following syntax:

A Feature must contain one or more Scenarios which are run when this feature file is executed.

## 13.2 Choosing an Optimizer Goal

By default, the goal of the query optimizer is the best throughput. This means that it chooses the least amount of resources necessary to process all rows accessed by the statement. Oracle can also optimize a statement with the goal of best response time. This means that it uses the least amount of resources necessary to process the first row accessed by a SQL statement.

Choose a goal for the optimizer based on the needs of your application:

For applications performed in batch, such as Oracle Reports applications, optimize for best throughput. Usually, throughput is more important in batch applications, because the user initiating the application is only concerned with the time necessary for the application to complete. Response time is less important, because the user does not examine the results of individual statements while the application is running.

For interactive applications, such as Oracle Forms applications or SQL*Plus queries, optimize for best response time. Usually, response time is important in interactive applications, because the interactive user is waiting to see the first row or first few rows accessed by the statement.

The optimizer's behavior when choosing an optimization approach and goal for a SQL statement is affected by the following factors:

### 13.2.1 OPTIMIZER_MODE Initialization Parameter

The OPTIMIZER_MODE initialization parameter establishes the default behavior for choosing an optimization approach for the instance. The possible values and description are listed in Table 13-2.

Table 13-2 OPTIMIZER_MODE Initialization Parameter Values

| Value | Description |
| --- | --- |
| ALL_ROWS | The optimizer uses a cost-based approach for all SQL statements in the session regardless of the presence of statistics and optimizes with a goal of best throughput (minimum resource use to complete the entire statement). This is the default value. |
| FIRST_ROWS_n | The optimizer uses a cost-based approach, regardless of the presence of statistics, and optimizes with a goal of best response time to return the first n rows; n can equal 1, 10, 100, or 1000. |
| FIRST_ROWS | The optimizer uses a mix of cost and heuristics to find a best plan for fast delivery of the first few rows. Note: Using heuristics sometimes leads the query optimizer to generate a plan with a cost that is significantly larger than the cost of a plan without applying the heuristic. FIRST_ROWS is available for backward compatibility and plan stability; use FIRST_ROWS_n instead. |

You can change the goal of the query optimizer for all SQL statements in a session by changing the parameter value in the initialization file or by using the ALTER SESSION SET OPTIMIZER_MODE statement. For example:

The following statement in an initialization parameter file establishes the goal of the query optimizer for all sessions of the instance to best response time:

The following SQL statement changes the goal of the query optimizer for the current session to best response time:

If the optimizer uses the cost-based approach for a SQL statement, and if some tables accessed by the statement have no statistics, then the optimizer uses internal information, such as the number of data blocks allocated to these tables, to estimate other statistics for these tables.

### 13.2.2 Optimizer SQL Hints for Changing the Query Optimizer Goal

To specify the goal of the query optimizer for an individual SQL statement, use one of the hints in Table 13-3. Any of these hints in an individual SQL statement can override the OPTIMIZER_MODE initialization parameter for that SQL statement.

Table 13-3 Hints for Changing the Query Optimizer Goal

| Hint | Description |
| --- | --- |
| FIRST_ROWS(n) | This hint instructs Oracle to optimize an individual SQL statement with a goal of best response time to return the first n rows, where n equals any positive integer. The hint uses a cost-based approach for the SQL statement, regardless of the presence of statistics. |
| ALL_ROWS | This hint explicitly chooses the cost-based approach to optimize a SQL statement with a goal of best throughput. |

See also: Chapter 16, "Using Optimizer Hints," for information on how to use hints.

### 13.2.3 Query Optimizer Statistics in the Data Dictionary

The statistics used by the query optimizer are stored in the data dictionary. You can collect exact or estimated statistics about physical storage characteristics and data distribution in these schema objects by using the DBMS_STATS package.

To maintain the effectiveness of the query optimizer, you must have statistics that are representative of the data. For table columns that contain values with large variations in number of duplicates, called skewed data, you should collect histograms.

The resulting statistics provide the query optimizer with information about data uniqueness and distribution. Using this information, the query optimizer is able to compute plan costs with a high degree of accuracy. This enables the query optimizer to choose the best execution plan based on the least cost.

If no statistics are available when using query optimization, the optimizer performs dynamic sampling, depending on the setting of the OPTIMIZER_DYNAMIC_SAMPLING initialization parameter. This may slow parse times, so for best performance the optimizer should have representative optimizer statistics.

"Viewing Histograms" for a description of histograms

Detailed Descriptions for the elements in the EnrollmentRequest resource.

This resource provides the insurance enrollment details to the insurer regarding a specified coverage.

The status of the resource instance.

This element is labeled as a modifier because the status contains codes that mark the request as not currently valid.

The date when this resource was created.

The Insurer who is target of the request.

The practitioner who is responsible for the services rendered to the patient.

Reference to the program or plan identification, underwriter or payor.

Need to identify the issuer to target for processing and for coordination of benefit processing.

®© HL7.org 2011+. FHIR Release 4 (Technical Correction #1) (v4.0.1) generated on Fri, Nov 1, 2019 09:35+1100.

## Contents

In the affine cipher, the letters of an alphabet of size m are first mapped to the integers in the range 0 … m − 1. The cipher then uses modular arithmetic to transform the integer that each plaintext letter corresponds to into another integer that corresponds to a ciphertext letter. The encryption function for a single letter is

$E(x) = (ax + b) \bmod m$

where the modulus m is the size of the alphabet and a and b are the keys of the cipher. The value a must be chosen such that a and m are coprime. The decryption function is

$D(y) = a^{-1}(y - b) \bmod m$

where $a^{-1}$ is the modular multiplicative inverse of a modulo m. That is, it satisfies the equation

$a \, a^{-1} \bmod m = 1$

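The multiplicative inverse of a exists only if a and m are coprime. Hence, without the restriction on a, decryption might not be possible. It can be shown as follows that the decryption function is the inverse of the encryption function:

$D(E(x)) = a^{-1}(E(x) - b) \bmod m = a^{-1}(((ax + b) \bmod m) - b) \bmod m = a^{-1}(ax + b - b) \bmod m = a^{-1}ax \bmod m = x \bmod m$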

Since the affine cipher is still a monoalphabetic substitution cipher, it inherits the weaknesses of that class of ciphers. The Caesar cipher is an Affine cipher with a = 1 since the encrypting function simply reduces to a linear shift. The Atbash cipher uses a = −1 .

Considering the specific case of encrypting messages in English (i.e. m = 26), there are a total of 286 non-trivial affine ciphers, not counting the 26 trivial Caesar ciphers. This number comes from the fact that there are 12 numbers less than 26 that are coprime with 26 (these are the possible values of a). Each value of a can have 26 different addition shifts (the b value); therefore, there are 12 × 26 or 312 possible keys, and removing the 26 Caesar ciphers (a = 1) leaves 286. This lack of variety renders the system highly insecure when considered in light of Kerckhoffs' Principle.
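The key count can be reproduced in a few lines. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from math import gcd

m = 26  # size of the English alphabet

# Values of a that are coprime with m, so that a modular inverse exists.
valid_a = [a for a in range(1, m) if gcd(a, m) == 1]

total_keys = len(valid_a) * m     # every valid a paired with every shift b
non_trivial = total_keys - m      # drop the 26 Caesar ciphers (a = 1)

print(valid_a)
print(len(valid_a), total_keys, non_trivial)  # 12 312 286
```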

The cipher's primary weakness comes from the fact that if the cryptanalyst can discover (by means of frequency analysis, brute force, guessing or otherwise) the plaintext of two ciphertext characters then the key can be obtained by solving a simultaneous equation. Since we know a and m are relatively prime this can be used to rapidly discard many "false" keys in an automated system.
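As a sketch of that attack, suppose two plaintext letters and their ciphertext letters are known. Subtracting the two congruences eliminates b, and a follows when the difference of the plaintext values is invertible modulo m. The helper name `recover_key` is hypothetical, and the sketch assumes Python 3.8 or later for `pow` with a negative exponent.

```python
def recover_key(x1, y1, x2, y2, m=26):
    """Solve y = (a*x + b) mod m for (a, b) from two known letter pairs."""
    dx = (x1 - x2) % m
    # pow raises ValueError when dx has no inverse modulo m; in that case
    # a different pair of known letters would be needed.
    dx_inv = pow(dx, -1, m)
    a = ((y1 - y2) * dx_inv) % m
    b = (y1 - a * x1) % m
    return a, b


# With A = 0, F = 5, H = 7, I = 8: suppose we learn that A encrypts to I
# and F encrypts to H.
print(recover_key(0, 8, 5, 7))  # (5, 8)
```

When the two plaintext values differ by a number that shares a factor with m, the system has several candidate solutions, which is where the coprimality of a and m helps discard false keys.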

The same type of transformation used in affine ciphers is used in linear congruential generators, a type of pseudorandom number generator. This generator is not a cryptographically secure pseudorandom number generator for the same reason that the affine cipher is not secure.

In these two examples, one encrypting and one decrypting, the alphabet is going to be the letters A through Z, and will have the corresponding values found in the following table.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

### Encrypting

In this encrypting example, [1] the plaintext to be encrypted is "AFFINE CIPHER", using the table mentioned above for the numeric values of each letter and taking a to be 5, b to be 8, and m to be 26, since there are 26 characters in the alphabet being used. Only the value of a has a restriction, since it has to be coprime with 26. The possible values that a could be are 1, 3, 5, 7, 9, 11, 15, 17, 19, 21, 23, and 25. The value of b can be arbitrary (it is the shift of the cipher), as long as a does not equal 1; otherwise the cipher reduces to a Caesar cipher. Thus, the encryption function for this example will be y = E(x) = (5x + 8) mod 26. The first step in encrypting the message is to write the numeric values of each letter.

plaintext A F F I N E C I P H E R
x 0 5 5 8 13 4 2 8 15 7 4 17

Now, take each value of x , and solve the first part of the equation, (5x + 8) . After finding the value of (5x + 8) for each character, take the remainder when dividing the result of (5x + 8) by 26. The following table shows the first four steps of the encrypting process.

plaintext A F F I N E C I P H E R
x 0 5 5 8 13 4 2 8 15 7 4 17
(5x + 8) 8 33 33 48 73 28 18 48 83 43 28 93
(5x + 8) mod 26 8 7 7 22 21 2 18 22 5 17 2 15

The final step in encrypting the message is to look up each numeric value in the table for the corresponding letters. In this example, the encrypted text would be IHHWVCSWFRCP. The table below shows the completed table for encrypting a message in the Affine cipher.

plaintext A F F I N E C I P H E R
x 0 5 5 8 13 4 2 8 15 7 4 17
(5x + 8) 8 33 33 48 73 28 18 48 83 43 28 93
(5x + 8) mod 26 8 7 7 22 21 2 18 22 5 17 2 15
ciphertext I H H W V C S W F R C P

### Decrypting

In this decryption example, the ciphertext that will be decrypted is the ciphertext from the encryption example. The corresponding decryption function is D(y) = 21(y − 8) mod 26, where $a^{-1}$ is calculated to be 21 (since 5 × 21 = 105 = 4 × 26 + 1) and b is 8. To begin, write the numeric equivalents to each letter in the ciphertext, as shown in the table below.

ciphertext I H H W V C S W F R C P
y 8 7 7 22 21 2 18 22 5 17 2 15

Now, the next step is to compute 21(y − 8) , and then take the remainder when that result is divided by 26. The following table shows the results of both computations.

ciphertext I H H W V C S W F R C P
y 8 7 7 22 21 2 18 22 5 17 2 15
21(y − 8) 0 −21 −21 294 273 −126 210 294 −63 189 −126 147
21(y − 8) mod 26 0 5 5 8 13 4 2 8 15 7 4 17

The final step in decrypting the ciphertext is to use the table to convert numeric values back into letters. The plaintext in this decryption is AFFINECIPHER. Below is the table with the final step completed.

ciphertext I H H W V C S W F R C P
y 8 7 7 22 21 2 18 22 5 17 2 15
21(y − 8) 0 −21 −21 294 273 −126 210 294 −63 189 −126 147
21(y − 8) mod 26 0 5 5 8 13 4 2 8 15 7 4 17
plaintext A F F I N E C I P H E R

### Entire alphabet encoded

To make encrypting and decrypting quicker, the entire alphabet can be encrypted to create a one-to-one map between the letters of the cleartext and the ciphertext. In this example, the one-to-one map would be the following:

letter in the cleartext A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
number in the cleartext 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
(5x + 8) mod 26 8 13 18 23 2 7 12 17 22 1 6 11 16 21 0 5 10 15 20 25 4 9 14 19 24 3
ciphertext letter I N S X C H M R W B G L Q V A F K P U Z E J O T Y D

### Programming examples

The following Python code can be used to encrypt text with the affine cipher:
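A minimal sketch of such a routine is shown below, reusing the letter table above (A = 0 through Z = 25). The function names `affine_encrypt` and `affine_decrypt` are illustrative, non-letter characters are simply dropped (an assumption that matches the worked example, where the space disappears), and `pow(a, -1, m)` assumes Python 3.8 or later.

```python
from math import gcd

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def affine_encrypt(plaintext, a, b):
    """Encrypt with E(x) = (a*x + b) mod m; non-letters are dropped."""
    m = len(ALPHABET)
    if gcd(a, m) != 1:
        raise ValueError("a must be coprime with the alphabet size")
    return "".join(
        ALPHABET[(a * ALPHABET.index(ch) + b) % m]
        for ch in plaintext.upper()
        if ch in ALPHABET
    )


def affine_decrypt(ciphertext, a, b):
    """Decrypt with D(y) = a_inv * (y - b) mod m."""
    m = len(ALPHABET)
    a_inv = pow(a, -1, m)  # modular inverse; ValueError if a is not coprime with m
    return "".join(
        ALPHABET[(a_inv * (ALPHABET.index(ch) - b)) % m]
        for ch in ciphertext.upper()
        if ch in ALPHABET
    )


ciphertext = affine_encrypt("AFFINE CIPHER", a=5, b=8)
print(ciphertext)                        # IHHWVCSWFRCP
print(affine_decrypt(ciphertext, 5, 8))  # AFFINECIPHER
```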

## Problem Set 6

ANSWER: Let (X) be a random variable that represents key (i) hashing to the first bucket.

Taking Expectation on both sides

Therefore, option 6 is correct.

ANSWER: Option 4 is correct. For the lower bound, if there is a violation of the search tree property, we might need to examine all of the nodes to find it (in the worst case). For the upper bound, we can determine search tree property by looking at all of the nodes.

ANSWER: Option 4 is correct. For the lower bound, note that a linear number of quantities need to be computed. For the upper bound, recursively compute the sizes of the left and right subtrees, and use the formula size(x) = 1 + size(y) + size(z) from lecture.
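The upper-bound argument can be sketched as a single post-order traversal that applies size(x) = 1 + size(y) + size(z) at each node. The `Node` class and the `compute_sizes` name below are hypothetical, chosen only for illustration.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        self.size = 0


def compute_sizes(x):
    """Set x.size for every node in the subtree rooted at x and return it."""
    if x is None:
        return 0
    # size(x) = 1 + size(left child) + size(right child); each node is visited once.
    x.size = 1 + compute_sizes(x.left) + compute_sizes(x.right)
    return x.size


root = Node(2, Node(1), Node(4, Node(3), Node(5)))
compute_sizes(root)
print(root.size, root.left.size, root.right.size)  # 5 1 3
```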

• The hash function should “spread out” every data set (across the buckets/slots of the hash table).
• The hash function should “spread out” most (i.e., “non-pathological”) data sets (across the buckets/slots of the hash table).
• The hash function should be easy to store (constant space or close to it).
• The hash function should be easy to compute (constant time or close to it).

ANSWER: Options 2, 3, and 4 are desirable properties of a good hash function. We can wish for option 1, but no hash function can achieve it, because every fixed hash function has pathological data sets on which it performs badly. It is therefore not expected of a well-designed hash function.

• Every red-black tree is also a relaxed red-black tree.
• The height of every relaxed red-black tree with \(n\) nodes is \(O(\log n)\).
• There is a relaxed red-black tree that is not also a red-black tree.
• Every binary search tree can be turned into a relaxed red-black tree (via some coloring of the nodes as black or red).

ANSWER: Option 1 is correct by definition.

Video lecture 13.4 proves that in a red-black tree with \(n\) nodes, there is a root-NULL path with at most \(\log_2(n + 1)\) black nodes, and thus at most \(2\log_2(n + 1)\) total nodes. Since a relaxed red-black tree may contain two red nodes for every black node, the total number of nodes on a root-NULL path is at most \(3\log_2(n + 1)\). Thus, the height is \(O(\log n)\), and therefore option 2 is correct.

Option 3 is correct simply because a regular red-black tree doesn’t allow two red nodes in a row, but the relaxed one does. So, any relaxed red-black tree with two red nodes in a row is not a regular red-black tree.

Option 4 is incorrect. Consider the following BST:

It can’t be turned into a relaxed red-black tree simply by coloring, because no matter how we color the nodes, the fourth invariant (all root-NULL paths must have the same number of black nodes) is violated. The path \(1 - \text{NULL}\) (going left from the root) has only one black node, namely the root itself (\(1\)), but there are at least two black nodes on the path \(1 - 4\).

ANSWER: Since the keys are hashed independently, the probability that a given pair of distinct keys both hash to one particular bucket is \(\frac{1}{m} \cdot \frac{1}{m}\). Since all \(m\) buckets are equally likely, the probability that the pair hashes to the same bucket, whichever it is, is \(m \cdot \frac{1}{m} \cdot \frac{1}{m} = \frac{1}{m}\). Therefore, option 2 is correct.

ANSWER: There are (inom<2>) pairs of distinct keys. By the previous problem, each pair has a (frac<1>) chance of colliding. Therefore, the expected number of pairs of distinct keys that collide is given by:

Therefore, option 5 is correct.
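The expectation can be checked by brute force on a tiny instance. The sketch below assumes \(n\) keys hashed independently and uniformly into \(m\) buckets (the symbol \(n\) for the number of keys is an assumption), enumerates every possible assignment, and compares the exact average number of colliding pairs with \(\frac{n(n-1)}{2m}\). The function name `expected_collisions` is illustrative.

```python
from fractions import Fraction
from itertools import combinations, product


def expected_collisions(n, m):
    """Exact expected number of colliding pairs for n keys and m buckets."""
    total = 0
    # Every assignment of keys to buckets is equally likely under uniform hashing.
    for assignment in product(range(m), repeat=n):
        total += sum(
            1
            for i, j in combinations(range(n), 2)
            if assignment[i] == assignment[j]
        )
    return Fraction(total, m ** n)


n, m = 3, 4
exact = expected_collisions(n, m)
formula = Fraction(n * (n - 1), 2 * m)
print(exact, formula)  # 3/4 3/4
```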

ANSWER: The probability of a false positive, \(\epsilon\), is approximately \((1 - e^{-k/b})^k\), where \(b\) is the number of bits per object and \(k\) is the number of hash functions. With \(b = 16\), \(k\) is approximated by \(0.693 \times b \approx 11\). Plugging that into the formula for \(\epsilon\), we get \((1 - e^{-11/16})^{11} \approx 0.0004\), which is about \(.04\%\). Therefore, option 2 is correct (option 1 is correct as well, but 2 is a stronger statement).
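The arithmetic can be reproduced with the standard library. This sketch assumes 16 bits per object, as the exponent in the answer suggests, and uses the approximation \(\epsilon \approx (1 - e^{-k/b})^k\) with \(k \approx b \ln 2\); the function name `bloom_false_positive` is hypothetical.

```python
import math


def bloom_false_positive(bits_per_object, num_hashes=None):
    """Return (k, approximate false positive rate) for a Bloom filter."""
    b = bits_per_object
    # Default to the usual choice k ≈ ln(2) * b, about 0.693 * b.
    k = round(math.log(2) * b) if num_hashes is None else num_hashes
    return k, (1 - math.exp(-k / b)) ** k


k, eps = bloom_false_positive(16)
print(k, eps)
```

The printed rate lies between 0.0004 and 0.0005, that is, a few hundredths of a percent.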

First, let's get into what skewed means versus uniform.

Here is an unskewed distribution that is not uniform. This is the standard normal bell curve.

Here is a skewed distribution ( $F_{5,5}$ ).

However, both distributions have values that they prefer. In the normal distribution, for instance, you would expect to get samples around 0 more than you would expect values around 2. Therefore, the distributions are not uniform. A uniform distribution would be something like how a die has a 1/6 chance of landing on each number.

I see your problem as being akin to checking if a die is biased towards particular numbers. In your first example, each number between 1 and 10 is equally represented. You have a uniform distribution on $\{1,2,3,4,5,6,7,8,9,10\}$ .

$P(X = 1) = P(X=2) = \cdots = P(X=9) = P(X=10) = \frac{1}{10}$

In your second example, you have some preference for 1 and 2 at the expense of 3.

Number of unique items has nothing to do with the uniformity.

What I think you want to do is test if your sample indicates a preference for particular numbers. If you roll a die 12 times and get $\{3,2,6,5,4,1,2,1,3,4,5,4\}$ , you'd notice that you have a slight preference for 4 at the expense of 6. However, you'd probably call this just luck of the draw and that if you did the experiment again, you'd be just as likely to get that 6 is preferred at the expense of some other number. The lack of uniformity is due to sampling variability (chance or luck of the draw, but nothing suggesting that the die lacks balance). Similarly, if you flip a coin four times and get HHTH, you probably won't think anything is fishy. That seems perfectly plausible for a fair coin.

However, what if you roll the die 12,000 or 12 billion times and still get a preference for 4 at the expense of 6, or you do billions of coin flips and find that heads is preferred 75% of the time? Then you'd start thinking that there is a lack of balance and that the lack of uniformity in your observations is not just due to random chance.

There is a statistical hypothesis test to quantify this. It's called Pearson's chi-squared test. The example on Wikipedia is pretty good. I'll summarize it here. It uses a die.

$H_0: P(X=1) = \cdots = P(X=6) = \frac{1}{6}$

This means that we are assuming equal probabilities of each face of the die and trying to find evidence suggesting that is false. This is called the null hypothesis.

Our alternative hypothesis is that $H_0$ is false, that some probability is not $\frac{1}{6}$ and the lack of uniformity in the observations is not due to chance alone.

We conduct an experiment of rolling the die 60 times. "The number of times it lands with 1, 2, 3, 4, 5, and 6 face up is 5, 8, 9, 8, 10, and 20, respectively."

For face 1, we would expect 10, but we got 5. This is a difference of 5. Then we square the difference to get 25. Then we divide by the expected number to get 2.5.

For face 2, we would expect 10, but we got 8. This is a difference of 2. Then we square the difference to get 4. Then we divide by the expected number to get 0.4.

Do the same for the remaining faces to get 0.1, 0.4, 0, and 10.

Now add up all of the values: $2.5 + 0.4 + 0.1 + 0.4 + 0 + 10 = 13.4$ . This is our test statistic. We test against a $\chi^2$ distribution with 5 degrees of freedom. We get five because there are six outcomes, and we subtract 1. Now we can get our p-value! The R command to do that is "pchisq(13.4,5,lower.tail=F)" (don't put the quotation marks in R). The result is about 0.02, meaning that there is only a 2% chance of getting this level of non-uniformity (or more) due to random chance alone. It is common to reject the null hypothesis when the p-value is less than 0.05, so at the 0.05-level, we can say that we reject the null hypothesis in favor of the alternative. However, if we want to test at the 0.01-level, we lack sufficient evidence to say that the die is biased.

Try this out for an experiment where you roll a die 180 times and get 1, 2, 3, 4, 5, and 6 in the amounts of 60, 15, 24, 24, 27, and 30, respectively. When I do this in R, I get a p-value of about $1.36 \times 10^{-7}$ (1.36090775991073e-07 is the printout).
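For readers who would rather check the by-hand computation outside R, here is a minimal Python sketch using only the standard library. The function names `chi2_statistic` and `chi2_sf` are hypothetical; `chi2_sf` builds the upper-tail probability from the closed forms for 1 and 2 degrees of freedom plus the standard recurrence that adds two degrees of freedom at a time, and it assumes equal expected counts for every face.

```python
import math


def chi2_statistic(observed):
    """Pearson's statistic against equal expected counts."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with df degrees of freedom."""
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # closed form for df = 1
    else:
        q, k = math.exp(-half), 2              # closed form for df = 2
    # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2) * math.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return q


rolls_60 = [5, 8, 9, 8, 10, 20]
stat_60 = chi2_statistic(rolls_60)
print(stat_60, chi2_sf(stat_60, len(rolls_60) - 1))

rolls_180 = [60, 15, 24, 24, 27, 30]
stat_180 = chi2_statistic(rolls_180)
print(stat_180, chi2_sf(stat_180, len(rolls_180) - 1))
```

The first line should show a statistic of about 13.4 with a p-value near 0.02, and the second a p-value on the order of $10^{-7}$, in line with the R results quoted above.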

## Importing and Exporting Optimizer Statistics

You can export and import optimizer statistics from the data dictionary to user-defined statistics tables. You can also copy statistics from one database to another database.

Importing and exporting are especially useful for testing an application using production statistics. You use DBMS_STATS to export schema statistics from a production database to a test database so that developers can tune execution plans in a realistic environment before deploying applications.

When you transport optimizer statistics between databases, you must use DBMS_STATS to copy the statistics to and from a staging table, and tools to make the table contents accessible to the destination database.

The following figure illustrates the process using Oracle Data Pump and ftp.

Figure 13-5 Transporting Optimizer Statistics

As shown in Figure 13-5, the basic steps are as follows:

1. In the production database, copy the statistics from the data dictionary to a staging table using DBMS_STATS.EXPORT_SCHEMA_STATS.
2. Export the statistics from the staging table to a .dmp file using Oracle Data Pump.
3. Transfer the .dmp file from the production host to the test host using a transfer tool such as ftp.
4. In the test database, import the statistics from the .dmp file to a staging table using Oracle Data Pump.
5. Copy the statistics from the staging table to the data dictionary using DBMS_STATS.IMPORT_SCHEMA_STATS.

### Transporting Optimizer Statistics to a Test Database

You can transport statistics using the DBMS_STATS.EXPORT_SCHEMA_STATS procedure.

Prerequisites and Restrictions

When preparing to export optimizer statistics, note the following:

Before exporting statistics, you must create a table to hold the statistics. The procedure DBMS_STATS.CREATE_STAT_TABLE creates the statistics table.

The optimizer does not use statistics stored in a user-owned table. The only statistics used by the optimizer are the statistics stored in the data dictionary. To make the optimizer use statistics in user-defined tables, import these statistics into the data dictionary using the DBMS_STATS import procedure.

The Data Pump Export and Import utilities export and import optimizer statistics from the database along with the table. When a column has system-generated names, Original Export ( exp ) does not export statistics with the data, but this restriction does not apply to Data Pump Export.

Exporting and importing statistics using DBMS_STATS is a distinct operation from using Data Pump Export and Import.

This tutorial assumes the following:

• You want to generate representative sh schema statistics on a production database and use DBMS_STATS to import them into a test database.
• Administrative user dba1 exists on both production and test databases.
• You intend to create table opt_stats to store the schema statistics.
• You intend to use Oracle Data Pump to export and import table opt_stats.

To generate schema statistics and import them into a separate database:

On the production host, start SQL*Plus and connect to the production database as administrator dba1.

Create a table to hold the production statistics.

For example, execute the following PL/SQL program to create user statistics table opt_stats :

For example, manually gather schema statistics as follows:

Use DBMS_STATS to export the statistics.

For example, retrieve schema statistics and store them in the opt_stats table created previously:

Use Oracle Data Pump to export the contents of the statistics table.

For example, run the expdp command at the operating system prompt:

Transfer the dump file to the test database host.

Log in to the test host, and then use Oracle Data Pump to import the contents of the statistics table.

For example, run the impdp command at the operating system prompt:

On the test host, start SQL*Plus and connect to the test database as administrator dba1.

Use DBMS_STATS to import statistics from the user statistics table and store them in the data dictionary.

The following PL/SQL program imports schema statistics from table opt_stats into the data dictionary:

See also:

• Oracle Database PL/SQL Packages and Types Reference to learn about the DBMS_STATS.CREATE_STAT_TABLE procedure
• Oracle Database PL/SQL Packages and Types Reference for an overview of the statistics transfer functions
• Oracle Database Utilities to learn about Oracle Data Pump

## 2.13 Random Forest Software in R

The oldest and most well known implementation of the Random Forest algorithm in R is the randomForest package. There are also a number of packages that implement variants of the algorithm, and in the past few years, there have been several “big data” focused implementations contributed to the R ecosystem as well.

Here is a non-comprehensive list:

Since there are so many different Random Forest implementations available, there have been several benchmarks to compare the performance of popular implementations, including implementations outside of R. A few examples:

### 2.13.1 randomForest

Authors: Fortran original by Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener.

• This package wraps the original Fortran code by Leo Breiman and Adele Cutler and is probably the most widely known/used implementation in R.
• Although it’s single-threaded, smaller forests can be trained in parallel by writing custom foreach or parallel code, then combined into a bigger forest using the randomForest::combine() function.
• Row weights are unimplemented (they have been on the wishlist for as long as I can remember).
• Uses CART trees split by Gini Impurity.
• Categorical predictors are allowed to have up to 53 categories.
• Multinomial response can have no more than 32 categories.
• Supports R formula interface (but I’ve read some reports that claim it’s slower when the formula interface is used).

### 2.13.2 caret method “parRF”

Backend: Fortran (wraps the randomForest package)

This is a wrapper for the randomForest package that parallelizes the tree building.

### 2.13.3 h2o

Authors: Jan Vitek, Arno Candel, H2O.ai contributors

• Distributed and parallelized computation on either a single node or a multi-node cluster.
• Automatic early stopping based on convergence of user-specified metrics to user-specified relative tolerance.
• Data-distributed, which means the entire dataset does not need to fit into memory on a single node.
• Uses histogram approximations of continuous variables for speedup.
• Uses squared error to determine optimal splits.
• Support for exponential-family distributions (Poisson, Gamma, Tweedie) in addition to binomial (Bernoulli), Gaussian, and multinomial distributions, as well as other loss functions such as quantile regression (including Laplace).
• Grid search for hyperparameter optimization and model selection.
• Model export in plain Java code for deployment in production environments.
• GUI for training & model eval/viz (H2O Flow).

Implementation details are presented in slide decks by Michal Malohlava and Jan Vitek.

### 2.13.4 Rborist

The Arborist provides a fast, open-source implementation of the Random Forest algorithm. The Arborist achieves its speed through efficient C++ code and parallel, distributed tree construction. This slidedeck provides detail about the implementation and vision of the project.

• Began as proprietary implementation, but was open-sourced and rewritten following dissolution of venture.
• The project is called “Arborist”, but the R package is called “Rborist”. A Python interface is in development.
• CPU based but a GPU version called Curborist (Cuda Rborist) is in development (unclear if it will be open source).
• Unlimited factor cardinality.
• Emphasizes multi-core but not multi-node.
• Both Python support and GPU support have been “coming soon” since summer 2015; the current status of these efforts is unclear.

### 2.13.5 ranger

Authors: Marvin N. Wright and Andreas Ziegler

Ranger is a fast implementation of random forest (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data. Classification, regression, probability estimation and survival forests are supported. Classification and regression forests are implemented as in the original Random Forest (Breiman 2001), survival forests as in Random Survival Forests (Ishwaran et al. 2008). For probability estimation forests see Malley et al. (2012).