Welcome to R for Loss Data Analytics. This site provides files that generate R code to support the online text Loss Data Analytics. Within the site, there are explanations that guide you through the specific R functions used in each code chunk, so that you can understand what the code is doing and why the function is needed. To use this site most effectively, it is helpful to learn and understand the following programming tools in R. With this background knowledge, you can adapt the given code to suit your own application.
Some tools you should know as you use this site; Introduction to R, a free course available on DataCamp, can help you understand them in detail. Other tools are helpful but optional: they are used throughout the site, but you do not need them to understand the code provided.
We suggest using R Markdown when writing a report. R Markdown is a package available in RStudio that allows you to record both your code and output in a single file. You can refer to the R Markdown documentation to learn more about this package.
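As a minimal sketch of getting started (the report file name here, my_report.Rmd, is hypothetical and not part of this site):

install.packages("rmarkdown")        # install the R Markdown package once
rmarkdown::render("my_report.Rmd")   # render an R Markdown report file; hypothetical file name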
This file contains illustrative R code for analyzing the Property Fund data. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go. This code uses the dataset PropertyFundInsample.csv.
Read in Property Fund data.
in_sample <- read.csv("Data/PropertyFundInsample.csv", header = T,
na.strings = c("."), stringsAsFactors = FALSE)
in_sample_2010 <- subset(in_sample, Year == 2010)
A few quick notes on these commands:
- `read.csv` reads a csv file and creates a data frame from it, with cases corresponding to rows and variables to columns in the file.
- The assignment operator `<-` is analogous to an equal sign in mathematics. The command `in_sample <- read.csv("Data/PropertyFundInsample.csv", header = T, na.strings = c("."), stringsAsFactors = FALSE)` means we give the name `in_sample` to the data read.
- The `subset()` function is used to select variables and observations. In this illustration, we selected observations from year 2010.
In 2010 there were 1,110 policyholders in the property fund.
Table 1.1 shows the distribution of the 1,377 claims.
Table 1.1
library(pander)
table <- as.data.frame(table(in_sample_2010$Freq))
names(table) <- c("Number of Claims", "Frequency")
pander(t(table))
Number of Claims | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Frequency | 707 | 209 | 86 | 40 | 18 | 12 | 9 | 4 | 6 | 1 | 3 | 2 |
Number of Claims | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 30 | 39 | 103 | 239 |
Frequency | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
The average number of claims for this sample was 1.24 (= 1377/1110). See Table 1.2 below.
Table 1.2
pander(summary(in_sample_2010$Freq))
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
0 | 0 | 0 | 1.241 | 1 | 239 |
A few quick notes on these commands:
- Many useful R functions come in packages, and to use these functions you have to install them. One way to install a package is with the command `install.packages("<the package's name>")`. In addition, to read more about a function, use the command `help("function name")`. A short example appears after this list.
- The `pander` function is used here to create nicer tables than regular R output. To use this function you need to download the `pander` package. For the normal R output in the illustration above, use the command `summary(in_sample_2010$Freq)`.
- The `names()` function is used to get or assign names of an object. In this illustration, we assigned `Number of Claims` and `Frequency` to the two columns in the data frame.
- The `t()` function is used to transpose a data frame or a matrix.
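For example, here is a minimal sketch for the pander package used above (run the install line only once per machine):

install.packages("pander")   # install the package (once)
library(pander)              # load it in the current session
help("pander")               # open the documentation page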
Table 1.3 summarizes the sample distribution of average severity from the 403 policyholders with at least one claim.
Table 1.3
in_sample_pos_2010 <- subset(in_sample_2010, yAvg > 0)
pander(summary(in_sample_pos_2010$yAvg))
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
166.7 | 2226 | 4951 | 56332 | 11900 | 12922218 |
length(in_sample_pos_2010$yAvg)
[1] 403
Note: The `length()` function returns the length (number of elements) of a vector or other object.
Figure 1.2 provides further information about the distribution of sample claims, showing a distribution that is dominated by a single large claim, so that the histogram is not very helpful. Even when removing the large claim, you will find a distribution that is skewed to the right (a quick check appears after the notes below). A generally accepted technique is to work with claims in logarithmic units, especially for graphical purposes; the corresponding figure in the right-hand panel is much easier to interpret.
Figure 1.2
par(mfrow = c(1, 2))
hist(in_sample_pos_2010$yAvg, main = "", xlab = "Average Claims")
hist(log(in_sample_pos_2010$yAvg), main = "", xlab = "Logarithmic Average Claims")
A few quick notes on these commands:
- The `par(mfrow = ...)` call is handy for creating a simple multi-paneled plot. `mfrow` is a vector of length 2, where the first element specifies the number of rows and the second the number of columns of plots.
- The `hist()` function computes a histogram of the given data values. You put the name of your dataset in between the parentheses of this function.
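As a quick check of the earlier claim about removing the largest observation (a hypothetical follow-up, not part of the original code):

y_trim <- in_sample_pos_2010$yAvg[-which.max(in_sample_pos_2010$yAvg)]  # drop the single largest average claim
hist(y_trim, main = "", xlab = "Average Claims (largest removed)")      # distribution is still right-skewed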
Earlier we considered a sample of 1,110 observations, which may seem like a lot. However, as we will see in our forthcoming applications, because of the preponderance of zeros and the skewed nature of claims, actuaries typically yearn for more data. One common approach that we adopt here is to examine outcomes from multiple years, thus increasing the sample size.
Table 1.4 shows that the average claim varies over time.
Table 1.4
library(doBy)
t_1a <- summaryBy(Freq ~ Year, data = in_sample,
FUN = function (x) { c(m = mean(x), num = length(x)) } )
t_1b <- summaryBy(yAvg ~ Year, data = in_sample,
FUN = function (x) { c(m = mean(x), num = length(x)) } )
t_1c <- summaryBy(BCcov ~ Year, data = in_sample,
FUN = function (x) { c(m = mean(x), num = length(x)) } )
table1_in <- cbind(t_1a[1], t_1a[2], t_1b[2], t_1c[2], t_1a[3])
names(table1_in) <- c("Year", "Average Freq", "Average Sev", "Average Coverage",
"No. of Policyholders")
pander(table1_in)
Year | Average Freq | Average Sev | Average Coverage | No. of Policyholders |
---|---|---|---|---|
2006 | 0.9515 | 9695 | 32498186 | 1154 |
2007 | 1.167 | 6544 | 35275949 | 1138 |
2008 | 0.9742 | 5311 | 37267485 | 1125 |
2009 | 1.219 | 4572 | 40355382 | 1112 |
2010 | 1.241 | 20452 | 41242070 | 1110 |
A few quick notes on these commands:
- The `summaryBy()` function provides summary statistics of a variable across different groups. You need to install the `doBy` package to use the command.
- The `cbind()` function combines vector, matrix, or data frame arguments by columns. The numbers of rows of the combined objects must be equal.
- The `c()` function combines its arguments to form a vector.
For a different look at this five-year sample, Table 1.5 summarizes the distribution of our two outcomes, frequency and claims amount. In each case, the average exceeds the median, suggesting that the distributions are right-skewed.
Table 1.5
BCcov.div1000 <- (in_sample$BCcov) / 1000
t_1 <- summaryBy(Freq ~ 1, data = in_sample,
FUN = function (x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) } )
names(t_1) <- c("Minimum", "Median", "Average", "Maximum")
t_2 <- summaryBy(yAvg ~ 1, data = in_sample,
FUN = function (x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) } )
names(t_2) <- c("Minimum", "Median", "Average", "Maximum")
t_3 <- summaryBy(Deduct ~ 1, data = in_sample,
FUN = function (x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) } )
names(t_3) <- c("Minimum", "Median", "Average", "Maximum")
t_4 <- summaryBy(BCcov.div1000 ~ 1, data = in_sample,
FUN = function (x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) } )
names(t_4) <- c("Minimum", "Median","Average", "Maximum")
table_2 <- rbind(t_1, t_2, t_3, t_4)
table_2a <- round(table_2, 3)
row_label <- rbind("Claim Frequency", "Claim Severity", "Deductible", "Coverage (000's)")
table_2aa <- cbind(row_label, as.matrix(table_2a))
pander(table_2aa)
Minimum | Median | Average | Maximum | |
---|---|---|---|---|
Claim Frequency | 0 | 0 | 1.109 | 263 |
Claim Severity | 0 | 0 | 9291.565 | 12922217.84 |
Deductible | 500 | 1000 | 3364.87 | 1e+05 |
Coverage (000’s) | 8.937 | 11353.566 | 37280.855 | 2444796.98 |
A few quick notes on these commands:
- The `rbind()` function combines vector, matrix, or data frame arguments by rows. The columns of the combined objects must match.
- The `round()` function rounds the values in its first argument to the specified number of decimal places (the default is 0).
Table 1.6 describes the rating variables considered in this chapter. To handle the skewness, we henceforth focus on logarithmic transformations of coverage and deductibles. See Table 1.6 below for the variables and their descriptions.
Table 1.6
des <- read.table(header = TRUE, text = '
Variable Description
"BCcov" "Total building and content coverage in dollars"
"Deduct" "Deductible in dollars"
"Entity Type" "Categorical variable that is one of six types:
(Village, City, County, Misc, School, or Town)"
"alarm_credit" "Categorical variable that is one of four types:
(0%, 5%, 10%, or 15%), for automatic smoke alarms in main rooms"
"NoClaimCredit" "Binary variable to indicate no claims in the past two years"
"Fire5" "Binary variable to indicate the fire class is below 5.
(The range of fire class is 0~10)" ')
pander(des)
Variable | Description |
---|---|
BCcov | Total building and content coverage in dollars |
Deduct | Deductible in dollars |
Entity Type | Categorical variable that is one of six types: (Village, City, County, Misc, School, or Town) |
alarm_credit | Categorical variable that is one of four types: (0%, 5%, 10%, or 15%), for automatic smoke alarms in main rooms |
NoClaimCredit | Binary variable to indicate no claims in the past two years |
Fire5 | Binary variable to indicate the fire class is below 5. (The range of fire class is 0~10) |
To get a sense of the relationship between the non-continuous rating variables and claims, Table 1.7 relates the claims outcomes to these categorical variables, showing a claims summary by entity type, fire class, and no claim credit.
Table 1.7
# Table 1.7
by_var_summ <- function (datasub) {
temp_a <- summaryBy(Freq ~ 1 , data = datasub,
FUN = function (x) { c(m = mean(x), num = length(x)) } )
datasub_1 <- subset(datasub, yAvg > 0)
temp_b <- summaryBy(yAvg ~ 1, data = datasub_1,
FUN = function (x) { c(m = mean(x)) } )
temp_c <- merge(temp_a, temp_b, all.x = T)[c(2, 1, 3)]
temp_c1 <- as.matrix(temp_c)
return(temp_c1)
}
datasub <- subset(in_sample, TypeVillage == 1);
t_1 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeCity == 1);
t_2 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeCounty == 1);
t_3 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeMisc == 1);
t_4 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeSchool == 1);
t_5 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeTown == 1);
t_6 <- by_var_summ(datasub)
datasub <- subset(in_sample, Fire5 == 0);
t_7 <- by_var_summ(datasub)
datasub <- subset(in_sample, Fire5 == 1);
t_8 <- by_var_summ(datasub)
datasub <- subset(in_sample, in_sample$NoClaimCredit == 0);
t_9 <- by_var_summ(datasub)
datasub <- subset(in_sample, in_sample$NoClaimCredit == 1);
t_10 <- by_var_summ(datasub)
t_11 <- by_var_summ(in_sample)
table_a <- rbind(t_1, t_2, t_3, t_4, t_5, t_6, t_7, t_8, t_9, t_10, t_11)
table_aa <- round(table_a, 3)
row_label <- rbind("Village", "City", "County", "Misc", "School",
"Town", "Fire5--No", "Fire5--Yes", "NoClaimCredit--No",
"NoClaimCredit--Yes", "Total")
table_4 <- cbind(row_label, as.matrix(table_aa))
pander(table_4)
Freq.num | Freq.m | yAvg.m | |
---|---|---|---|
Village | 1341 | 0.452 | 10645.206 |
City | 793 | 1.941 | 16924.035 |
County | 328 | 4.899 | 15453.206 |
Misc | 609 | 0.186 | 43036.076 |
School | 1597 | 1.434 | 64346.394 |
Town | 971 | 0.103 | 19831.048 |
Fire5–No | 2508 | 0.502 | 13935.421 |
Fire5–Yes | 3131 | 1.596 | 41421.263 |
NoClaimCredit–No | 3786 | 1.501 | 31365.085 |
NoClaimCredit–Yes | 1853 | 0.31 | 30498.714 |
Total | 5639 | 1.109 | 31206.155 |
Table 1.8 shows the claims summary by entity type and alarm credit.
Table 1.8
by_var_summ <- function(datasub) {
temp_a <- summaryBy(Freq ~ AC00 , data = datasub,
FUN = function(x) { c(m = mean(x), num = length(x)) } )
datasub_1 <- subset(datasub, yAvg > 0)
if (nrow(datasub_1) == 0) { n <- nrow(datasub)
return(c(0, 0, n))
} else
{
temp_b <- summaryBy(yAvg ~ AC00, data = datasub_1,
FUN = function(x) { c(m = mean(x)) } )
temp_c <- merge(temp_a, temp_b, all.x = T)[c(2, 4, 3)]
temp_c1 <- as.matrix(temp_c)
return(temp_c1)
}
}
alarm_c <- 1 * (in_sample$AC00 == 1) + 2 * (in_sample$AC05 == 1) +
3 * (in_sample$AC10 == 1) + 4 * (in_sample$AC15 == 1)
by_var_credit <- function (ACnum) {
datasub <- subset(in_sample, TypeVillage == 1 & alarm_c == ACnum);
t_1 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeCity == 1 & alarm_c == ACnum);
t_2 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeCounty == 1 & alarm_c == ACnum);
t_3 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeMisc == 1 & alarm_c == ACnum);
t_4 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeSchool == 1 & alarm_c == ACnum);
t_5 <- by_var_summ(datasub)
datasub <- subset(in_sample, TypeTown == 1 & alarm_c == ACnum);
t_6 <- by_var_summ(datasub)
datasub <- subset(in_sample, alarm_c == ACnum);
t_7 <- by_var_summ(datasub)
table_a <- rbind(t_1, t_2, t_3, t_4, t_5, t_6, t_7)
table_aa <- round(table_a, 3)
row_label <- rbind("Village", "City", "County", "Misc", "School", "Town","Total")
table_4 <- cbind(row_label, as.matrix(table_aa))
return(table_4)
}
table_4a <- by_var_credit(1) # claims summary by entity type and alarm credit == 00
table_4b <- by_var_credit(2) # claims summary by entity type and alarm credit == 05
table_4c <- by_var_credit(3) # claims summary by entity type and alarm credit == 10
table_4d <- by_var_credit(4) # claims summary by entity type and alarm credit == 15
pander(table_4a) # claims summary by entity type and alarm credit == 00
Freq.m | yAvg.m | Freq.num | |
---|---|---|---|
Village | 0.326 | 11077.997 | 829 |
City | 0.893 | 7575.979 | 244 |
County | 2.14 | 16012.719 | 50 |
Misc | 0.117 | 15122.127 | 386 |
School | 0.422 | 25522.708 | 294 |
Town | 0.083 | 25257.084 | 808 |
Total | 0.318 | 15118.491 | 2611 |
pander(table_4b) # claims summary by entity type and alarm credit == 05
Freq.m | yAvg.m | Freq.num | ||
---|---|---|---|---|
Village | 0.278 | 8086.057 | 54 | |
City | 2.077 | 4150.125 | 13 | |
t_3 | County | 0 | 0 | 1 |
Misc | 0.278 | 13063.933 | 18 | |
School | 0.41 | 14575.003 | 122 | |
Town | 0.194 | 3937.29 | 31 | |
Total | 0.431 | 10762.112 | 239 |
pander(table_4c) # claims summary by entity type and alarm credit == 10
Freq.m | yAvg.m | Freq.num | |
---|---|---|---|
Village | 0.5 | 8792.376 | 50 |
City | 1.258 | 8625.169 | 31 |
County | 2.125 | 11687.969 | 8 |
Misc | 0.077 | 3923.375 | 26 |
School | 0.488 | 11596.912 | 168 |
Town | 0.091 | 2338.06 | 44 |
Total | 0.517 | 10194.094 | 327 |
pander(table_4d) # claims summary by entity type and alarm credit == 15
Freq.m | yAvg.m | Freq.num | |
---|---|---|---|
Village | 0.725 | 10543.752 | 408 |
City | 2.485 | 20469.514 | 505 |
County | 5.513 | 15475.74 | 269 |
Misc | 0.341 | 87020.878 | 179 |
School | 2.008 | 85139.974 | 1013 |
Town | 0.261 | 9489.613 | 88 |
Total | 2.093 | 41458.312 | 2462 |
This file contains illustrative R code for computing important count distributions. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go.
This section shows how to compute and graph probability mass and distribution functions for the Poisson distribution.
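For reference, the probability mass function computed below by `dpois` is \(\Pr(N = n) = e^{-\lambda}\lambda^{n}/n!\), for \(n = 0, 1, 2, \ldots\)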
lambda <- 3
N <- seq(0, 20, 1)
# Get the probability mass function using "dpois"
( fn <- dpois(N, lambda) )
[1] 4.978707e-02 1.493612e-01 2.240418e-01 2.240418e-01 1.680314e-01
[6] 1.008188e-01 5.040941e-02 2.160403e-02 8.101512e-03 2.700504e-03
[11] 8.101512e-04 2.209503e-04 5.523758e-05 1.274713e-05 2.731529e-06
[16] 5.463057e-07 1.024323e-07 1.807629e-08 3.012715e-09 4.756919e-10
[21] 7.135379e-11
# Visualize the probability mass function
plot(N, fn, xlab = "n", ylab = "f(n)")
A few quick notes on these commands:

- The assignment operator `<-` is analogous to an equal sign in mathematics. The command `lambda <- 3` means to give the value 3 to the quantity lambda.
- `seq` is short-hand for sequence.
- `dpois` is a built-in command in R for generating the "density" (actually the mass) function of the Poisson distribution. Use the online help (`help("dpois")`) to learn more about this function.
- Enclosing a command in an open paren `(` and close paren `)` tells R to display the output of the calculation on the screen.
- `plot` is a very handy command for displaying results graphically.

# Get the cumulative distribution function using "ppois"
( Fn <- ppois(N, lambda) )
[1] 0.04978707 0.19914827 0.42319008 0.64723189 0.81526324 0.91608206
[7] 0.96649146 0.98809550 0.99619701 0.99889751 0.99970766 0.99992861
[13] 0.99998385 0.99999660 0.99999933 0.99999988 0.99999998 1.00000000
[19] 1.00000000 1.00000000 1.00000000
# Visualize the cumulative distribution function
plot(N, Fn, xlab = "n", ylab = "F(n)") # cdf
This section shows how to compute and graph probability mass and distribution functions for the negative binomial distribution. You will also learn how to plot two functions on the same graph.
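For reference, with size parameter \(r = \alpha\) and success probability \(prob = 1/(1+\theta)\) as used below in `dnbinom(N, alpha, prob)`, the probability mass function is \(\Pr(N = n) = \binom{n + r - 1}{n}\left(\frac{1}{1+\theta}\right)^{r}\left(\frac{\theta}{1+\theta}\right)^{n}\), for \(n = 0, 1, 2, \ldots\)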
alpha <- 3
theta <- 2
prob <- 1 / (1 + theta)
N <- seq(0, 30, 1)
# Get the probability mass function using "dnbinom"
( fn <- dnbinom(N, alpha, prob) )
[1] 3.703704e-02 7.407407e-02 9.876543e-02 1.097394e-01 1.097394e-01
[6] 1.024234e-01 9.104303e-02 7.803688e-02 6.503074e-02 5.298801e-02
[11] 4.239041e-02 3.339850e-02 2.597661e-02 1.998201e-02 1.522439e-02
[16] 1.150287e-02 8.627153e-03 6.428075e-03 4.761537e-03 3.508501e-03
[21] 2.572901e-03 1.878626e-03 1.366273e-03 9.900532e-04 7.150384e-04
[26] 5.148277e-04 3.696199e-04 2.646661e-04 1.890472e-04 1.347233e-04
[31] 9.580323e-05
# Visualize the probability mass function
plot(N, fn, xlab = "n", ylab = "f(n)") # pmf
# Plot different negative binomial distributions on the same figure
alpha_1 <- 3
alpha_2 <- 5
theta <- 2
prob <- 1 / (1 + theta)
fn_1 <- dnbinom(N, alpha_1, prob)
fn_2 <- dnbinom(N, alpha_2, prob)
plot(N, fn_1, xlab = "n", ylab = "f(n)")
lines(N, fn_2, col = "red", type = "p")
A couple of notes on these commands:

- A semi-colon `;` allows several commands to be placed on one line.
- `lines` is very handy for superimposing one graph on another.
- `col = "red"` tells R to use the color red when plotting symbols.

# Get the distribution function using "pnbinom"
( Fn <- pnbinom(N, alpha, prob) )
[1] 0.03703704 0.11111111 0.20987654 0.31961591 0.42935528 0.53177869
[7] 0.62282172 0.70085861 0.76588935 0.81887735 0.86126776 0.89466626
[13] 0.92064288 0.94062489 0.95584927 0.96735214 0.97597930 0.98240737
[19] 0.98716891 0.99067741 0.99325031 0.99512894 0.99649521 0.99748526
[25] 0.99820030 0.99871513 0.99908475 0.99934942 0.99953846 0.99967319
[31] 0.99976899
plot(N, Fn, xlab = "n", ylab = "F(n)") # cdf
This section shows how to compute and graph probability mass and distribution functions for the binomial distribution.
# Plot different binomial distributions on the same figure
size <- 30
prob <- 0.6
N <- seq(0, 30, 1)
fn <- dbinom(N, size, prob)
plot(N, fn, xlab = "n", ylab = "f(n)") # pdf
fn2 <- dbinom(N, size, 0.7)
lines(N, fn2, col = "red", type = "p")
# Get the distribution function using "pbinom"
( Fn <- pbinom(N, size, prob) )
[1] 1.152922e-12 5.303439e-11 1.181456e-09 1.697936e-08 1.769332e-07
[6] 1.424573e-06 9.222321e-06 4.932503e-05 2.222679e-04 8.563920e-04
[11] 2.853883e-03 8.301584e-03 2.123988e-02 4.811171e-02 9.705684e-02
[16] 1.753691e-01 2.854956e-01 4.215343e-01 5.689095e-01 7.085281e-01
[21] 8.237135e-01 9.059888e-01 9.564759e-01 9.828170e-01 9.943412e-01
[26] 9.984899e-01 9.996867e-01 9.999526e-01 9.999954e-01 9.999998e-01
[31] 1.000000e+00
plot(N, Fn, xlab = "n", ylab = "F(n)") # cdf
This section shows how to compute recursively a distribution in the (a,b,0) class. The specific example is a Poisson. However, by changing the values of a and b, you can use the same recursion for the negative binomial and binomial distributions, the other two members of the (a,b,0) class. (A sketch for the negative binomial case appears after the Poisson check below.)
lambda <- 3
a <- 0
b <- lambda
# This loop calculates the (a,b,0) recursive probabilities for the Poisson distribution
p <- rep(0, 20)
# Get the probability at n = 0 to start the recursive formula
p[1] <- exp(-lambda)
for (i in 1:19)
{
p[i+1] <- (a + b / i) * p[i] # Probability of i-th element using the ab0 formula
}
p
[1] 4.978707e-02 1.493612e-01 2.240418e-01 2.240418e-01 1.680314e-01
[6] 1.008188e-01 5.040941e-02 2.160403e-02 8.101512e-03 2.700504e-03
[11] 8.101512e-04 2.209503e-04 5.523758e-05 1.274713e-05 2.731529e-06
[16] 5.463057e-07 1.024323e-07 1.807629e-08 3.012715e-09 4.756919e-10
# Check using the "dpois" command
dpois(seq(0, 20, 1), lambda = 3)
[1] 4.978707e-02 1.493612e-01 2.240418e-01 2.240418e-01 1.680314e-01
[6] 1.008188e-01 5.040941e-02 2.160403e-02 8.101512e-03 2.700504e-03
[11] 8.101512e-04 2.209503e-04 5.523758e-05 1.274713e-05 2.731529e-06
[16] 5.463057e-07 1.024323e-07 1.807629e-08 3.012715e-09 4.756919e-10
[21] 7.135379e-11
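As the text notes, the same recursion works for the other (a,b,0) members. Here is a minimal sketch for a negative binomial with illustrative parameters r = 3 and beta = 2 (chosen here for illustration, not taken from the text); for this member, a = beta / (1 + beta), b = (r - 1) * beta / (1 + beta), and the starting value is p0 = (1 + beta)^(-r).

r <- 3
beta <- 2
a <- beta / (1 + beta)
b <- (r - 1) * beta / (1 + beta)
p <- rep(0, 20)
p[1] <- (1 + beta)^(-r)            # probability at n = 0 starts the recursion
for (i in 1:19)
{
  p[i + 1] <- (a + b / i) * p[i]   # same (a,b,0) recursion as above
}
# Check against the built-in negative binomial pmf
max(abs(p - dnbinom(0:19, size = r, prob = 1 / (1 + beta))))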
A couple of notes on these commands:

- `exp` is used for exponentials.
- The `for` loop is one of many ways of doing recursive calculations.

This section loads the SingaporeAuto.csv dataset and checks the names of the variables and the dimensions of the data. To have a glimpse at the data, the first 4 observations are listed.
Singapore <- read.csv("Data/SingaporeAuto.csv", quote = "", header = TRUE)
# Check the names and dimensions in the file and list the first 4 observations
names(Singapore)
[1] "SexInsured" "Female" "VehicleType" "PC" "Clm_Count"
[6] "Exp_weights" "LNWEIGHT" "NCD" "AgeCat" "AutoAge0"
[11] "AutoAge1" "AutoAge2" "AutoAge" "VAgeCat" "VAgecat1"
dim(Singapore) # check number of observations and variables in the data
[1] 7483 15
Singapore[1:4, ] # list the first 4 observations
SexInsured Female VehicleType PC Clm_Count Exp_weights LNWEIGHT NCD
1 U 0 T 0 0 0.6680356 -0.40341383 30
2 U 0 T 0 0 0.5667351 -0.56786326 30
3 U 0 T 0 0 0.5037645 -0.68564629 30
4 U 0 T 0 0 0.9144422 -0.08944106 20
AgeCat AutoAge0 AutoAge1 AutoAge2 AutoAge VAgeCat VAgecat1
1 0 0 0 0 0 0 2
2 0 0 0 0 0 0 2
3 0 0 0 0 0 0 2
4 0 0 0 0 0 0 2
attach(Singapore) # attach dataset
A few quick notes on these commands:

- The `names()` function is used to get or assign names of an object. In this illustration, it was used to get the variable names.
- The `dim()` function is used to retrieve or set the dimensions of an object.
- With the `attach()` function, variables in the dataset can be accessed by simply giving their names.

The table below gives the distribution of observed claims frequency. The `Clm_Count` variable is the number of automobile accidents per policyholder.
table(Clm_Count)
Clm_Count
0 1 2 3
6996 455 28 4
( n <- length(Clm_Count) ) # number of insurance policies
[1] 7483
Before maximizing, let us start by visualizing the logarithmic likelihood function. We will fit the claim counts for the Singapore data to the Poisson model. As an illustration, first assume that \(\lambda = 0.5\). The claim count, the likelihood, and its logarithmic version for five observations are
# Five typical observations
Clm_Count[2245:2249]
[1] 3 0 1 0 3
# Probabilities
dpois(Clm_Count[2245:2249], lambda = 0.5)
[1] 0.01263606 0.60653066 0.30326533 0.60653066 0.01263606
# Logarithmic probabilities
log(dpois(Clm_Count[2245:2249], lambda = 0.5))
[1] -4.371201 -0.500000 -1.193147 -0.500000 -4.371201
By hand, you can check that the sum of log likelihoods for these five observations is -10.9355492. In the same way, the sum of all 7483 observations is
sum(log(dpois(Clm_Count, lambda = 0.5)))
[1] -4130.591
Of course, this is only for the choice \(\lambda = 0.5\). The following code defines the log likelihood to be a function of \(\lambda\) and plots the function for several choices of \(\lambda\):
loglikPois <- function (parms){
# Defines the Poisson loglikelihood function
lambda = parms[1]
llk <- sum(log(dpois(Clm_Count, lambda)))
llk
}
lambdax <- seq(0, .2, .01)
loglike <- 0 * lambdax
for (i in 1:length(lambdax))
{
loglike[i] <- loglikPois(lambdax[i])
}
plot(lambdax, loglike)
If we had to guess from this plot, we might say that the log likelihood is maximized at a value of \(\lambda\) around 0.07.
From calculus, we know that the maximum likelihood estimator (mle) of the Poisson distribution parameter equals the average claim count. For our data, this is
mean(Clm_Count)
[1] 0.06989175
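For completeness, here is the calculus step behind that claim (a standard derivation, not shown in the original). With observed counts \(y_1, \ldots, y_n\), the Poisson log likelihood and its maximizer are
\[
\ell(\lambda) = \sum_{i=1}^{n} \left( -\lambda + y_i \log \lambda - \log y_i! \right),
\qquad
\frac{\partial \ell(\lambda)}{\partial \lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^{n} y_i = 0
\;\Longrightarrow\;
\hat{\lambda} = \bar{y} .
\]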
As an alternative, let us use an optimization routine nlminb
. Most optimization routines try to minimize functions instead of maximize them, so we first define the negative loglikelihood function.
negloglikPois <- function (parms){
# Defines the (negative) Poisson loglikelihood function
lambda <- parms[1]
llk <- -sum(log(dpois(Clm_Count, lambda)))
llk
}
ini.Pois <- 1
zop.Pois <- nlminb(ini.Pois, negloglikPois, lower = c(1e-6), upper = c(Inf))
print(zop.Pois) # In output, $par = MLE of lambda, $objective = - loglikelihood value
$par
[1] 0.06989175
$objective
[1] 1941.178
$convergence
[1] 0
$iterations
[1] 17
$evaluations
function gradient
23 20
$message
[1] "relative convergence (4)"
So the maximum likelihood estimate, zop.Pois$par = 0.0698918, is exactly the same as the value that we obtained by hand.
Because actuarial analysts calculate Poisson mle's so regularly, here is another way of doing the calculation, using the generalized linear model function glm.
count_poisson1 <- glm(Clm_Count ~ 1, poisson(link = log))
summary(count_poisson1)
Call:
glm(formula = Clm_Count ~ 1, family = poisson(link = log))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.3739 -0.3739 -0.3739 -0.3739 4.0861
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.66081 0.04373 -60.85 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2887.2 on 7482 degrees of freedom
Residual deviance: 2887.2 on 7482 degrees of freedom
AIC: 3884.4
Number of Fisher Scoring iterations: 6
( lambda_hat <- exp(count_poisson1$coefficients) )
(Intercept)
0.06989175
A few quick notes on these commands and results:

- The `glm()` function is used to fit generalized linear models. See `help(glm)` for other modeling options. In order to get the results, we use the `summary()` function.
- The `Call` portion of the output reminds us what model we ran and what options were specified.
- The `Deviance Residuals` portion shows the distribution of the deviance residuals for the individual cases used in the model.
- The Poisson mle is recovered by exponentiating the intercept, `lambda_hat <- exp(count_poisson1$coefficients)`.

In the same way, here is code for determining the maximum likelihood estimates for the negative binomial distribution.
dnb <- function (y, r, beta){
  # The negative binomial probability mass function, in terms of r and beta
gamma(y + r) / gamma(r) / gamma(y + 1) * (1 / (1 + beta))^r * (beta / (1 + beta))^y
}
loglikNB <- function (parms){
  # Defines the (negative) negative binomial loglikelihood function
r = parms[1]
beta = parms[2]
llk <- -sum(log(dnb(Clm_Count, r, beta)))
llk
}
ini.NB <- c(1, 1)
zop.NB <- nlminb(ini.NB, loglikNB, lower = c(1e-6, 1e-6), upper = c(Inf, Inf))
print(zop.NB) # In output, $par = (MLE of r, MLE of beta), $objective = - loglikelihood value
$par
[1] 0.87401622 0.07996624
$objective
[1] 1932.383
$convergence
[1] 0
$iterations
[1] 24
$evaluations
function gradient
30 60
$message
[1] "relative convergence (4)"
Two quick notes: the `dnb` function defines the negative binomial probability mass function by hand, and `nlminb` is again used to minimize the negative log likelihood, now over the two parameters r and beta.
This section shows how to check the adequacy of the Poisson and negative binomial models for the Singapore data.
First, note that the variance of the count data, 0.0757079, is greater than the mean value, 0.0698918. This suggests that the negative binomial model is preferred to the Poisson model.
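These two quantities can be reproduced directly from the data (a one-line check, assuming the Singapore data are still attached as above):

var(Clm_Count)    # sample variance of the claim counts, 0.0757079 per the text
mean(Clm_Count)   # sample mean, the Poisson mle 0.0698918 from before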
Second, we will compute the Pearson goodness-of-fit statistic.
The table below gives the distribution of fitted claims frequency using the Poisson distribution, \(n \times p_k\).
table_1p = cbind(n * (dpois(0, lambda_hat)),
n * (dpois(1, lambda_hat)),
n * (dpois(2, lambda_hat)),
n * (dpois(3, lambda_hat)),
n * (1 - ppois(3, lambda_hat)))
# or n * (1 - dpois(0,lambda_hat) - dpois(1, lambda_hat) -
# dpois(2, lambda_hat) - dpois(3, lambda_hat)))
actual <- data.frame(table(Clm_Count))[,2];
actual[5] <- 0 # assign 0 to claim counts greater than or equal to 4 in observed data
table_2p <- rbind(c(0, 1, 2, 3, "4+"), actual, round(table_1p, digits = 2))
rownames(table_2p) <- c("Number","Actual", "Estimated Using Poisson")
table_2p
[,1] [,2] [,3] [,4] [,5]
Number "0" "1" "2" "3" "4+"
Actual "6996" "455" "28" "4" "0"
Estimated Using Poisson "6977.86" "487.69" "17.04" "0.4" "0.01"
For goodness of fit, consider Pearson’s chi-square statistic below. The degrees of freedom (df) equals the number of cells minus one minus the number of estimated parameters.
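In symbols (the standard form of the statistic), with \(O_j\) the observed count and \(E_j\) the fitted count in cell \(j\),
\[
\chi^2 = \sum_{j} \frac{(O_j - E_j)^2}{E_j},
\]
which is what the code below computes with `diff * diff / table_1p`.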
# PEARSON GOODNESS-OF-FIT STATISTIC
diff = actual - table_1p
( Pearson_p <- sum(diff * diff / table_1p) )
[1] 41.98438
# p-value
1 - pchisq(Pearson_p, df = 5 - 1 - 1)
[1] 4.042861e-09
The large value of the goodness-of-fit statistic, 41.98438, and the small p-value indicate that there is a large difference between the actual counts and those anticipated under the Poisson model.
Here is another way of determining the maximum likelihood estimator of the negative binomial distribution.
library(MASS)
fm_nb <- glm.nb(Clm_Count ~ 1, link = log)
summary(fm_nb)
Call:
glm.nb(formula = Clm_Count ~ 1, link = log, init.theta = 0.8740189897)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.3667 -0.3667 -0.3667 -0.3667 3.4082
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.66081 0.04544 -58.55 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.874) family taken to be 1)
Null deviance: 2435.5 on 7482 degrees of freedom
Residual deviance: 2435.5 on 7482 degrees of freedom
AIC: 3868.8
Number of Fisher Scoring iterations: 1
Theta: 0.874
Std. Err.: 0.276
2 x log-likelihood: -3864.767
With these new estimates (or you could use the general procedure we introduced earlier), we can produce a table of counts and fitted counts and use this to calculate the goodness-of-fit statistic.
fm_nb$theta
[1] 0.874019
beta <- exp(fm_nb$coefficients) / fm_nb$theta
prob <- 1/(1+beta)
table_1nb = cbind(n * (dnbinom(0, size = fm_nb$theta, prob)),
n * (dnbinom(1, size = fm_nb$theta, prob)),
n * (dnbinom(2, size = fm_nb$theta, prob)),
n * (dnbinom(3, size = fm_nb$theta, prob)),
n * (dnbinom(4, size = fm_nb$theta, prob)))
table_2nb <- rbind(c(0, 1, 2, 3, "4+"), actual, round(table_1nb, digits = 2))
rownames(table_2nb) <- c("Number","Actual", "Estimated Using Neg Bin")
table_2nb
[,1] [,2] [,3] [,4] [,5]
Number "0" "1" "2" "3" "4+"
Actual "6996" "455" "28" "4" "0"
Estimated Using Neg Bin "6996.4" "452.78" "31.41" "2.23" "0.16"
# PEARSON GOODNESS-OF-FIT STATISTIC
diff = actual - table_1nb
( Pearson_nb = sum(diff * diff / table_1nb) )
[1] 1.95024
# p-value
1 - pchisq(Pearson_nb, df = 5 - 2 - 1)
[1] 0.3771472
The small value of the goodness-of-fit statistic, 1.95024, and the high p-value, 0.3771472, both indicate that the negative binomial provides a better fit to the data than the Poisson.
This file contains illustrative R code for working with commonly used severity distributions. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go.
library(actuar)
library(VGAM)
This section demonstrates the effect of the shape and scale parameters on the gamma density.
The graph shows the gamma density functions with varying shape parameters \((\alpha)\)
# Example 1: gamma distribution
# Define a grid
x <- seq(0, 1000, by = 1)
# Define a set of scale and shape parameters
scale_param <- seq(100, 250, by = 50)
shape_param <- 2:5
# Varying the shape parameter
plot(x, dgamma(x, shape = shape_param[1], scale = 100), type = "l", ylab = "Gamma Density")
for (k in 2:length(shape_param)) {
lines(x, dgamma(x, shape = shape_param[k], scale = 100), col = k)
}
legend("topright", c(expression(alpha ~ '= 2'), expression(alpha ~ '= 3'),
expression(alpha ~ '= 4'), expression(alpha ~ '= 5')),
lty = 1, col = 1:4)
title(substitute(paste("Pdf Gamma Density with"," ",theta," = 100"," ",
"and Varying Shape")))
A few quick notes on these commands:

- `seq` is short-hand for sequence.
- The `dgamma` function gives the density of the gamma distribution with the specified shape and scale parameters.
- `plot` is a very handy command for displaying results graphically.
- The `lines()` function is used to add plots to an already existing graph.
- The `legend` function can be used to add legends to plots.

The graph shows the gamma density functions with varying scale parameters \((\theta)\).
plot(x, dgamma(x, shape = 2, scale = scale_param[1]), type = "l", ylab = "Gamma Density")
for (k in 2:length(scale_param)) {
lines(x, dgamma(x, shape = 2, scale = scale_param[k]), col = k)
}
legend("topright", c(expression(theta ~ '= 100'), expression(theta ~ '= 150'),
expression(theta ~'= 200'), expression(theta ~ '= 250')),
lty = 1, col = 1:4)
title(substitute(paste("Pdf Gamma Density with"," ", alpha, " = 2", " ",
"and Varying Scale")))
This section demonstrates the effect of the shape and scale parameters on the Pareto density function.
The graph shows the Pareto density functions with varying shape parameters \((\alpha)\)
z <- seq(0, 3000, by = 1)
scale_param <- seq(2000, 3500, 500)
shape_param <- 1:4
# Varying the shape parameter
plot(z, dparetoII(z, loc = 0, shape = shape_param[1], scale = 2000),
ylim = c(0, 0.002),type = "l", ylab = "Pareto Density")
for (k in 2:length(shape_param)) {
lines(z, dparetoII(z,loc = 0, shape = shape_param[k], scale = 2000), col = k)
}
legend("topright", c(expression(alpha ~ '= 1'), expression(alpha ~ '= 2'),
expression(alpha ~ '= 3'), expression(alpha ~ '= 4')),
lty = 1, col = 1:4)
title(substitute(paste("Pdf Pareto Density with"," ",theta," = 2000"," ",
"and Varying Shape")))
The graph shows the Pareto density functions with varying scale parameters \((\theta)\)
plot(z, dparetoII(z, loc = 0, shape = 3, scale = scale_param[1]),
type = "l", ylab = "Pareto Density")
for (k in 2:length(scale_param)) {
lines(z, dparetoII(z, loc = 0, shape = 3, scale = scale_param[k]), col = k)
}
legend("topright", c(expression(theta ~ '= 2000'), expression(theta ~ '= 2500'),
expression(theta ~ '= 3000'), expression(theta ~ '= 3500')),
lty = 1, col = 1:4)
title(substitute(paste("Pdf Pareto Density with"," ",alpha," = 3"," ",
"and Varying Scale")))
This section demonstrates the effect of the shape and scale parameters on the Weibull density function.
The graph shows the Weibull density function with varying shape parameters \((\alpha)\)
z <- seq(0, 400, by = 1)
scale_param <- seq(50, 200, 50)
shape_param <- seq(1.5, 3, 0.5)
# Varying the shape parameter
plot(z, dweibull(z, shape = shape_param[1], scale = 100),
ylim = c(0, 0.012), type = "l", ylab = "Weibull Density")
for (k in 2:length(shape_param)) {
lines(z, dweibull(z, shape = shape_param[k], scale = 100), col = k)
}
legend("topright", c(expression(alpha ~ '= 1.5'), expression(alpha ~ '= 2'),
expression(alpha ~'= 2.5'), expression(alpha ~ '= 3')),
lty = 1, col = 1:4)
title(substitute(paste("Pdf Weibull Density with"," ",theta," = 100"," ",
"and Varying Shape")))
The graph shows the Weibull density function with varying scale parameters \((\theta)\)
plot(z, dweibull(z, shape = 3, scale = scale_param[1]),
type = "l", ylab = "Weibull Density")
for(k in 2:length(scale_param)){
lines(z,dweibull(z, shape = 3, scale = scale_param[k]), col = k)
}
legend("topright", c(expression(theta ~ '= 50'), expression(theta ~ '= 100'),
expression(theta ~ '= 150'), expression(theta ~ '= 200')),
lty = 1, col = 1:4)
title(substitute(paste("Pdf Weibull Density with", " ", alpha, " = 3", " ",
"and Varying Scale")))
This section demonstrates the effect of the shape and scale parameters on the GB2 density function.
The graph shows the GB2 density function with varying scale parameter \((\theta)\)
# Example 4:GB2
gb2_density <- function (x, shape_1, shape_2, shape_3, scale){
mu <- log(scale)
sigma <- 1 / shape_3
xt <- (log(x) - mu) / sigma
  logexpxt <- ifelse(xt > 23, xt, log(1 + exp(xt)))  # use xt itself when exp(xt) would overflow
logdens <- shape_1 * xt - log(sigma) - log(beta(shape_1, shape_2)) -
(shape_1 + shape_2) * logexpxt - log(x)
exp(logdens)
}
x <- seq(0, 400, by = 1)
alpha_1 <- 5
alpha_2 <- 4
gamma <- 2
theta <- seq(150, 250, 50)
# Varying the scale parameter
plot(x,
gb2_density(x, shape_1 = alpha_1, shape_2 = alpha_2, shape_3 = gamma,
scale = theta[1]), type = "l", ylab = "Gen Beta 2 Density",
main = expression(paste("GB2 Density with ", alpha[1], " = 5, ",
alpha[2], " = 4, ", alpha[3], " = 2, ",
"and Varying Scale (", theta, ") Parameters")),
cex.main = 0.95 )
for(k in 2:length(theta)){
lines(x, gb2_density(x, shape_1 = alpha_1, shape_2 = alpha_2, shape_3 = gamma,
scale = theta[k]), col = k)
}
legend("topleft", c(expression(theta ~ '= 150'), expression(theta ~ '= 200'),
expression(theta ~ '= 250')), lty = 1, cex = 0.6, col = 1:3)
Note: here we wrote our own function for the GB2 density.
This section shows some of the methods of creating new distributions.
The graph below shows a density function created as a mixture of two gamma distributions.
# Example 5: A mixed density
# Specify density of a mixture of 2 gamma distributions
mixture_gamma_density <- function (x, a_1, a_2, alpha_gamma1, theta_gamma1, alpha_gamma2,
theta_gamma2){
a_1 * dgamma(x, shape = alpha_gamma1, scale = theta_gamma1) + a_2 *
dgamma(x, shape = alpha_gamma2, scale = theta_gamma2)
}
w <- 1:30000 / 100
a_1 <- 0.5
a_2 <- 0.5
alpha_1 <- 4
theta_1 <- 7
alpha_2 <- 15
theta_2 <- 7
mix_gamma_dens <- mixture_gamma_density(w, a_1, a_2, alpha_1, theta_1, alpha_2, theta_2)
plot(w, mix_gamma_dens, type = "l")
The graph below shows a density function through splicing by combining an exponential distribution on \((0,c)\) with a Pareto distribution on \((c,\infty)\)
# Example 6: density obtained through splicing
# Combine an Exp on (0,c) with a Pareto on (c,\infty)
splice_exp_par <- function (x, c, v, theta, gamma, alpha){
if (0 <= x & x < c) {return(v * dexp(x, 1 / theta) / pexp(c, 1 / theta))} else
if (x >= c) {return((1 - v) * dparetoII(x, loc = 0, shape = alpha, scale = theta) /
(1 - pparetoII(x, loc = 0, shape = alpha, scale = theta)))}
}
x <- t(as.matrix(1:2500 / 10))
splice_values <- apply(x, 2, splice_exp_par, c = 100, v = 0.6,
theta = 100, gamma = 200, alpha = 4)
plot(x, splice_values, type = 'l')
The actuar package provides functions for dealing with coverage modifications. In the following sections we will examine the functionality of the coverage function.
library(actuar)
This section plots the modified probability density functions due to deductibles for the payment per loss and payment per payment random variables.
Let \(X\) be the random variable for loss size. The random variable for the payment per loss with deductible \(d\) is \(Y^L=(X-d)_+\). The plot of the modified probability density function is below.
f <- coverage(dgamma, pgamma, deductible = 1, per.loss = TRUE) # create the object
mode(f) # it's a function. Here deductible is 1
[1] "function"
# Check the pdf for Y^L at 0 and the original loss at 1
f(0, 3) # mass at 0
[1] 0.0803014
pgamma(0 + 1, 3) # idem
[1] 0.0803014
curve(dgamma(x, 3), from = 0, to = 10, ylim = c(0, 0.3), lwd = 1, col = "gray") # original
curve(dgamma(x, 3), from = 1, to = 10, ylim = c(0, 0.3), lwd = 2, add = TRUE)
curve(f(x, 3), from = 0.01, col = "blue", add = TRUE, lwd = 2) # modified
points(0, f(0, 3), pch = 16, col = "blue")
legend("topright", c("Original pdf", "Modified pdf"),
lty = 1, cex = 0.6, col = c("black","blue"))
A few quick notes on these commands:

- The `coverage()` function computes the probability density function or cumulative distribution function of the payment per payment or payment per loss random variable under any combination of the following coverage modifications: deductible, limit, coinsurance, inflation. In this illustration, we used it to compute the probability density function of the payment per loss random variable with a deductible of 1.
- `f(0, 3)` calculates the pdf when the payment per loss variable is 0, with gamma parameters shape = 3 and rate = 1. Because we used a deductible of 1, this should be equal to `pgamma(0 + 1, 3)`.

This section considers the payment per payment random variable \(Y^P\), with pdf \(f_{Y^P}(y) = f_X(y+d)/S_X(d)\). The plot of the modified probability density function is below.
f <- coverage(dgamma, pgamma, deductible = 1) # create the object
f(0, 3) # calculate in x = 0, shape = 3, rate = 1
[1] 0
f(5, 3) # calculate in x = 5, shape = 3, rate = 1
[1] 0.04851322
dgamma(5 + 1, 3) / pgamma(1, 3, lower = FALSE) # DIY
[1] 0.04851322
curve(dgamma(x, 3), from = 0, to = 10, ylim = c(0, 0.3),
lwd = 1, col = "gray") # original pdf
curve(dgamma(x, 3), from = 1, to = 10, ylim = c(0, 0.3), add = TRUE, lwd = 2)
curve(f(x, 3), from = 0.01, col = "blue",
add = TRUE, lwd = 2) # modified pdf
legend("topright", c("Original pdf", "Modified pdf"),
lty = 1, cex = 0.6, col = c("black","blue"))
f <- coverage(dgamma, pgamma, deductible = 1, limit = 100,
coinsurance = 0.9, inflation = 0.05)
# create the object
f(0, 3) # calculate in x = 0, shape = 3, rate = 1
[1] 0
f(5, 3) # calculate in x = 5, shape = 3, rate = 1
[1] 0.0431765
curve(dgamma(x, 3), from = 0, to = 10, ylim = c(0, 0.3),
lwd = 1, col = "gray") # original pdf
curve(dgamma(x, 3), from = 1, to = 10, ylim = c(0, 0.3), add = TRUE, lwd = 2)
curve(f(x, 3), from = 0.01, col = "blue", add = TRUE, lwd = 2) # modified pdf
legend("topright", c("Original pdf", "Modified pdf"),
lty = 1, cex = 0.6, col = c("black", "blue"))
A policy with a franchise deductible of \(d\) pays nothing if the loss is no greater than \(d\), and pays the full amount of the loss if it is greater than \(d\). This section plots the pdf for the per payment and per loss random variable.
# Franchise deductible
# Per loss variable
f <- coverage(dgamma, pgamma, deductible = 1, per.loss = TRUE, franchise = TRUE)
f(0, 3) # mass at 0
[1] 0.0803014
pgamma(1, 3) # idem
[1] 0.0803014
f(0.5, 3) # 0 < x < 1
[1] 0
f(1, 3) # x = 1
[1] 0
f(5, 3) # x > 1
[1] 0.08422434
dgamma(5, 3)
[1] 0.08422434
curve(dgamma(x, 3), from = 0, to = 10, ylim = c(0, 0.3)) # original
curve(f(x, 3), from = 1.1, col = "blue", add = TRUE) # modified
points(0, f(0, 3), pch = 16, col = "blue") # mass at 0
curve(f(x, 3), from = 0.1, to = 1, col = "blue", add = TRUE) # 0 < x < 1
legend("topright", c("Original pdf", "Modified pdf"),
lty = 1, cex = 0.6, col = c("black", "blue"))
Note: to use the franchise deductible, we have to add the option `franchise = TRUE` in the coverage function.
# Franchise deductible
# Per payment variable
f <- coverage(dgamma, pgamma, deductible = 1, franchise = TRUE)
f(0, 3) # x = 0
[1] 0
f(0.5, 3) # 0 < x < 1
[1] 0
f(1, 3) # x = 1
[1] 0
f(5, 3) # x > 1
[1] 0.09157819
dgamma(5, 3) / pgamma(1, 3, lower = FALSE) # idem
[1] 0.09157819
curve(dgamma(x, 3), from = 0, to = 10, ylim = c(0, 0.3)) # original
curve(f(x, 3), from = 1.1, col = "blue", add = TRUE) # modified
curve(f(x, 3), from = 0, to = 1, col = "blue", add = TRUE) # 0 < x < 1
legend("topright", c("Original pdf", "Modified pdf"),
lty = 1, cex = 0.6, col = c("black", "blue"))
This file contains illustrative R code for summarizing and fitting claim severity distributions. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go. This code uses the dataset CLAIMLEVEL.csv.
This section summarizes claims from the property fund for year 2010 and plots the data.
The results below consider individual claims from the property fund for year 2010.
## Read in data and get number of claims.
claim_lev <- read.csv("Data/CLAIMLEVEL.csv", header = TRUE)
nrow(claim_lev) # 6258
[1] 6258
# 2010 subset
claim_data <- subset(claim_lev, Year == 2010);
length(unique(claim_data$PolicyNum)) # 403 unique policyholders
[1] 403
n_tot <- nrow(claim_data) # 1377 individual claims
n_tot
[1] 1377
# As an alternative, you can simulate claims
# n_tot <- 13770
# alpha_hat <- 2
# theta_hat <- 100
# claim <- rgamma(n_tot, shape = alpha_hat, scale = theta_hat)
# claim <- rparetoII(n_tot, loc = 0, shape = alpha_hat, scale = theta_hat)
# GB2
# claim <- theta_hat * rgamma(n_tot, shape = alpha_hat, scale = 1) /
# rgamma(n_tot, shape = 1, scale = 1)
# claim_data <- data.frame(claim)
###################################################
The output below provides a summary of the claims data for 2010, along with a summary in logarithmic units.
# Summarizing the claim data for 2010
summary(claim_data$Claim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 789 2250 26623 6171 12922218
sd(claim_data$Claim)
[1] 368029.7
# Summarizing logarithmic claims for 2010
summary(log(claim_data$Claim))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 6.670 7.719 7.804 8.728 16.374
sd(log(claim_data$Claim))
[1] 1.683297
The plots below provide further information about the distribution of sample claims.
# Histogram
par(mfrow = c(1, 2))
hist(claim_data$Claim, main="", xlab = "Claims")
hist(log(claim_data$Claim), main = "", xlab = "Logarithmic Claims")
# dev.off()
This section shows how to fit basic distributions to a data set.
The results below assume that the data follow a lognormal distribution and use the VGAM library for estimation of the parameters.
# Inference assuming a lognormal distribution
# First, take the log of the data and assume normality
y <- log(claim_data$Claim)
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 6.670 7.719 7.804 8.728 16.374
sd(y)
[1] 1.683297
# Confidence intervals and hypothesis test
t.test(y, mu = log(5000)) # H0: mu_o = log(5000) = 8.517
One Sample t-test
data: y
t = -15.717, df = 1376, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 8.517193
95 percent confidence interval:
7.715235 7.893208
sample estimates:
mean of x
7.804222
# Mean of the lognormal distribution
exp(mean(y) + sd(y)^2 / 2)
[1] 10106.82
mean(claim_data$Claim)
[1] 26622.59
# Alternatively, assume that the data follow a lognormal distribution
# Use "VGAM" library for estimation of parameters
library(VGAM)
fit.LN <- vglm(Claim ~ 1, family = lognormal, data = claim_data)
summary(fit.LN)
Call:
vglm(formula = Claim ~ 1, family = lognormal, data = claim_data)
Pearson residuals:
Min 1Q Median 3Q Max
meanlog -4.6380 -0.6740 -0.05083 0.5487 5.093
loge(sdlog) -0.7071 -0.6472 -0.44003 0.1135 17.636
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 7.80422 0.04535 172.10 <2e-16 ***
(Intercept):2 0.52039 0.01906 27.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of linear predictors: 2
Names of linear predictors: meanlog, loge(sdlog)
Log-likelihood: -13416.87 on 2752 degrees of freedom
Number of iterations: 3
No Hauck-Donner effect found in any of the estimates
coef(fit.LN) # coefficients
(Intercept):1 (Intercept):2
7.8042218 0.5203908
confint(fit.LN, level = 0.95) # confidence intervals for model parameters
2.5 % 97.5 %
(Intercept):1 7.7153457 7.8930978
(Intercept):2 0.4830429 0.5577387
logLik(fit.LN) # loglikelihood for lognormal
[1] -13416.87
AIC(fit.LN) # AIC for lognormal
[1] 26837.74
BIC(fit.LN) # BIC for lognormal
[1] 26848.2
vcov(fit.LN) # covariance matrix for model parameters
(Intercept):1 (Intercept):2
(Intercept):1 0.002056237 0.0000000000
(Intercept):2 0.000000000 0.0003631082
# Mean of the lognormal distribution
exp(mean(y) + sd(y)^2 / 2)
[1] 10106.82
exp(coef(fit.LN))
(Intercept):1 (Intercept):2
2450.927448 1.682685
A few quick notes on these commands:

- The `t.test()` function can be used for a variety of t-tests. In this illustration, it was used to test \(H_0: \mu_0 = \log(5000) = 8.517\).
- The `vglm()` function is used to fit vector generalized linear models (VGLMs). See `help(vglm)` for other modeling options.
- The `coef()` function returns the estimated coefficients from the `vglm` or other modeling functions.
- The `confint` function provides the confidence intervals for model parameters.
- The `logLik` function provides the log-likelihood value of the lognormal fit from the `vglm` or other modeling functions.
- `AIC()` and `BIC()` return Akaike's Information Criterion and the Bayesian (Schwarz) Information Criterion for the fitted lognormal model: \(\text{AIC} = -2 \times \text{loglikelihood} + 2 \times \text{npar}\), where npar represents the number of parameters in the fitted model, and \(\text{BIC} = -2 \times \text{loglikelihood} + \log(n) \times \text{npar}\), where \(n\) is the number of observations.
- `vcov()` returns the covariance matrix for the model parameters.

The results below assume that the data follow a gamma distribution and use the VGAM library for estimation of the parameters.
# Inference assuming a gamma distribution
# install.packages("VGAM")
library(VGAM)
fit.gamma <- vglm(Claim ~ 1, family = gamma2, data = claim_data)
summary(fit.gamma)
Call:
vglm(formula = Claim ~ 1, family = gamma2, data = claim_data)
Pearson residuals:
Min 1Q Median 3Q Max
loge(mu) -0.539 -0.5231 -0.4935 -0.4141 261.117
loge(shape) -153.990 -0.1024 0.2335 0.4969 0.772
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 10.18952 0.04999 203.82 <2e-16 ***
(Intercept):2 -1.23582 0.03001 -41.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of linear predictors: 2
Names of linear predictors: loge(mu), loge(shape)
Log-likelihood: -14150.59 on 2752 degrees of freedom
Number of iterations: 13
No Hauck-Donner effect found in any of the estimates
coef(fit.gamma) # This uses a different parameterization
(Intercept):1 (Intercept):2
10.189515 -1.235822
( theta <- exp(coef(fit.gamma)[1]) / exp(coef(fit.gamma)[2]) ) # theta = mu / alpha
(Intercept):1
91613.78
( alpha <- exp(coef(fit.gamma)[2]) )
(Intercept):2
0.2905959
plot(density(log(claim_data$Claim)), main = "", xlab = "Log Expenditures")
x <- seq(0, 15, by = 0.01)
fgamma_ex <- dgamma(exp(x), shape = alpha, scale = theta) * exp(x)
lines(x, fgamma_ex, col = "blue")
confint(fit.gamma, level = 0.95) # confidence intervals for model parameters
2.5 % 97.5 %
(Intercept):1 10.091533 10.287498
(Intercept):2 -1.294648 -1.176995
logLik(fit.gamma) # loglikelihood for gamma
[1] -14150.59
AIC(fit.gamma) # AIC for gamma
[1] 28305.17
BIC(fit.gamma) # BIC for gamma
[1] 28315.63
vcov(fit.gamma) # covariance matrix for model parameters
(Intercept):1 (Intercept):2
(Intercept):1 0.002499196 0.0000000000
(Intercept):2 0.000000000 0.0009008397
# Here is a check on the formulas
# AIC using formula : -2 * (loglik) + 2 * (number of parameters)
-2 * (logLik(fit.gamma)) + 2 * (length(coef(fit.gamma)))
[1] 28305.17
# BIC using formula : -2 * (loglik) + (number of parameters) * (log(n))
-2 * (logLik(fit.gamma)) + length(coef(fit.gamma, matrix = TRUE)) * log(nrow(claim_data))
[1] 28315.63
# Alternatively, we could fit a gamma distribution using glm
library(MASS)
fit.gamma_2 <- glm(Claim ~ 1, data = claim_data, family = Gamma(link = log))
summary(fit.gamma_2, dispersion = gamma.dispersion(fit.gamma_2))
Call:
glm(formula = Claim ~ 1, family = Gamma(link = log), data = claim_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.287 -2.258 -1.764 -1.178 30.926
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 10.18952 0.04999 203.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 3.441204)
Null deviance: 6569.1 on 1376 degrees of freedom
Residual deviance: 6569.1 on 1376 degrees of freedom
AIC: 28414
Number of Fisher Scoring iterations: 14
( theta <- exp(coef(fit.gamma_2)) * gamma.dispersion(fit.gamma_2) ) #theta = mu / alpha
(Intercept)
91613.78
( alpha <- 1 / gamma.dispersion(fit.gamma_2) )
[1] 0.2905959
logLik(fit.gamma_2) # log - likelihood slightly different from vglm
'log Lik.' -14204.77 (df=2)
AIC(fit.gamma_2) # AIC
[1] 28413.53
BIC(fit.gamma_2) # BIC
[1] 28423.99
Note: the output from `coef(fit.gamma)` uses the parameterization \(\mu = \theta \alpha\), with `coef(fit.gamma)[1]` \(= \log(\mu)\) and `coef(fit.gamma)[2]` \(= \log(\alpha)\). This implies \(\alpha =\) `exp(coef(fit.gamma)[2])` and \(\theta = \mu / \alpha =\) `exp(coef(fit.gamma)[1]) / exp(coef(fit.gamma)[2])`.
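As a quick check of this parameterization note (pure algebra on the quantities computed above), multiplying the two estimates recovers \(\mu\):

theta * alpha                 # equals mu, the gamma mean
exp(coef(fit.gamma)[1])       # mu directly from the first coefficient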
The results below assume that the data follow a Pareto distribution and use the VGAM library for estimation of the parameters.
fit.pareto <- vglm(Claim ~ 1, paretoII, loc = 0, data = claim_data)
summary(fit.pareto)
Call:
vglm(formula = Claim ~ 1, family = paretoII, data = claim_data,
loc = 0)
Pearson residuals:
Min 1Q Median 3Q Max
loge(scale) -6.332 -0.8289 0.1875 0.8832 1.174
loge(shape) -10.638 0.0946 0.4047 0.4842 0.513
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 7.7329210 0.0933332 82.853 <2e-16 ***
(Intercept):2 -0.0008753 0.0538642 -0.016 0.987
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of linear predictors: 2
Names of linear predictors: loge(scale), loge(shape)
Log-likelihood: -13404.64 on 2752 degrees of freedom
Number of iterations: 5
No Hauck-Donner effect found in any of the estimates
head(fitted(fit.pareto))
[,1]
[1,] 2285.03
[2,] 2285.03
[3,] 2285.03
[4,] 2285.03
[5,] 2285.03
[6,] 2285.03
coef(fit.pareto)
(Intercept):1 (Intercept):2
7.7329210483 -0.0008752515
exp(coef(fit.pareto))
(Intercept):1 (Intercept):2
2282.2590626 0.9991251
confint(fit.pareto, level = 0.95) # confidence intervals for model parameters
2.5 % 97.5 %
(Intercept):1 7.5499914 7.9158507
(Intercept):2 -0.1064471 0.1046966
logLik(fit.pareto) # loglikelihood for Pareto
[1] -13404.64
AIC(fit.pareto) # AIC for Pareto
[1] 26813.29
BIC(fit.pareto) # BIC for Pareto
[1] 26823.74
vcov(fit.pareto) # covariance matrix for model parameters
(Intercept):1 (Intercept):2
(Intercept):1 0.008711083 0.004352904
(Intercept):2 0.004352904 0.002901350
The results below assume that the data follow an exponential distribution and use the VGAM library for estimation of the parameters.
fit.exp <- vglm(Claim ~ 1, exponential, data = claim_data)
summary(fit.exp)
Call:
vglm(formula = Claim ~ 1, family = exponential, data = claim_data)
Pearson residuals:
Min 1Q Median 3Q Max
loge(rate) -484.4 0.7682 0.9155 0.9704 1
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.18952 0.02695 -378.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of linear predictors: 1
Name of linear predictor: loge(rate)
Residual deviance: 6569.099 on 1376 degrees of freedom
Log-likelihood: -15407.96 on 1376 degrees of freedom
Number of iterations: 6
No Hauck-Donner effect found in any of the estimates
( theta = 1 / exp(coef(fit.exp)) )
(Intercept)
26622.59
# Can also fit using the "glm" package
fit.exp2 <- glm(Claim ~ 1, data = claim_data, family = Gamma(link = log))
summary(fit.exp2, dispersion = 1)
Call:
glm(formula = Claim ~ 1, family = Gamma(link = log), data = claim_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.287 -2.258 -1.764 -1.178 30.926
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 10.18952 0.02695 378.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 1)
Null deviance: 6569.1 on 1376 degrees of freedom
Residual deviance: 6569.1 on 1376 degrees of freedom
AIC: 28414
Number of Fisher Scoring iterations: 14
( theta <- exp(coef(fit.exp2)) )
(Intercept)
26622.59
The results below assume that the data follow a GB2 distribution and use the maximum likelihood technique for parameter estimation.
# Inference assuming a GB2 distribution - this is more complicated
# The log-likelihood function of the GB2 distribution (negated for minimization)
lik_gb2 <- function (param) {
a_1 <- param[1]
a_2 <- param[2]
mu <- param[3]
sigma <- param[4]
yt <- (log(claim_data$Claim) - mu) / sigma
logexpyt <- ifelse(yt > 23, yt, log(1 + exp(yt)))
logdens <- a_1 * yt - log(sigma) - log(beta(a_1,a_2)) -
(a_1+a_2) * logexpyt - log(claim_data$Claim)
return(-sum(logdens))
}
# "optim" is a general purpose minimization function
gb2_bop <- optim(c(1, 1, 0, 1), lik_gb2, method = c("L-BFGS-B"),
lower = c(0.01, 0.01, -500, 0.01),
upper = c(500, 500, 500, 500), hessian = TRUE)
# Estimates
gb2_bop$par
[1] 2.830928 1.202500 6.328981 1.294552
# Standard error
sqrt(diag(solve(gb2_bop$hessian)))
[1] 0.9997743 0.2918469 0.3901929 0.2190362
# t-statistics
( tstat <- gb2_bop$par / sqrt(diag(solve(gb2_bop$hessian))) )
[1] 2.831567 4.120313 16.220133 5.910217
# density for GB II
gb2_density <- function(x){
a_1 <- gb2_bop$par[1]
a_2 <- gb2_bop$par[2]
mu <- gb2_bop$par[3]
sigma <- gb2_bop$par[4]
xt <- (log(x) - mu) / sigma
  logexpxt <- ifelse(xt > 23, xt, log(1 + exp(xt)))  # use xt itself when exp(xt) would overflow
logdens <- a_1 * xt - log(sigma) - log(beta(a_1, a_2)) -
(a_1+a_2) * logexpxt - log(x)
exp(logdens)
}
# AIC using formula : -2 * (loglik) + 2 * (number of parameters)
-2 * ( sum(log(gb2_density(claim_data$Claim))) ) + 2 * 4
[1] 26768.13
# BIC using formula : -2 * (loglik) + (number of parameters) * (log(n))
-2 *( sum(log(gb2_density(claim_data$Claim))) ) + 4 * log(nrow(claim_data))
[1] 26789.04
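Equivalently, since optim() stores the minimized negative log-likelihood in gb2_bop$value, the same quantities can be recovered from it; a quick check (added here, not part of the original code):
2 * gb2_bop$value + 2 * 4 # AIC, should reproduce the value above
2 * gb2_bop$value + 4 * log(nrow(claim_data)) # BIC, should reproduce the value above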
This section plots, on a logarithmic scale, the smooth (nonparametric) density of claims and overlays the densities of the fitted distributions considered above.
# None of these distributions is doing a great job....
plot(density(log(claim_data$Claim)), main = "", xlab = "Log Expenditures",
ylim = c(0 ,0.37))
x <- seq(0, 15, by = 0.01)
fexp_ex <- dgamma(exp(x), scale = exp(-coef(fit.exp)), shape = 1) * exp(x)
lines(x, fexp_ex, col = "red")
fgamma_ex <- dgamma(exp(x), shape = alpha, scale = theta) * exp(x)
lines(x, fgamma_ex, col = "blue")
fpareto_ex <- dparetoII(exp(x), loc = 0, shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1])) * exp(x)
lines(x, fpareto_ex, col = "purple")
flnorm_ex <- dlnorm(exp(x), mean = coef(fit.LN)[1],
sd = exp(coef(fit.LN)[2])) * exp(x)
lines(x, flnorm_ex, col = "lightblue")
# Density for GB II
gb2_density <- function (x) {
a_1 <- gb2_bop$par[1]
a_2 <- gb2_bop$par[2]
mu <- gb2_bop$par[3]
sigma <- gb2_bop$par[4]
xt <- (log(x) - mu) / sigma
logexpxt <- ifelse(xt > 23, xt, log(1 + exp(xt)))
logdens <- a_1 * xt - log(sigma) - log(beta(a_1, a_2)) -
(a_1+a_2) * logexpxt -log(x)
exp(logdens)
}
fGB2_ex <- gb2_density(exp(x)) * exp(x)
lines(x, fGB2_ex, col="green")
legend("topleft", c("log(claim_data$Claim)", "Exponential", "Gamma", "Pareto",
"lognormal", "GB2"),
lty = 1, col = c("black","red","blue","purple","lightblue","green"))
This section illustrates non-parametric tools, including moment estimators, the empirical distribution function, quantiles, and density estimators.
The \(k\)th moment \(E[X^k]\) is estimated by \(\frac{1}{n}\sum_{i=1}^{n}X_i^k\). When \(k=1\), the estimator is the sample mean. The \(k\)th central moment is defined as \(E[(X-\mu)^k]\); when \(k=2\), the central moment is the variance. The code below illustrates the mean and variance.
# Start with a simple example of ten points
( x_example <- c(10, rep(15,3), 20, rep(23,4), 30) )
[1] 10 15 15 15 20 23 23 23 23 30
# Summary
summary(x_example) # mean
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 15.0 21.5 19.7 23.0 30.0
sd(x_example)^2 # variance
[1] 34.45556
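To connect this output with the formulas above, here is a minimal hand computation (added for illustration): the first raw moment is the sample mean, while the second central moment with the \(1/n\) divisor differs slightly from sd()^2, which divides by \(n-1\).
mean(x_example) # first raw moment = sample mean, 19.7
mean((x_example - mean(x_example))^2) # second central moment with the 1/n divisor, about 31.01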
The graph below gives the empirical distribution function of the x_example dataset, \(F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)\).
percentiles_x_example <- ecdf(x_example)
# Empirical distribution function
plot(percentiles_x_example, main = "", xlab = "x")
The results below give the quantiles.
# Quantiles
quantile(x_example)
0% 25% 50% 75% 100%
10.0 15.0 21.5 23.0 30.0
# Quantiles: set your own probabilities
quantile(x_example, probs = seq(0, 1, 0.333333))
0% 33.3333% 66.6666% 99.9999%
10.00000 15.00000 23.00000 29.99994
# help(quantile)
The results below give density plots using the default Gaussian kernel (with two different bandwidths) and the triangular kernel.
# Density plot
plot(density(x_example), main = "", xlab = "x")
plot(density(x_example, bw = .33), main = "", xlab = "x") # change the bandwidth
plot(density(x_example, kernel = "triangular"), main="", xlab = "x") # change the kernel
This section employs non-parametric estimation tools for model selection for the claims data of the Property Fund.
The results below give the empirical distribution functions of the claims and of the claims in logarithmic units.
claim_lev <- read.csv("DATA/CLAIMLEVEL.csv", header=TRUE)
nrow(claim_lev) # 6258
[1] 6258
claim_data <- subset(claim_lev, Year==2010) # 2010 subset
# Empirical distribution function of Property Fund
par(mfrow = c(1, 2))
percentiles <- ecdf(claim_data$Claim)
log_percentiles <- ecdf(log(claim_data$Claim))
plot(percentiles, main = "", xlab = "Claims")
plot(log_percentiles, main = "", xlab = "Logarithmic Claims")
The graph below shows a histogram (with shaded gray rectangles) of logarithmic property claims from 2010. The thick blue curve is a Gaussian kernel density estimate whose bandwidth was selected automatically using an ad hoc rule based on the sample size and volatility of the data.
# Density comparison
hist(log(claim_data$Claim), main = "", ylim = c(0, .35), xlab = "Log Expenditures",
freq = FALSE, col = "lightgray")
lines(density(log(claim_data$Claim)), col = "blue", lwd = 2.5)
lines(density(log(claim_data$Claim), bw = 1), col = "green")
lines(density(log(claim_data$Claim), bw = .1), col = "red", lty = 3)
density(log(claim_data$Claim))$bw # default bandwidth
[1] 0.3255908
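The "ad hoc rule" mentioned above is R's default bandwidth selector for density(), Silverman's rule of thumb; as a check (added, not in the original code):
bw.nrd0(log(claim_data$Claim)) # should match the default bandwidth reported above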
The results below fit the gamma and Pareto distributions to the claims data.
library(MASS)
library(VGAM)
# Inference assuming a gamma distribution
fit.gamma_2 <- glm(Claim ~ 1, data = claim_data, family = Gamma(link = log))
summary(fit.gamma_2, dispersion = gamma.dispersion(fit.gamma_2))
Call:
glm(formula = Claim ~ 1, family = Gamma(link = log), data = claim_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.287 -2.258 -1.764 -1.178 30.926
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 10.18952 0.04999 203.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 3.441204)
Null deviance: 6569.1 on 1376 degrees of freedom
Residual deviance: 6569.1 on 1376 degrees of freedom
AIC: 28414
Number of Fisher Scoring iterations: 14
( theta <- exp(coef(fit.gamma_2)) * gamma.dispersion(fit.gamma_2)) # theta = mu / alpha, since the gamma mean is mu = alpha * theta
(Intercept)
91613.78
( alpha <- 1 / gamma.dispersion(fit.gamma_2) )
[1] 0.2905959
# Inference assuming a Pareto distribution
fit.pareto <- vglm(Claim ~ 1, paretoII, loc = 0, data = claim_data)
summary(fit.pareto)
Call:
vglm(formula = Claim ~ 1, family = paretoII, data = claim_data,
loc = 0)
Pearson residuals:
Min 1Q Median 3Q Max
loge(scale) -6.332 -0.8289 0.1875 0.8832 1.174
loge(shape) -10.638 0.0946 0.4047 0.4842 0.513
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 7.7329210 0.0933332 82.853 <2e-16 ***
(Intercept):2 -0.0008753 0.0538642 -0.016 0.987
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of linear predictors: 2
Names of linear predictors: loge(scale), loge(shape)
Log-likelihood: -13404.64 on 2752 degrees of freedom
Number of iterations: 5
No Hauck-Donner effect found in any of the estimates
head(fitted(fit.pareto))
[,1]
[1,] 2285.03
[2,] 2285.03
[3,] 2285.03
[4,] 2285.03
[5,] 2285.03
[6,] 2285.03
exp(coef(fit.pareto))
(Intercept):1 (Intercept):2
2282.2590626 0.9991251
The graphs below reinforce the technique of overlaying graphs for comparison purposes, using both the distribution function and the density function. The Pareto distribution provides a better fit.
# Plotting the fit using densities (on a logarithmic scale)
# None of these distributions is doing a great job....
x <- seq(0, 15, by = 0.01)
par(mfrow = c(1, 2))
log_percentiles <- ecdf(log(claim_data$Claim))
plot(log_percentiles, main = "", xlab = "Claims", cex = 0.4)
Fgamma_ex <- pgamma(exp(x), shape = alpha, scale = theta)
lines(x, Fgamma_ex, col = "blue")
Fpareto_ex <- pparetoII(exp(x), loc = 0,shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1]))
lines(x, Fpareto_ex, col = "purple")
legend("bottomright", c("log(claims)", "Gamma", "Pareto"), lty = 1, cex = 0.6,
col = c("black","blue","purple"))
plot(density(log(claim_data$Claim)) , main = "", xlab = "Log Expenditures")
fgamma_ex <- dgamma(exp(x), shape = alpha, scale = theta) * exp(x)
lines(x, fgamma_ex, col = "blue")
fpareto_ex <- dparetoII(exp(x), loc = 0, shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1])) * exp(x)
lines(x, fpareto_ex, col = "purple")
legend("topright", c("log(claims)", "Gamma", "Pareto"), lty = 1, cex = 0.6,
col = c("black","blue","purple"))
The graphs below show \(pp\) plots for the Property Fund data; the fitted gamma is on the left and the fitted Pareto is on the right. Again, the Pareto distribution provides a better fit.
# PP Plot
par(mfrow = c(1, 2))
Fgamma_ex <- pgamma(claim_data$Claim, shape = alpha, scale = theta)
plot(percentiles(claim_data$Claim), Fgamma_ex, xlab = "Empirical DF",
ylab = "Gamma DF", cex = 0.4)
abline(0, 1)
Fpareto_ex <- pparetoII(claim_data$Claim, loc = 0,
shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1]))
plot(percentiles(claim_data$Claim), Fpareto_ex, xlab = "Empirical DF",
ylab = "Pareto DF", cex = 0.4)
abline(0, 1)
#dev.off()
In the graphs below, the quantiles are plotted on the original scale in the left-hand panels and on the log scale in the right-hand panels, allowing the analyst to see where a fitted distribution is deficient.
# Q-Q plot
par(mfrow = c(2, 2))
x_seq <- seq(0.0001, 0.9999, by = 1 / length(claim_data$Claim))
emp_quant <- quantile(claim_data$Claim, x_seq)
gamma_quant <- qgamma(x_seq, shape = alpha, scale = theta)
plot(emp_quant, gamma_quant, xlab = "Empirical Quantile", ylab = "Gamma Quantile")
abline(0, 1)
plot(log(emp_quant), log(gamma_quant), xlab = "Log Emp Quantile",
ylab = "Log Gamma Quantile")
abline(0, 1)
pareto_quant <- qparetoII(x_seq, loc = 0, shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1]))
plot(emp_quant, pareto_quant, xlab = "Empirical Quantile", ylab = "Pareto Quantile")
abline(0, 1)
plot(log(emp_quant), log(pareto_quant), xlab = "Log Emp Quantile",
ylab="Log Pareto Quantile")
abline(0, 1)
For reporting results, it can be effective to supplement graphical displays with selected statistics that summarize model goodness of fit. The results below provide three commonly used goodness-of-fit statistics.
library(goftest)
# Kolmogorov-Smirnov # the test statistic is "D"
ks.test(claim_data$Claim, "pgamma", shape = alpha, scale = theta)
One-sample Kolmogorov-Smirnov test
data: claim_data$Claim
D = 0.26387, p-value < 2.2e-16
alternative hypothesis: two-sided
ks.test(claim_data$Claim, "pparetoII", loc = 0, shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1]))
One-sample Kolmogorov-Smirnov test
data: claim_data$Claim
D = 0.047824, p-value = 0.003677
alternative hypothesis: two-sided
# Cramer-von Mises # the test statistic is "omega_2"
cvm.test(claim_data$Claim, "pgamma", shape = alpha, scale = theta)
Cramer-von Mises test of goodness-of-fit
Null hypothesis: Gamma distribution
with parameters shape = 0.290595934110839, scale =
91613.779421033
data: claim_data$Claim
omega2 = 33.378, p-value = 2.549e-05
cvm.test(claim_data$Claim, "pparetoII", loc = 0, shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1]))
Cramer-von Mises test of goodness-of-fit
Null hypothesis: distribution 'pparetoII'
with parameters shape = 0.999125131378519, scale =
2282.25906257586
data: claim_data$Claim
omega2 = 0.38437, p-value = 0.07947
# Anderson-Darling # the test statistic is "An"
ad.test(claim_data$Claim, "pgamma", shape = alpha, scale = theta)
Anderson-Darling test of goodness-of-fit
Null hypothesis: Gamma distribution
with parameters shape = 0.290595934110839, scale =
91613.779421033
data: claim_data$Claim
An = Inf, p-value = 4.357e-07
ad.test(claim_data$Claim, "pparetoII", loc = 0, shape = exp(coef(fit.pareto)[2]),
scale = exp(coef(fit.pareto)[1]))
Anderson-Darling test of goodness-of-fit
Null hypothesis: distribution 'pparetoII'
with parameters shape = 0.999125131378519, scale =
2282.25906257586
data: claim_data$Claim
An = 4.1264, p-value = 0.007567
Losses follow the distribution function \(F(x)=1-\theta/x,\quad x>\theta\). A sample of 20 losses resulted in the following:
Interval | Number of Losses |
---|---|
(0,10] | 9 |
(10,25] | 6 |
(25,infinity) | 5 |
Calculate the maximum likelihood estimate of \(\theta\).
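Because the data are grouped into intervals, the likelihood is the product of the interval probabilities implied by \(F\) (a brief derivation added for clarity): assuming \(\theta<10\), \(L(\theta)=\left(1-\tfrac{\theta}{10}\right)^{9}\left(\tfrac{\theta}{10}-\tfrac{\theta}{25}\right)^{6}\left(\tfrac{\theta}{25}\right)^{5}\). The code below minimizes the negative log of this expression.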
# Log likelihood function
lik_grp <- function (theta) {
log_like <- log(((1 - (theta / 10))^9) * (((theta / 10) - (theta / 25))^6) *
(((theta / 25))^5))
return(-sum(log_like))
}
# "optim" is a general purpose minimization function
grp_lik <- optim(c(1), lik_grp, method = c("L-BFGS-B"), hessian = TRUE)
# Estimates - Answer "B" on SoA Problem
grp_lik$par
[1] 5.5
# Standard error
sqrt(diag(solve(grp_lik$hessian)))
[1] 1.11243
# t-statistics
( tstat <- grp_lik$par / sqrt(diag(solve(grp_lik$hessian))) )
[1] 4.944132
# Plot of Negative Log-Likelihood function
vllh <- Vectorize(lik_grp, "theta")
theta <- seq(0, 10, by = 0.01)
plot(theta, vllh(theta), pch = 16, main = "Negative Log-Likelihood Function", cex = .25,
xlab = expression(theta), ylab = expression(paste("L(", theta, ")")))
This file contains illustrative R code for simulating losses and computing their moments. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go.
This section shows how to simulate claims from a gamma distribution, either directly with rgamma() or via the inversion method (shown in the commented-out code below). The results below are summary statistics for the simulated data.
# Simulation - gamma
library(moments)
set.seed(2) # set seed to reproduce work
n_tot <- 20000 # number of simulations
alpha <- 2
theta <- 100
losses <- rgamma(n_tot, alpha, scale = theta)
summary(losses)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0921 96.3265 167.8035 200.1747 268.2257 1110.1298
k <- 0.95
percentile_loss <- quantile(losses,k) # Kth percentile of losses
percentile_loss
95%
473.8218
#######################################
# OR you can use this method to simulate losses
# Fx <- runif(n_tot)
# losses <- qgamma(Fx, alpha, scale = theta)
#######################################
# For the Pareto Distribution, use
# library(VGAM)
# n_tot <- 10000 # number of simulations
# alpha <- 3
# theta <- 25000
# losses <- rparetoII(n_tot, scale = theta, shape = alpha)
# rparetoII(n_tot, scale = theta, shape = alpha)
A few quick notes on these commands:
The rgamma() function randomly generates data from the gamma distribution. In this illustration the data was generated from a gamma distribution with parameters shape = alpha = 2 and scale = theta = 100.
The quantile() function provides sample quantiles corresponding to the given probabilities. Here we wanted the simulated loss data corresponding to the 95th percentile.
library(pander)
# Raw moments for k = 0.5
# Theoretical
k <- 0.5
T_0.5 <- round(((theta^k) * gamma(alpha + k)) / gamma(alpha), 2)
# Simulated data raw moments
S_0.5 <- round(moment(losses, order = k, central = FALSE), 2)
# Raw moments for k = 1
# Theoretical
k <- 1
T_1 <- ((theta^k) * gamma(alpha + k)) / gamma(alpha)
# Simulated data raw moments
S_1 <- round(moment(losses, order = k, central = FALSE), 2)
# Raw moments for k = 2
# Theoretical
k <- 2
T_2 <- ((theta^k) * gamma(alpha + k)) / gamma(alpha)
#Simulated data raw moments
S_2<-round(moment(losses, order = k, central = FALSE),2)
# Raw moments for k = 3
# Theoretical
k <- 3
T_3<-((theta^k) * gamma(alpha + k)) / gamma(alpha)
# Simulated data raw moments
S_3 <- round(moment(losses, order = k, central = FALSE), 2)
# Raw moments for k = 4
# Theoretical
k <- 4
T_4 <- ((theta^k) * gamma(alpha + k)) / gamma(alpha)
# Simulated data raw moments
S_4 <- round(moment(losses, order = k, central = FALSE), 2)
pander(rbind(c("k", 0.5, 1, 2, 3, 4), c("Theoretical", T_0.5, T_1, T_2, T_3, T_4),
c("Simulated", S_0.5, S_1, S_2, S_3, S_4)))
k | 0.5 | 1 | 2 | 3 | 4 |
Theoretical | 13.29 | 200 | 60000 | 2.4e+07 | 1.2e+10 |
Simulated | 13.3 | 200.17 | 60069.73 | 23993892.53 | 11897735665.89 |
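The theoretical values in the table come from the gamma raw-moment formula \(E[X^k]=\theta^k\,\Gamma(\alpha+k)/\Gamma(\alpha)\), the same expression evaluated in the code above; for example (an added check):
(100^2) * gamma(2 + 2) / gamma(2) # theoretical second raw moment, 60000 as in the table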
A few quick notes on these commands:
The moment() function computes all the sample moments of the chosen type up to a given order. In this illustration, we wanted the raw sample moments of the \(k\)th order.
The round() function was used to round values.
This file demonstrates simulation of aggregate claim distributions. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go.
\(S = X_1 + \cdots + X_N\)
Assume \(N \sim \textrm{Poisson}(\lambda = 2)\) and \(X \sim \textrm{Pareto}(\alpha = 3, \theta = 5000)\).
lambda <- 2
alpha <- 3
theta <- 5000
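As a side check (an addition, not part of the original text), the compound-distribution formulas give \(E[S]=E[N]\,E[X]=\lambda\,\theta/(\alpha-1)\) for this Poisson frequency and Pareto (Lomax) severity, which we can compare with the simulated mean reported below.
lambda * theta / (alpha - 1) # theoretical E[S] = 5000; the simulated mean below is about 4830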
Graphing our frequency (N) and severity (X) distributions.
par(mfrow=c(1,2))
n <- 1:10
fn <- dpois(1:10,lambda)
plot(n,fn,ylim=c(0,0.3),main="Frequency: Poisson")
abline(h=0,lty=2)
x <- seq(1,25000,1)
fx <- alpha*theta^alpha/(x+theta)^(alpha+1)
plot(x,fx,type="l",main="Severity: Pareto")
We’re going to simulate 5000 observations of S
set.seed(123)
size <- 5000
S <- rep(NA,size)
N <- rpois(size,lambda)
for (i in 1:size){
uu <- runif(N[i])
X <- theta*((1-uu)^(-1/alpha)-1)
S[i] <- sum(X)
}
par(mfrow=c(1,2))
hist(S,freq=F,breaks=100)
plot(ecdf(S),xlab="S")
Here we show numerical descriptions of our simulated distribution S
mean(S) # sample mean
[1] 4829.894
sd(S) # sample standard deviation
[1] 6585.567
quantile(S,prob=c(0.05,0.5,0.95)) # percentiles
5% 50% 95%
0.000 2846.073 15983.408
sum((S==0))/size # Pr(S = 0)
[1] 0.1348
sum(S<=mean(S))/size # Pr(S <= E(S))
[1] 0.6578
sum(S>mean(S))/size # Pr(S > E(S))
[1] 0.3422
VaR <- quantile(S,prob=0.99) # significance level = 0.01
CTE <- sum(S*(S>VaR))/sum((S>VaR))
rm <- c(VaR,CTE)
names(rm) <- c("VaR","CTE")
print(rm)
VaR CTE
28636.56 43193.19
Here we plot how the premium for a stop-loss insurance product, \(E[(S-d)_+]\), changes with the size of the deductible \(d\). At each deductible, the simulated premium is the sample average of \((S-d)_+\).
par(mfrow=c(1,1))
d <- seq(0,120000,1000)
price <- rep(NA,length(d))
for (i in 1:length(d)){
price[i] = sum((S-d[i])*(S>d[i]))/size
}
plot(d,price,xlab="Deductible",ylab="Stop-Loss Premium",type="b")
This file contains illustrative R code for calculations involving frequency and severity of distributions. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go.
Before we can do any analysis we must import the data.
Import the csv file into R.
dat <- read.csv(file = "Data/MassAuto.csv",header=TRUE)
This code outputs a list of all the variable names in the file. This is useful for determining what kind of data you are working with.
names(dat)
[1] "VIN" "Territory" "Class" "Loss1" "Loss2"
This code creates a new column representing the sum of the two loss columns.
dat$Loss <- dat$Loss1 + dat$Loss2
You may have to install the package “dplyr” to run this code. Here we aggregate the data to the vehicle (VIN) level, computing the total loss and the number of records with a positive loss for each vehicle.
library(dplyr)
freq.dat <- dat %>% group_by(VIN) %>% summarise(tLoss = sum(Loss),count = sum(Loss>0))
dim(freq.dat)
[1] 49894 3
Here we fit a Poisson distribution to the claim counts, using maximum likelihood to estimate the parameter of the distribution. We then calculate the standard error of this estimate.
loglikPois<-function(parms){
lambda=parms[1]
llk <- -sum(log(dpois(freq.dat$count,lambda)))
llk
}
ini.Pois <- 1
zop.Pois <- nlminb(ini.Pois,loglikPois,lower=c(1e-6),upper=c(Inf))
print(zop.Pois)
$par
[1] 0.04475488
$objective
[1] 9243.476
$convergence
[1] 0
$iterations
[1] 17
$evaluations
function gradient
24 20
$message
[1] "relative convergence (4)"
library(numDeriv)
est <- zop.Pois$par
names(est) <- c("lambda")
hess<-hessian(loglikPois,est)
se <-sqrt(diag(solve(hess)))
print(cbind(est,se))
est se
lambda 0.04475488 0.0009471004
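As a quick check (added, not in the original code), the Poisson MLE has a closed form: it is simply the sample mean of the counts, with standard error \(\sqrt{\hat\lambda/n}\). Both should essentially reproduce the numerical results above.
mean(freq.dat$count) # closed-form MLE of lambda
sqrt(mean(freq.dat$count) / nrow(freq.dat)) # closed-form standard error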
Now we fit a negative binomial distribution to the data using log likelihood.
We then calculate the standard error of this estimate.
dnb <- function(y,r,beta){
gamma(y+r)/gamma(r)/gamma(y+1)*(1/(1+beta))^r*(beta/(1+beta))^y
}
loglikNB<-function(parms){
r=parms[1]
beta=parms[2]
llk <- -sum(log(dnb(freq.dat$count,r,beta)))
llk
}
ini.NB <- c(1,1)
zop.NB <- nlminb(ini.NB,loglikNB,lower=c(1e-6,1e-6),upper=c(Inf,Inf))
print(zop.NB)
$par
[1] 0.86573901 0.05169541
$objective
[1] 9218.902
$convergence
[1] 0
$iterations
[1] 27
$evaluations
function gradient
32 63
$message
[1] "relative convergence (4)"
library(numDeriv)
est <- zop.NB$par
names(est) <- c("r","beta")
hess<-hessian(loglikNB,est)
se <-sqrt(diag(solve(hess)))
print(cbind(est,se))
est se
r 0.86573901 0.161126426
beta 0.05169541 0.009686412
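For reference (an added check, not in the original code), the hand-coded dnb() is the standard negative binomial pmf, so it agrees with R’s built-in dnbinom() when size = r and prob = 1/(1+beta); r_hat and beta_hat below are temporary names introduced for this check.
r_hat <- zop.NB$par[1]; beta_hat <- zop.NB$par[2]
all.equal(dnb(0:3, r_hat, beta_hat), dnbinom(0:3, size = r_hat, prob = 1 / (1 + beta_hat)))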
Here we tabulate the empirical claim-count distribution alongside the fitted Poisson and negative binomial counts, as a first look at goodness of fit.
lambda<-zop.Pois$par
r<-zop.NB$par[1]
beta<-zop.NB$par[2]
numrow<-max(freq.dat$count)+1
emp<-rep(0,numrow+1)
for(i in 1:(numrow+1)){
emp[i]<-sum(freq.dat$count==i-1)
}
pois<-rep(0,numrow+1)
for(i in 1:numrow){
pois[i]<-length(freq.dat$count)*dpois(i-1,lambda)
}
pois[numrow+1]<- length(freq.dat$count)-sum(pois)
nb<-rep(0,numrow+1)
for(i in 1:numrow){
nb[i]<-length(freq.dat$count)*dnb(i-1,r,beta)
}
nb[numrow+1]<- length(freq.dat$count)-sum(nb)
freq <- cbind(emp,pois,nb)
rownames(freq) <- c("0","1","2","3",">3")
colnames(freq) <- c("Empirical","Poisson","NegBin")
round(freq,digits=3)
Empirical Poisson NegBin
0 47763 47710.232 47763.629
1 2036 2135.266 2032.574
2 88 47.782 93.203
3 7 0.713 4.376
>3 0 0.008 0.218
Here we compute chi-square statistics to compare the goodness of fit of the two models. We then turn to severity, keeping only the records with a positive loss.
chi.pois <- sum((pois-emp)^2/pois)
chi.negbin <- sum((nb-emp)^2/nb)
chisq <- c(Poisson=chi.pois, NegBin=chi.negbin)
print(chisq)
Poisson NegBin
93.986694 2.087543
sev.dat <- subset(dat,Loss>0)
dim(sev.dat)
[1] 2233 6
You may have to install the package “VGAM” to run this code.
library(VGAM)
fit.LN <- vglm(Loss ~ 1, family=lognormal, data = sev.dat)
summary(fit.LN)
Call:
vglm(formula = Loss ~ 1, family = lognormal, data = sev.dat)
Pearson residuals:
Min 1Q Median 3Q Max
meanlog -4.1427 -0.4450 0.05912 0.5917 2.254
loge(sdlog) -0.7071 -0.6636 -0.51680 -0.1437 11.428
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 7.16870 0.03662 195.76 <2e-16 ***
(Intercept):2 0.54838 0.01496 36.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of linear predictors: 2
Names of linear predictors: meanlog, loge(sdlog)
Log-likelihood: -20400.73 on 4464 degrees of freedom
Number of iterations: 3
No Hauck-Donner effect found in any of the estimates
Coefficients (note that the second parameter, sdlog, is on the log scale).
coef(fit.LN)
(Intercept):1 (Intercept):2
7.1687024 0.5483791
Confidence intervals for model parameters.
confint(fit.LN, level=0.95)
2.5 % 97.5 %
(Intercept):1 7.0969291 7.2404757
(Intercept):2 0.5190507 0.5777076
Loglikelihood for lognormal.
logLik(fit.LN)
[1] -20400.73
AIC for lognormal.
AIC(fit.LN)
[1] 40805.47
BIC for lognormal.
BIC(fit.LN)
[1] 40816.89
Covariance matrix for model parameters.
vcov(fit.LN)
(Intercept):1 (Intercept):2
(Intercept):1 0.001341004 0.000000000
(Intercept):2 0.000000000 0.000223914
Here we estimate sigma directly instead of in log scale.
loglikLN<-function(parms){
mu=parms[1]
sigma=parms[2]
llk <- -sum(log(dlnorm(sev.dat$Loss, mu, sigma)))
llk
}
ini.LN <- c(coef(fit.LN)[1],exp(coef(fit.LN)[2]))
zop.LN <- nlminb(ini.LN,loglikLN,lower=c(-Inf,1e-6),upper=c(Inf,Inf))
print(zop.LN)
$par
(Intercept):1 (Intercept):2
7.168702 1.730446
$objective
[1] 20400.73
$convergence
[1] 0
$iterations
[1] 1
$evaluations
function gradient
2 2
$message
[1] "relative convergence (4)"
library(numDeriv)
est <- zop.LN$par
names(est) <- c("mu","sigma")
hess<-hessian(loglikLN,est)
se <-sqrt(diag(solve(hess)))
print(cbind(est,se))
est se
mu 7.168702 0.03661961
sigma 1.730446 0.02589397
You may have to install the package “VGAM” to run this code.
library(VGAM)
fit.pareto <- vglm(Loss ~ 1, paretoII, loc=0, data = sev.dat)
summary(fit.pareto)
Call:
vglm(formula = Loss ~ 1, family = paretoII, data = sev.dat, loc = 0)
Pearson residuals:
Min 1Q Median 3Q Max
loge(scale) -2.044 -0.7540 0.02541 0.8477 1.2958
loge(shape) -5.813 0.1526 0.36197 0.4322 0.4559
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 7.99629 0.08417 95.004 <2e-16 ***
(Intercept):2 0.52653 0.05699 9.239 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of linear predictors: 2
Names of linear predictors: loge(scale), loge(shape)
Log-likelihood: -20231.91 on 4464 degrees of freedom
Number of iterations: 5
No Hauck-Donner effect found in any of the estimates
head(fitted(fit.pareto))
[,1]
[1,] 1502.569
[2,] 1502.569
[3,] 1502.569
[4,] 1502.569
[5,] 1502.569
[6,] 1502.569
coef(fit.pareto) #note both parameters are in log scale
(Intercept):1 (Intercept):2
7.9962932 0.5265276
exp(coef(fit.pareto)) #estimate of parameters
(Intercept):1 (Intercept):2
2969.928635 1.693043
confint(fit.pareto, level=0.95) #confidence intervals for model parameters
2.5 % 97.5 %
(Intercept):1 7.831327 8.1612593
(Intercept):2 0.414834 0.6382213
logLik(fit.pareto) #loglikelihood for pareto
[1] -20231.91
AIC(fit.pareto) #AIC for pareto
[1] 40467.83
BIC(fit.pareto) #BIC for pareto
[1] 40479.25
vcov(fit.pareto) #covariance matrix for model parameters
(Intercept):1 (Intercept):2
(Intercept):1 0.007084237 0.004453555
(Intercept):2 0.004453555 0.003247586
Here we define the Pareto density explicitly and estimate alpha and theta directly rather than on the log scale.
dpareto <- function(y,theta,alpha){
alpha*theta^alpha/(y+theta)^(alpha+1)
}
loglikP<-function(parms){
theta=parms[1]
alpha=parms[2]
llk <- -sum(log(dpareto(sev.dat$Loss,theta,alpha)))
llk
}
ini.P <- exp(coef(fit.pareto))
zop.P <- nlminb(ini.P,loglikP,lower=c(1e-6,1e-6),upper=c(Inf,Inf))
print(zop.P)
$par
(Intercept):1 (Intercept):2
2969.928635 1.693043
$objective
[1] 20231.91
$convergence
[1] 0
$iterations
[1] 1
$evaluations
function gradient
1 2
$message
[1] "both X-convergence and relative convergence (5)"
library(numDeriv)
est <- zop.P$par
names(est) <- c("theta","alpha")
hess<-hessian(loglikP,est)
se <-sqrt(diag(solve(hess)))
print(cbind(est,se))
est se
theta 2969.928635 248.24191162
alpha 1.693043 0.09590892
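As an added check (not in the original code), the hand-coded dpareto() matches VGAM’s dparetoII() with loc = 0, scale = theta, and shape = alpha; y_test, th_hat, and al_hat are temporary names introduced for this illustration.
y_test <- c(100, 1000, 10000)
th_hat <- unname(zop.P$par[1]); al_hat <- unname(zop.P$par[2])
all.equal(dpareto(y_test, th_hat, al_hat), dparetoII(y_test, loc = 0, scale = th_hat, shape = al_hat))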
Prepare the plotting window parameters so the two histograms display properly side by side.
par(mfrow=c(1,2))
hist(sev.dat$Loss,xlab="Total Losses",main="LN",breaks=100,freq=F,xlim=c(0,3e4),ylim=c(0,8e-4))
x <- seq(1,max(sev.dat$Loss),1)
mu <- zop.LN$par[1]
sigma <- zop.LN$par[2]
lines(x,dlnorm(x,mu,sigma),col="red")
hist(sev.dat$Loss,xlab="Total Losses",main="Pareto",breaks=100,freq=F,xlim=c(0,3e4),ylim=c(0,8e-4))
x <- seq(1,max(sev.dat$Loss),1)
theta <- zop.P$par[1]
alpha <- zop.P$par[2]
lines(x,dpareto(x,theta,alpha),col="blue")
qpareto <- function(p,theta,alpha){theta*((1-p)^(-1/alpha)-1)}
pct <- seq(0.01,0.99,0.01)
par(mfrow=c(1,2))
plot(qlnorm(pct,mu,sigma),quantile(sev.dat$Loss,probs=pct),
main="LN", xlab="Theoretical Quantile", ylab="Empirical Quantile",
xlim=c(0,7.5e4),ylim=c(0,7.5e4))
abline(0,1)
plot(qpareto(pct,theta,alpha),quantile(sev.dat$Loss,probs=pct),
main="Pareto", xlab="Theoretical Quantile", ylab="Empirical Quantile",
xlim=c(0,7.5e4),ylim=c(0,7.5e4))
abline(0,1)
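For reference, the hand-coded qpareto() above inverts the Pareto (Lomax, loc = 0) distribution function \(F(x)=1-\left(\frac{\theta}{x+\theta}\right)^{\alpha}\): solving \(p=F(x)\) for \(x\) gives \(x=\theta\{(1-p)^{-1/\alpha}-1\}\), the expression used in the code.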
This file contains illustrative R code for the Tweedie distribution. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go.
First bring in the tweedie package (you may need to install this package first).
library(tweedie)
Setting the parameters p, mu, and phi defines the specific features of the distribution; for a Tweedie distribution the variance is \(\phi\mu^p\), and for \(1<p<2\) it is a compound Poisson-gamma distribution with a point mass at zero.
Furthermore, setting a specific seed allows us to generate the same random numbers so we can reproduce identical results.
set.seed(123)
p <- 1.5
mu <- exp(1)
phi <- exp(1)
The sample size is set to 500 for this example. “y” holds all 500 observations from the Tweedie distribution with the given parameters.
n <- 500
y <- rtweedie(n,p,mu,phi)
Here we calculate important statistics such as the mean, median, standard deviation, and quantiles.
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.438 2.687 3.878 24.181
sd(y)
[1] 3.346954
quantile(y,seq(0,1,0.1))
0% 10% 20% 30% 40% 50%
0.0000000 0.0000000 0.0000000 0.0000000 0.7275496 1.4378933
60% 70% 80% 90% 100%
2.3767214 3.4212150 5.2317625 7.7471281 24.1813833
Histograms are useful for visually interpreting data. Sometime summary statistics aren’t enough to see the full picture.
hist(y, prob=T,breaks=50)
x <- seq(0,max(y),0.1)
lines(x,dtweedie(x,p,mu,phi),col="red")
A QQ plot is a plot of the quantiles of the first data set against the quantiles of the second data set.
This is a graphical technique for determining whether two data sets come from populations with a common distribution.
It appears here that a power of 1.5 matches the distribution best.
par(mfrow=c(2,2),mar=c(4,4,4,4))
qqTweedie <- function(xi,pct,mu,phi) {
plot(qtweedie(pct,xi,mu,phi),quantile(y,probs=pct),
main=paste("Power = ",xi), xlab="Theoretical Quantile", ylab="Empirical Quantile")
abline(0,1)
}
pct <- seq(0.01,0.99,0.01)
lapply(c(1,1.5,2,2.5),qqTweedie,pct=pct,mu=mu,phi=phi)
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
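The NULL entries above are just the return values of plot() collected by lapply(); if you prefer to suppress them, one option (an added suggestion, not in the original code) is to wrap the call in invisible():
invisible(lapply(c(1, 1.5, 2, 2.5), qqTweedie, pct = pct, mu = mu, phi = phi))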
Here we run a glm for the Tweedie distribution. You may need to first install the “statmod” package.
library(statmod)
fit <- glm(y~1,family=tweedie(var.power=1.5,link.power=0))
summary(fit)
Call:
glm(formula = y ~ 1, family = tweedie(var.power = 1.5, link.power = 0))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5607 -2.5607 -0.6876 0.5155 5.1207
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9885 0.0557 17.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Tweedie family taken to be 2.542861)
Null deviance: 1618.8 on 499 degrees of freedom
Residual deviance: 1618.8 on 499 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 4
We now display the parameter estimates calculated in the glm.
summary(fit)$coefficient
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9885415 0.0556989 17.74795 5.42159e-55
summary(fit)$dispersion
[1] 2.542861
Here we use maximum likelihood estimation (MLE) to estimate the parameters of the Tweedie distribution.
loglik<-function(parms){
p=parms[1]
mu=exp(parms[2])
phi=exp(parms[3])
llk <- -sum(log(dtweedie(y, p, mu, phi)))
llk
}
ini <- c(1.5,1,1)
zop <- nlminb(ini,loglik, lower =c(1+1e-6,-Inf,-Inf),upper =c(2-1e-6,Inf,Inf))
print(zop)
$par
[1] 1.4823346 0.9885411 0.9871154
$objective
[1] 1122.992
$convergence
[1] 0
$iterations
[1] 14
$evaluations
function gradient
23 77
$message
[1] "relative convergence (4)"
Now we calculate the standard errors of our parameter estimates from the MLE. You may need to first install the “numDeriv” package.
library(numDeriv)
est <- zop$par
names(est) <- c("p","mu","phi")
hess<-hessian(loglik,est)
se <-sqrt(diag(solve(hess)))
print(cbind(est,se))
est se
p 1.4823346 0.02260226
mu 0.9885411 0.05672086
phi 0.9871154 0.05060983
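Because mu and phi enter this likelihood on the log scale, we can back-transform the estimates for comparison with the glm output above (an added check): exp of the second element recovers the fitted mean, matching exp of the glm intercept, and exp of the third is the MLE of phi, which can be compared with the glm dispersion estimate.
exp(zop$par[2:3]) # estimates of mu and phi on the original scale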
This file demonstrates both empirical and parametric bootstrap simulation. When reviewing this code, you should open an R session, copy-and-paste the code, and see it perform. Then, you will be able to change parameters, look up commands, and so forth, as you go.
Example: 90% confidence interval for the mean. Consider outcomes of rolling a fair die.
Here we input a sample of 10 observations on which we will base our estimates.
y <- c(1,3,2,5,4,5,5,6,6,6)
n <- length(y)
n
[1] 10
Finding the mean of the random sample.
xbar <- mean(y)
xbar
[1] 4.3
We’re going to generate 30 bootstrap samples of size 10 by resampling with replacement from the original data.
nboot <- 30
tmpdata <- sample(y,n*nboot,replace=TRUE)
bootstrap.sample <- matrix(tmpdata,nrow=n,ncol=nboot)
Here we find the mean of each of our 30 bootstrap samples.
bsmeans <- colMeans(bootstrap.sample)
bsmeans
[1] 3.8 4.8 4.5 4.4 4.4 4.3 4.5 4.5 5.2 4.6 3.5 4.1 4.6 4.3 4.5 4.9 4.6
[18] 4.0 4.8 4.0 4.5 4.0 4.8 4.3 4.2 4.6 3.3 3.4 4.4 3.6
Using the generated bootstrap samples, we calculate a 90% confidence interval for the mean.
CI <- c(quantile(bsmeans,prob=0.05), quantile(bsmeans,prob=0.95))
CI
5% 95%
3.445 4.855
Example: confidence interval for 1/theta. Consider y ~ exponential(theta=10).
This time we generate a random sample of 250 observations from the exponential distribution.
y <- rexp(250,rate=0.1)
n <- length(y)
n
[1] 250
Remember from earlier that the MLE of theta for an exponential distribution is the sample mean.
theta.mle <- mean(y)
rate.hat <- 1/theta.mle
Using the MLE we calculated, we now generate 500 bootstrap samples of 250 observations each.
nboot <- 500
tmpdata <- rexp(n*nboot,rate=rate.hat)
bootstrap.sample <- matrix(tmpdata,nrow=n,ncol=nboot)
We now find the rate parameter for each of our bootstrap samples.
rate.star <- 1/colMeans(bootstrap.sample)
Subtracting the original sample rate parameter from each of our simulated rate parameters, we can find the 5th and 95th percentiles of the differences.
delta.star <- rate.star - rate.hat
delta.lb <- quantile(delta.star,prob=0.05)
delta.ub <- quantile(delta.star,prob=0.95)
We use the percentiles to create a 90% CI for the rate parameter.
CI <- rate.hat - c(delta.ub,delta.lb)
print(CI)
95% 5%
0.08817793 0.10884260
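The interval above is the basic (pivot-style) bootstrap interval. As a point of comparison (an added illustration, not in the original code), the simpler percentile interval takes the quantiles of the simulated rates directly:
quantile(rate.star, prob = c(0.05, 0.95)) # percentile-method 90% interval for the rate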