Title: Modeling of Ordinal Random Variables via Softmax Regression
Description: Supports the modeling of ordinal random variables, like the outcomes of races, via Softmax regression, under the Harville <doi:10.1080/01621459.1973.10482425> and Henery <doi:10.1111/j.2517-6161.1981.tb01153.x> models.
Authors: Steven E. Pav [aut, cre]
Maintainer: Steven E. Pav <[email protected]>
License: LGPL-3
Version: 0.1.1
Built: 2024-11-04 03:03:15 UTC
Source: https://github.com/shabbychef/ohenery
A model for odds linear in some feature.
as.linodds(object, formula, beta)

## S3 method for class 'linodds'
predict(object, newdata, type = c("eta", "mu", "erank"), na.action = na.pass, group = NULL, ...)

## S3 method for class 'linodds'
coef(object, ...)
object |
some list-like object. |
formula |
an object of class formula describing how the odds depend on the independent variables. |
beta |
the fit coefficients. |
newdata |
a data.frame from which the independent variables will be taken when predicting. |
type |
indicates which prediction should be returned: the odds ("eta"), the probabilities ("mu"), or the expected ranks under the Harville model ("erank"). |
na.action |
How to deal with missing values in newdata. |
group |
the string name of the group variable in the data, or a bare character with the group name. The group indices need not be integers, but that is more efficient. They need not be sorted. |
... |
other arguments. |
An object which holds a formula and some fit coefficients; the coefficients are applied to data via the formula to generate odds in odds space.
The odds can then be converted, via predict.linodds, to probabilities, or to expected ranks under the Harville model.
Both harsm and hensm return objects of class linodds.
We think of linear odds as \eta = x^{\top}\beta, for independent variables x. The odds, \eta, are converted to probabilities, \mu, via
\mu = c \exp(\eta),
where the constant c is chosen so the \mu for a given matching sum to one.
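As an illustration, here is a hedged sketch of the round trip from fit to prediction; the simulated data, column names, and coefficient values are invented for this sketch and mirror the harsm examples later in this document.

# simulate some races (illustrative data, not from the package)
set.seed(1234)
g <- ceiling(seq(0.1, 100, by=0.1))                 # race (group) labels
X <- matrix(rnorm(length(g) * 3), ncol=3)
y <- rsm(X %*% c(1, -0.5, 0.25), g)                 # simulated finishing places
data <- cbind(data.frame(outcome=y, race=g), as.data.frame(X))

# harsm returns a linodds object
fitm <- harsm(outcome ~ V1 + V2 + V3, data, group=race)

coef(fitm)
eta_hat <- predict(fitm, newdata=data, type='eta',   group='race')   # odds
mu_hat  <- predict(fitm, newdata=data, type='mu',    group='race')   # probabilities
rnk_hat <- predict(fitm, newdata=data, type='erank', group='race')   # expected ranks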
Steven E. Pav [email protected]
Historical data on the Best Picture nominees and winners from 1934 through 2014.
data(best_picture)
A data.frame
object with 484 observations and 19 columns.
The columns are defined as follows:
year
The integer year of the nomination. These span from 1934 through 2014. Note that the number of films nominated per year varies from 5 to 12.
film
The title of the film.
winner
A logical for whether the film won the Oscar for Best Picture. There is exactly one winning film per year.
nominated_for_Writing
A logical indicating whether the film was also nominated for a Writing award that year.
nominated_for_BestDirector
A logical indicating whether the film was also nominated for Best Director award that year.
nominated_for_BestActress
A logical indicating whether the film was also nominated for at least one Best Actress award that year.
nominated_for_BestActor
A logical indicating whether the film was also nominated for at least one Best Actor award that year.
nominated_for_BestFilmEditing
A logical indicating whether the film was also nominated for at least one Best Film Editing award that year.
Adventure
A double computed as a 0/1 indicator of whether “Adventure” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Biography
A double computed as a 0/1 indicator of whether “Biography” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Comedy
A double computed as a 0/1 indicator of whether “Comedy” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Crime
A double computed as a 0/1 indicator of whether “Crime” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Drama
A double computed as a 0/1 indicator of whether “Drama” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
History
A double computed as a 0/1 indicator of whether “History” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Musical
A double computed as a 0/1 indicator of whether “Musical” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Romance
A double computed as a 0/1 indicator of whether “Romance” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Thriller
A double computed as a 0/1 indicator of whether “Thriller” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
War
A double computed as a 0/1 indicator of whether “War” was one of the genres tagged for the film in IMDb, divided by the total count of genres tagged for the film.
Other
A double computed as 1 minus the sum of the other genre indicators. Effectively this is the sum of indicators for “Mystery”, “Family”, “Fantasy”, “Action”, “Western”, “Music”, “Sport”, “Sci Fi”, “Film-Noir”, “Animation”, and “Horror”, divided by the total count of genres tagged for the film.
“Oscar” is a copyright property of the Academy of Motion Picture Arts and Sciences. IMDb is owned by Amazon.
Steven E. Pav [email protected]
Awards data were sourced from Wikipedia, while genre data were sourced from IMDb. Any errors in transcription are the fault of the package author.
library(dplyr)
data(best_picture)

best_picture %>%
  group_by(nominated_for_BestDirector) %>%
  summarize(propwin=mean(winner)) %>%
  ungroup()

best_picture %>%
  group_by(nominated_for_BestActor) %>%
  summarize(propwin=mean(winner)) %>%
  ungroup()

# hmmmm.
best_picture %>%
  group_by(nominated_for_BestActress) %>%
  summarize(propwin=mean(winner)) %>%
  ungroup()
One hundred years of Men's Olympic Platform Diving records.
data(diving)
A data.frame
object with 695 observations and 13 columns.
The columns are defined as follows:
Name
The participant's name.
Age
The age of the participant at the time of the Olympics. Some values missing.
Height
The height of the participant at the time of the Olympics, in centimeters. Many values missing.
Weight
The weight of the participant at the time of the Olympics, in kilograms. Many values missing.
Team
The string name of the team (country) which the participant represented.
NOC
The string name of the National Olympic Committee which the participant represented. This is a three character code.
Games
The string name of the Olympic games, including a year.
Year
The integer year of the Olympics. These range from 1906 through 2016.
City
The string name of the host city.
Medal
A string of “Gold”, “Silver”,
“Bronze” or NA
.
EventId
A unique integer ID for each Olympics.
AthleteId
A unique integer ID for each participant.
HOST_NOC
The string name of the National Olympic Committee of the nation hosting the Olympics. This is a three character code.
The author makes no guarantees regarding correctness of this data.
Please attribute this data to the upstream harvester.
Steven E. Pav [email protected]
Data were collected by Randi Griffin from the website “sports-reference.com”, and staged on Kaggle at https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results.
library(dplyr)
library(forcats)
data(diving)

fitdat <- diving %>%
  mutate(Finish=case_when(grepl('Gold',Medal)   ~ 1,
                          grepl('Silver',Medal) ~ 2,
                          grepl('Bronze',Medal) ~ 3,
                          TRUE ~ 4)) %>%
  mutate(weight=ifelse(Finish <= 3,1,0)) %>%
  mutate(cut_age=cut(coalesce(Age,22.0),c(12,19.5,21.5,22.5,25.5,99),include.lowest=TRUE)) %>%
  mutate(country=forcats::fct_relevel(forcats::fct_lump(factor(NOC),n=5),'Other')) %>%
  mutate(home_advantage=NOC==HOST_NOC)

hensm(Finish ~ cut_age + country + home_advantage,data=fitdat,weights=weight,group=EventId,ngamma=3)
Compute the expected rank of a bunch of entries based on their probability of winning under the Harville model.
erank(mu)
mu |
a vector giving the probabilities. Should sum to one. |
Given the vector \mu, we compute the expected rank of each entry, under the Harville model, where tail probabilities of winning remain proportional.

Under the Harville model, the probability that the i th element is assigned value 1 is
\mu_i / \sum_j \mu_j.
Once an element has been assigned a 1, the Harville procedure removes it from the set and iterates. If there are n elements in \mu, then the i th element can be assigned any place between 1 and n. This function computes the expected value of that random variable.

While a naive implementation of this function would take time factorial in n, this implementation takes time quadratic in n, since it can be shown that the expected rank of the i th element takes value
e_i = n + 1/2 - \sum_j \frac{\mu_i}{\mu_i + \mu_j}.
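A minimal sketch of that quadratic-time formula (not the package's implementation); the probability vector p is illustrative.

# quadratic-time expected ranks under the Harville model, per the formula above
erank_sketch <- function(p) {
  stopifnot(all(p > 0))
  p <- p / sum(p)                   # ensure the probabilities sum to one
  n <- length(p)
  # pairwise terms mu_i / (mu_i + mu_j), summed over j (including j == i)
  n + 0.5 - sapply(seq_len(n), function(i) sum(p[i] / (p[i] + p)))
}

p <- (1:4) / 10
erank_sketch(p)          # should agree with erank(p)
sum(erank_sketch(p))     # should equal sum(1:4)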
The expected ranks, a vector. Note that the sum of the expected ranks should equal the sum of 1:length(mu).
Steven E. Pav [email protected]
# a garbage example
set.seed(12345)
mus <- runif(12)
mus <- mus / sum(mus)
erank(mus)

# confirm the expected rank via simulation
set.seed(123)
mus <- runif(6,min=0,max=2)
mus <- mus / sum(mus)
set.seed(101)
emp <- rowMeans(replicate(200,rhenery(mu=mus,gamma=rep(1,length(mus)-1))))
(emp - erank(mus)) / emp

if (require(microbenchmark)) {
  p10 <- 1:10 / sum(1:10)
  p16 <- 1:16 / sum(1:16)
  p24 <- 1:24 / sum(1:24)
  microbenchmark(erank(p10), erank(p16), erank(p24))
}
A user friendly interface to the softmax regression under the Harville model.
harsm(formula, data, group = NULL, weights = NULL, na.action = na.omit)

## S3 method for class 'harsm'
vcov(object, ...)

## S3 method for class 'harsm'
print(x, ...)
formula |
an object of class formula (or one that can be coerced to that class): a symbolic description of the model to be fitted. |
data |
an optional data frame, list or environment (or object
coercible by as.data.frame to a data frame) containing the variables in the model. |
group |
the string name of the group variable in the data, or a bare character with the group name. The group indices need not be integers, but that is more efficient. They need not be sorted. |
weights |
an optional vector of weights, or the string or bare name of the
weights in the data to be used in the fitting process. |
na.action |
How to deal with missing values in the outcomes, groups, weights, etc. |
object |
an object of class harsm. |
... |
additional arguments to be passed to the low level regression fitting functions (see below). |
x |
an object of class harsm to be printed. |
Performs a softmax regression by groups, via Maximum Likelihood Estimation, under the Harville model.
We fit coefficients \beta where the odds are \eta = x^{\top}\beta for independent variables x.
The probability of taking first place is then \mu = c \exp(\eta), where the constant c is chosen so the \mu sum to one.
Under the Harville model, conditional on the first place finisher having been observed, the probability model for second (and successive) places is the same, with the probabilities of the remaining participants renormalized.
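As a small numeric illustration of that successive renormalization (the win probabilities below are invented, not package output):

# probability, under the Harville model, that entrant 2 finishes first,
# entrant 1 second, and entrant 3 third, given win probabilities mu
mu <- c(0.5, 0.3, 0.2)
prob_213 <- mu[2] * (mu[1] / (mu[1] + mu[3])) * (mu[3] / mu[3])   # last factor is always one
prob_213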
The print
method of the harsm
object includes
a display of the R-squared. This measures the improvement
in squared errors of the expected rank from the model
over the null model which posits that all odds are equal.
When the formula includes an offset, a ‘delta R-squared’
is also output. This is the improvement in predicted
ranks over the model based on the offset term.
Note that the expected ranks are only easy to produce
under the Harville model; under the Henery model,
the summary R-squared statistics are not produced.
Note that this computation uses weighted sums of squares,
as the weights contribute to the likelihood term.
However, the square sum computation does not take into
account the standard error of the rank, and so
unlike in linear regression, the softmax regression
does not always give positive R-squareds,
and the statistic is otherwise hard to interpret.
An object of class harsm (and also of class maxLik) holding the fit.
Since version 0.1.0 of this package, the normalization of weights used in this function has changed under the hood. This is to give correct inference in the case where zero weights are used to signify that finishing places were not observed. If in doubt, please confirm inference by simulations, taking as example the simulations in the README.
This regression may give odd results when the outcomes are tied, imposing an arbitrary order on the tied outcomes. Moreover, no warning may be issued in this case. In future releases, ties may be dealt with differently, perhaps in analogy to how ties are treated in the Cox Proportional Hazards regression, using the methods of Breslow or Efron.
To avoid incorrect inference when only the top performers are recorded, and all others are effectively tied, one should use weighting. Set the weights to zero for participants who are tied non-winners, and one for the rest. So, for example, if you observe the Gold, Silver, and Bronze medal winners of an Olympic event that had a starting field of 12 participants, set weights to 1 for the medal winners, and 0 for the others. Note that the weights do not attach to the participants; they attach to the place they took.
Steven E. Pav [email protected]
nfeat <- 5
set.seed(1234)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
y <- rsm(eta,g)

# now the pretty frontend
data <- cbind(data.frame(outcome=y,race=g),as.data.frame(X))
fmla <- outcome ~ V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race)

eta0 <- rowMeans(X)
data <- cbind(data.frame(outcome=y,race=g,eta0=eta0),as.data.frame(X))
fmla <- outcome ~ offset(eta0) + V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race)

# with weights
data <- cbind(data.frame(outcome=y,race=g,eta0=eta0),as.data.frame(X))
data$wts <- runif(nrow(data),min=1,max=2)
fmla <- outcome ~ offset(eta0) + V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race,weights=wts)

# softmax on the Best Picture data
data(best_picture)
df <- best_picture
df$place <- ifelse(df$winner,1,2)
df$weight <- ifelse(df$winner,1,0)
fmla <- place ~ nominated_for_BestDirector + nominated_for_BestActor + Drama
harsm(fmla,data=df,group=year,weights=weight)

# test against logistic regression
if (require(dplyr)) {
  nevent <- 10000
  set.seed(1234)
  adf <- data_frame(eventnum=floor(seq(1,nevent + 0.7,by=0.5))) %>%
    mutate(x=rnorm(n()),
           program_num=rep(c(1,2),nevent),
           intercept=as.numeric(program_num==1),
           eta=1.5 * x + 0.3 * intercept,
           place=ohenery::rsm(eta,g=eventnum))

  # Harville model
  modh <- harsm(place ~ intercept + x,data=adf,group=eventnum)

  # the collapsed data.frame for glm
  ddf <- adf %>%
    arrange(eventnum,program_num) %>%
    group_by(eventnum) %>%
    summarize(resu=as.numeric(first(place)==1),
              delx=first(x) - last(x),
              deli=first(intercept) - last(intercept)) %>%
    ungroup()

  # glm logistic fit
  modg <- glm(resu ~ delx + 1,data=ddf,family=binomial(link='logit'))

  all.equal(as.numeric(coef(modh)),as.numeric(coef(modg)),tolerance=1e-4)
  all.equal(as.numeric(vcov(modh)),as.numeric(vcov(modg)),tolerance=1e-4)
}
The inverse link function for the softmax. This function takes the group-wise probabilities, \mu, and computes the expected ranks within each group under the Harville model. That is, it is a groupwise computation of the erank function.
harsm_invlink(eta, mu = smax(eta, g), g = NULL)
eta |
a vector of the odds.
Must be the same length as g if g is given. |
mu |
a vector of the probabilities. Should sum to one,
at least per group.
Should be the same size as eta. |
g |
a vector giving the group indices. If NULL, then all elements are in the same group. |
a vector of the ranks.
Steven E. Pav [email protected]
the ungrouped version of this, erank
.
mus <- runif(12)
mus <- mus / sum(mus)
harsm_invlink(mus)
# grouped: pass the group indices via g, per the documented signature
harsm_invlink(mus,g=c(rep(1,6),rep(2,6)))
An “experts only” softmax fitting function for the Harville model.
harsmfit(y, g, X, wt = NULL, eta0 = NULL, normalize_wt = FALSE, method = c("BFGS", "NR", "CG", "NM"))
y |
a vector of the ranked outcomes within each group. Only the order within a group matters. |
g |
a vector giving the group indices. Need not be integers, but
that is more efficient. Need not be sorted.
Must be the same length as y. |
X |
a matrix of the independent variables. Must have as many rows
as the length of y. |
wt |
an optional vector of the observation level weights. These must
be non-negative, otherwise an error is thrown. Note that the weight of
the last ranked outcome within a group is essentially ignored.
Must be the same length as y. |
eta0 |
an optional vector of the consensus odds. These are added to
the fit odds in odds space before the likelihood calculation. If given,
then when the model is used to predict, similar consensus odds must be
given. Must be the same length as y. |
normalize_wt |
if TRUE, the weights, if given, are renormalized to have mean value one. |
method |
maximisation method, currently either "NR" (for Newton-Raphson), "BFGS" (for Broyden-Fletcher-Goldfarb-Shanno), "BFGSR" (for the BFGS algorithm implemented in R), "BHHH" (for Berndt-Hall-Hall-Hausman), "SANN" (for Simulated ANNealing), "CG" (for Conjugate Gradients), or "NM" (for Nelder-Mead). Lower-case letters (such as "nr" for Newton-Raphson) are allowed. If missing, a suitable method is selected automatically. |
Given a number of events, indexed by group, and a vector y of the ranks of each entry within that group, perform maximum likelihood estimation under the softmax and proportional probability model.

The user can optionally supply a vector of \eta_0, which are taken as the fixed, or ‘consensus’ odds. The estimation is then conditional on these fixed odds.

Weighted estimation is supported.
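A hedged sketch of supplying consensus odds; it reuses y, g, and X as constructed in the Examples below, and the choice of eta0 is purely illustrative.

# estimation conditional on fixed 'consensus' odds eta0
eta0 <- rowMeans(X)
modc <- harsmfit(y=y, g=g, X=X, eta0=eta0)
summary(modc)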
The code relies on the likelihood function of harsmlik
,
and MLE code from maxLik
.
An object of class harsm
, maxLik
, and linodds
.
Steven E. Pav [email protected]
Harville, D. A. "Assigning probabilities to the outcomes of multi-entry competitions." Journal of the American Statistical Association 68, no. 342 (1973): 312-316. http://dx.doi.org/10.1080/01621459.1973.10482425
the likelihood function, harsmlik
, and the
expected rank function (the inverse link), erank
.
nfeat <- 5
set.seed(1234)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
y <- rsm(eta,g)

mod0 <- harsmfit(y=y,g=g,X=X)
summary(mod0)

# now upweight finishers 1-5
modw <- harsmfit(y=y,g=g,X=X,wt=1 + as.numeric(y < 6))
summary(modw)
A user friendly interface to the softmax regression under the Henery model.
hensm(formula, data, group = NULL, weights = NULL, ngamma = 4, na.action = na.omit)

## S3 method for class 'hensm'
vcov(object, ...)

## S3 method for class 'hensm'
print(x, ...)
formula |
an object of class formula (or one that can be coerced to that class): a symbolic description of the model to be fitted. |
data |
an optional data frame, list or environment (or object
coercible by as.data.frame to a data frame) containing the variables in the model. |
group |
the string name of the group variable in the data, or a bare character with the group name. The group indices need not be integers, but that is more efficient. They need not be sorted. |
weights |
an optional vector of weights, or the string or bare name of the
weights in the data to be used in the fitting process. |
ngamma |
The number of gammas to fit. Should be at least 2. |
na.action |
How to deal with missing values in the outcomes, groups, weights, etc. |
object |
an object of class hensm. |
... |
additional arguments to be passed to the low level regression fitting functions (see below). |
x |
an object of class hensm to be printed. |
Performs a softmax regression by groups, via Maximum Likelihood Estimation.
It is assumed that successive sub-races maintain the proportional probability of the softmax, up to some gamma coefficients, \gamma_2, \gamma_3, \ldots, which we fit. This model nests the Harville model fit by harsm, by fixing all the gammas equal to 1.
An object of class hensm (and also of class maxLik) holding the fit.
This regression may give odd results when the outcomes are tied, imposing an arbitrary order on the tied outcomes. Moreover, no warning may be issued in this case. In future releases, ties may be dealt with differently, perhaps in analogy to how ties are treated in the Cox Proportional Hazards regression, using the methods of Breslow or Efron.
To avoid incorrect inference when only the top performers are recorded, and all others are effectively tied, one should use weighting. Set the weights to zero for participants who are tied non-winners, and one for the rest. So, for example, if you observe the Gold, Silver, and Bronze medal winners of an Olympic event that had a starting field of 12 participants, set weights to 1 for the medal winners, and 0 for the others. Note that the weights do not attach to the participants; they attach to the place they took.
Since version 0.1.0 of this package, the normalization of weights used in this function has changed under the hood. This is to give correct inference in the case where zero weights are used to signify that finishing places were not observed. If in doubt, please confirm inference by simulations, taking as example the simulations in the README.
Steven E. Pav [email protected]
nfeat <- 5
set.seed(1234)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
# 2FIX: do rhenery
y <- rsm(eta,g)

# now the pretty frontend
data <- cbind(data.frame(outcome=y,race=g),as.data.frame(X))
fmla <- outcome ~ V1 + V2 + V3 + V4 + V5
fitm <- hensm(fmla,data,group=race)

# with offset
eta0 <- rowMeans(X)
data <- cbind(data.frame(outcome=y,race=g,eta0=eta0),as.data.frame(X))
fmla <- outcome ~ offset(eta0) + V1 + V2 + V3 + V4 + V5
fitm <- hensm(fmla,data,group=race)

# on horse race data
library(dplyr)
data(race_data)
df <- race_data %>%
  group_by(EventId) %>%
    mutate(eta0=log(WN_pool / sum(WN_pool))) %>%
  ungroup() %>%
  mutate(weights=ifelse(!is.na(Finish),1,0)) %>%
  mutate(fac_age=cut(Age,c(0,3,5,7,Inf),include.lowest=TRUE))

# Henery Model with market efficiency
hensm(Finish ~ eta0,data=df,group=EventId,weights=weights,ngamma=3)

# look for age effect not captured by consensus odds.
hensm(Finish ~ offset(eta0) + fac_age,data=df,group=EventId,weights=weights,ngamma=2)
The inverse softmax function: take a logarithm and center.
inv_smax(mu, g = NULL)
mu |
a vector of the probabilities.
Must be the same length as g if g is given. |
g |
a vector giving the group indices. If NULL, then all elements are in the same group. |
This is the inverse of the softmax function. Given a vector \mu for a single group, finds the vector \eta such that
\eta_i = \log \mu_i + c,
where the constant c is chosen such that the \eta sum to zero:
c = -\frac{1}{n} \sum_i \log \mu_i.
the centered log probabilities.
This function can deal with overflow in a semi-coherent way.
Steven E. Pav [email protected]
# we can deal with large values:
set.seed(2345)
eta <- rnorm(12,sd=1000)
mu <- smax(eta)
eta0 <- inv_smax(mu)
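A small hedged check, not part of the original examples, of the round trip implied by the definitions above; the seed and vector length are arbitrary.

# inv_smax(smax(eta)) should recover eta up to a constant shift,
# i.e. the centered version of eta
set.seed(999)
eta <- rnorm(8)
all.equal(as.numeric(inv_smax(smax(eta))), as.numeric(eta - mean(eta)))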
Divide a vector by its sum, resulting in a vector with sum equal to one.
normalize(x)
x |
vector of input data. |
the input divided by its sum. For the row-wise version, each row is divided by its sum.
This function will return NA
when any elements of the input
are NA
. May return Inf
if the elements sum
to zero.
Steven E. Pav [email protected]
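This topic ships without examples; a tiny illustrative usage (arbitrary values):

normalize(c(1, 2, 5))     # 0.125 0.250 0.625
normalize(c(1, NA, 2))    # NA propagates, per the note above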
Modeling of ordinal outcomes via the softmax function under the Harville and Henery models.
The Harville and Henery models describe the probability of ordered outcomes in terms of some parameters. Typically the ordered outcomes are things like place in a race, or winner among a large number of contestants. The Harville model could be described as a softmax probability for the first place finish, with a recursive model on the remaining places. The Henery model generalizes that to adjust the remaining places with another parameter.
These are best illustrated with an example. Suppose you observe a race of 20 contestants. Contestant number 11 takes first place, number 6 takes second place, and number 17 takes third place, while the fourth through twentieth places are not recorded or not of interest. Under the Harville model, the probability of this outcome can be expressed as
\frac{\mu_{11}}{\sum_i \mu_i} \cdot \frac{\mu_6}{\sum_{i \ne 11} \mu_i} \cdot \frac{\mu_{17}}{\sum_{i \ne 11, 6} \mu_i},
where \mu_i = \exp(\eta_i). In a softmax regression under the Harville model, one expresses the odds as \eta_i = x_i^{\top}\beta, where x_i are independent variables, for some \beta to be fit by the regression.
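A hedged sketch of that probability computation, with arbitrary invented odds (not package code):

# Harville probability that contestant 11 wins, 6 takes second, 17 takes third
set.seed(42)
eta <- rnorm(20)             # illustrative odds for the 20 contestants
mu  <- exp(eta)
prob <- (mu[11] / sum(mu)) *
  (mu[6]  / sum(mu[-11])) *
  (mu[17] / sum(mu[c(-11, -6)]))
prob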
Under the Henery model, one adds gammas, \gamma_2, \gamma_3, \ldots, such that the probability of the outcome above is
\frac{\mu_{11}}{\sum_i \mu_i} \cdot \frac{\mu_6^{\gamma_2}}{\sum_{i \ne 11} \mu_i^{\gamma_2}} \cdot \frac{\mu_{17}^{\gamma_3}}{\sum_{i \ne 11, 6} \mu_i^{\gamma_3}}.
There is no reason to model a \gamma_1 as anything but one, since it would be redundant. The Henery softmax regression estimates the \beta as well as the \gamma_j. To simplify the regression, the higher order gammas are assumed to equal the last fit value. That is, if n_\gamma gammas are fit, we take \gamma_j = \gamma_{n_\gamma} for j > n_\gamma.
The regression supports weighted estimation as well. The weights are applied to the places, not to the participants. The weighted likelihood under the example above, for the Harville model, is
\left(\frac{\mu_{11}}{\sum_i \mu_i}\right)^{w_1} \cdot \left(\frac{\mu_6}{\sum_{i \ne 11} \mu_i}\right)^{w_2} \cdot \left(\frac{\mu_{17}}{\sum_{i \ne 11, 6} \mu_i}\right)^{w_3}.
The weighting mechanism is how this package deals with unobserved places. Rather than marking all runners-up as tied for fourth place, in this case one sets the w_i = 0 for i > 3. The regression is then not asked to make distinctions between the tied runners-up.
This package is a work in progress. Expect breaking changes. Please file any bug reports or issues at https://github.com/shabbychef/ohenery/issues.
ohenery is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
This package is maintained as a hobby.
Steven E. Pav [email protected]
Harville, D. A. "Assigning probabilities to the outcomes of multi-entry competitions." Journal of the American Statistical Association 68, no. 342 (1973): 312-316. http://dx.doi.org/10.1080/01621459.1973.10482425
Henery, R. J. "Permutation probabilities as models for horse races." Journal of the Royal Statistical Society: Series B (Methodological) 43, no. 1 (1981): 86-91. http://dx.doi.org/10.1111/j.2517-6161.1981.tb01153.x
News for package ‘ohenery’
Change default in harsm and hensm to use unnormalized weights, correcting inference when not all finishes are observed.
first CRAN release.
Three weeks of horse race data from tracks worldwide.
data(race_data)
A data.frame
object with 36,418 observations and 19 columns.
The columns are defined as follows:
EventId
An integer ID denoting the event (race). These range from 1 to 4486.
TrackId
An integer ID number of the track. There are 64 different tracks represented.
Type
The type of event, one of “Thoroughbred” or “Harness”.
RaceNum
The integer race number within a group of races at a track on a given date.
CorrectedPostTime
The ‘corrected’ post time of the race, in the form %Y-%m-%d %H:%M:%S
,
presumably in the PDT time zone. Has values like “2019-03-05 02:30:00”.
Yards
The length of the race, in yards.
SurfaceText
A string, one of
“Turf”, “Dirt”, “All-Weather” or NA
.
HorseName
The string name of the horse.
HorseId
A unique integer ID for each horse. As different horses can have the same name, this ID is constructed from the name of the Horse, the Sire and the Dam.
Age
The age of the horse, in integer years, at the time of the event. Typically less than 10.
Sex
A single character denoting the sex of the horse. I
believe the codes are
“M” for “Mare” (female four years or older),
“G” for “Gelding”,
“F” for “Filly” (female under four years of age),
“C” for “Colt” (male under four years of age),
“H” for “Horse” (male four years of age and up),
“R” for “Rig” (hard to explain),
“A” for “???”. There are some NA
values as well.
Weight_lbs
The weight in integer pounds of the jockey and any equipment. Typically around 120.
PostPosition
The integer starting position of the horse. Typically there is a slight advantage to starting at the first or second post position.
Medication
One of several codes indicating any medication the horse may be taking at the time of the race. I believe “L” stands for “Lasix”, a common medication for lung conditions that is thought to give horses a slight boost in speed.
MorningLine
A double indicating the “morning betting line” for win bets on the horse. It is not clear how to interpret this value, perhaps it is return on a dollar. Values range from 0.40 to 80.
WN_pool
The total combined pool in win bets, in dollars, on this horse at post time.
PL_pool
The total combined pool in place bets, in dollars, on this horse at post time.
SH_pool
The total combined pool in show bets, in dollars, on this horse at post time.
Finish
The integer finishing position of the horse. A 1 means first place. We only collect values of 1, 2, and 3, while
the remaining finishing places are unknown and left as NA
.
The author makes no guarantees regarding correctness of this data.
Steven E. Pav [email protected]
Data were sourced from the web. Don't ask.
library(dplyr)
data(race_data)

# compute win bet efficiency
efficiency <- race_data %>%
  group_by(EventId) %>%
    mutate(ImpliedOdds=WN_pool / sum(WN_pool,na.rm=TRUE)) %>%
  ungroup() %>%
  mutate(OddsBucket=cut(ImpliedOdds,c(0,0.05,seq(0.1,1,by=0.10)),include.lowest=TRUE)) %>%
  group_by(OddsBucket) %>%
    summarize(PropWin=mean(as.numeric(coalesce(Finish==1,FALSE)),na.rm=TRUE),
              MedImpl=median(ImpliedOdds,na.rm=TRUE),
              nObs=n()) %>%
  ungroup()

if (require('ggplot2') && require('scales')) {
  efficiency %>%
    ggplot(aes(MedImpl,PropWin,size=nObs)) +
    geom_point() +
    scale_x_sqrt(labels=percent) +
    scale_y_sqrt(labels=percent) +
    geom_abline(slope=1,intercept=0,linetype=2,alpha=0.6) +
    labs(title='actual win probability versus implied win probability',
         size='# horses',
         x='implied win probability',
         y='observed win probability')
}
Given base probabilities, and Henery gamma coefficients, performs random generation, using R's built in rand seed, of the final outcome of a race for each participant.
rhenery(mu, gamma = NULL)
mu |
a vector of the probabilities of taking first place. |
gamma |
a vector of the gamma coefficients. Should have length
one less than mu. |
Given vectors \mu and \gamma, first computes
\pi_i = \frac{\mu_i^{\gamma_1}}{\sum_j \mu_j^{\gamma_1}},
then assigns a 1 to participant i with probability \pi_i. The ‘winning’ participant is then removed from consideration, and the process is repeated using the remaining \mu and \gamma vectors.

Typically one has that \mu_i = \exp(\eta_i), for some ‘odds’, \eta_i.

When the \gamma are all one, you recover the Harville softmax model.
A vector, of the same length as the probabilities, giving the place taken by each participant. Note that the output is indexed by participant, so its expected value is meaningful; it is not the rank ordering of the race listing who finished first, second, and so on.
Steven E. Pav [email protected]
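This topic has no examples in the original; the following hedged sketch is based only on the documented signature, with invented values.

# simulate one 'race' of four participants under the Henery model
set.seed(1234)
mu    <- c(0.4, 0.3, 0.2, 0.1)   # win probabilities, summing to one
gamma <- c(1, 0.8, 0.6)          # one fewer gamma than participants
rhenery(mu=mu, gamma=gamma)

# with all gammas equal to one, this reduces to the Harville model
rhenery(mu=mu, gamma=rep(1, 3))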
Generate variates from a softmax distribution under Harville or Henery models.
rsm(eta, g = NULL, mu = NULL, gamma = NULL)
eta |
a vector of the odds.
Must be the same length as g if g is given. |
g |
a vector giving the group indices. If NULL, then all elements are in the same group. |
mu |
a vector of the probabilities.
Must be the same length as g if g is given. |
gamma |
a vector of the gamma parameters for the
Henery model. Typically the first element should be 1.
Omit or set all values to 1 to recover the Harville model.
The last element will be extended if the gamma is
not long enough for a given group. Note that gamma is expected
to be very short, and is not ‘aligned’ with eta or mu. |
Given the \eta in odds space, and a grouping variable, returns values in one to the number of elements in a group. That is, we have a permutation of 1 through n_g on each group as output.

For a single group, the probability that the i th element of the output is a 1 is equal to
\frac{\mu_i^{\gamma_1}}{\sum_j \mu_j^{\gamma_1}}.
Once an element has been selected to have output 1, remove it from the set and iterate, but with the next \gamma element.
a vector of integers, each a permutation of one through the number of elements in each group.
The output of this function satisfies a kind
of order invariance that rhenery
does not, at the cost of some computational inefficiency.
Namely that for a fixed randseed, and for distinct
eta
, the output is equivariant with respect
to permutation of the vector eta
when there
is only one group.
Regarding the ‘direction’, we associate higher odds with a smaller outcome. That is, the ith element of the output encodes the place that the ith participant took in the simulated ‘race’; it should be small if the odds for that participant are very high.
Steven E. Pav [email protected]
# simple use
set.seed(1234)
g <- ceiling(seq(1,10,by=0.1))
eta <- rnorm(length(g))
y <- rsm(eta,g=g)

# same same:
set.seed(235)
y1 <- rsm(eta,g=g)
set.seed(235)
y2 <- rsm(g=g,mu=smax(eta,g=g))
y1 - y2

# the default model is Harville
set.seed(1212)
y1 <- rsm(eta,g=g)
set.seed(1212)
y2 <- rsm(eta,g=g,gamma=c(1,1,1,1))
y1 - y2

# repeat several times with the cards stacked against
# early runners
set.seed(1234)
colMeans(t(replicate(1000,rsm(sort(rnorm(10,sd=1.5)),g=rep(1,10)))))

# illustrate the invariance
mu <- (1:10) / 55
set.seed(1414)
y1 <- rsm(mu=mu,gamma=c(1,1,1))
set.seed(1414)
y2 <- rev(rsm(mu=rev(mu),gamma=c(1,1,1)))
y1 - y2

nfeat <- 5
set.seed(1234)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
y <- rsm(eta,g=g)
idx <- order(g,y,decreasing=TRUE) - 1
fooey <- harsmlik(g,idx,eta,deleta=X)

set.seed(3493)
dib <- rnorm(length(beta))
xvl <- seq(-0.01,0.01,length.out=301)
rsu <- sapply(xvl,
              function(del) {
                beta1 <- beta + del * dib
                eta1 <- X %*% beta1
                fooey2 <- harsmlik(g,idx,eta1,deleta=X)
                as.numeric(fooey2) - as.numeric(fooey)
              })
drv <- sapply(xvl,
              function(del) {
                beta1 <- beta + del * dib
                eta1 <- X %*% beta1
                fooey2 <- harsmlik(g,idx,eta1,deleta=X)
                sum(attr(fooey2,'gradient') * dib)
              })

if (require('ggplot2') && require('dplyr')) {
  bestx <- xvl[which.max(rsu)]
  ph <- data.frame(x=xvl,lik=rsu,grd=drv) %>%
    ggplot(aes(x=x,y=lik)) +
    geom_point() +
    geom_line(aes(y=grd/200)) +
    geom_vline(xintercept=bestx,linetype=2,alpha=0.5) +
    geom_hline(yintercept=0,linetype=2,alpha=0.5)
  print(ph)
}

if (require('dplyr') && require('knitr')) {
  # expect this to be very small, almost always 1
  set.seed(1234)
  simdraw <- replicate(10000,{
    rsm(eta=c(100,rnorm(7)))[1]
  })
  as.data.frame(table(simdraw)) %>%
    mutate(prob=Freq / sum(Freq)) %>%
    knitr::kable()

  # expect this to be uniform on 2 through 8
  set.seed(1234)
  simdraw <- replicate(10000,{
    rsm(eta=c(100,rnorm(7)))[2]
  })
  as.data.frame(table(simdraw)) %>%
    mutate(prob=Freq / sum(Freq)) %>%
    knitr::kable()
}
The softmax function: exponentiate a vector and then normalize.
smax(eta, g = NULL)
eta |
numeric array of the odds. The odds are de-meaned within each group. |
g |
a vector giving the group indices. If NULL, then all elements are in the same group. |
Given vector \eta for a single group, essentially computes the vector \mu defined by
\mu_i = \frac{\exp(\eta_i)}{\sum_j \exp(\eta_j)}.
Note that this computation should be invariant with respect to level shifts of the \eta, and thus we de-mean the odds first.
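A small hedged check of that level-shift invariance (arbitrary values, not part of the original examples):

eta <- c(0.1, -0.4, 1.2)
all.equal(smax(eta), smax(eta + 100))   # shifting the odds leaves the probabilities unchanged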
the exponentiated data normalized. For the row-wise version, each row is soft maxed.
This function can deal with overflow in a semi-coherent way.
Steven E. Pav [email protected]
# we can deal with large values:
set.seed(2345)
eta <- rnorm(12,sd=1000)
smax(eta)
Compute the softmax log likelihood and gradient of the same.
harsmlik(g, idx, eta, wt = NULL, deleta = NULL)

hensmlik(g, idx, eta, gamma, wt = NULL, deleta = NULL)
g |
a vector giving the group indices. |
idx |
a vector of integers, the same length as g, giving the reverse ordering described in Details: sorted first by group, then within each group from last place to first place (zero-based indices, as in the Examples). |
eta |
a vector of the odds.
Must be the same length as g. |
wt |
an optional vector of non-negative elements, the same length
as eta, giving the observation weights. |
deleta |
an optional matrix whose row count equals the number of elements
of eta; the derivative of eta with respect to some coefficients. If given, the gradient of the log likelihood with respect to those coefficients is returned as an attribute of the result. |
gamma |
a vector of the gamma parameters. It is assumed that the
first element is |
Given vectors g, \eta, and optionally the gradient of \eta with respect to some coefficients, computes the log likelihood under the softmax. The user must provide a reverse ordering as well, which is sorted first by the groups, g, and then, within a group, in increasing quality order. For a race, this means that the index is in order from the last place to the first place in that race, where the group numbers correspond to one race.
Under the Harville model, the log likelihood on a given group, where we are indexing in forward order, is
\sum_i \left( \eta_i - \log \sum_{j \ge i} \mu_j \right),
where \mu_i = \exp(\eta_i). By “forward order”, we mean that \eta_1 corresponds to the participant taking first place within that group, \eta_2 corresponds to the participant who took second place, and so on.
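A hedged spot check of that formula for a single group of three entrants, using the zero-based, worst-to-first idx convention from the Examples below; the odds are invented.

# one group of three: entrant 1 finishes first, entrant 2 second, entrant 3 last
eta <- c(1.0, 0.2, -0.5)
g   <- c(1L, 1L, 1L)
idx <- c(2L, 1L, 0L)    # zero-based indices, from last place to first place
manual <- sum(sapply(1:3, function(i) { eta[i] - log(sum(exp(eta[i:3]))) }))
as.numeric(harsmlik(g, idx, eta)) - manual    # expect approximately zero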
The Henery model log likelihood takes the form
\sum_i \left( \gamma_i \eta_i - \log \sum_{j \ge i} \mu_j^{\gamma_i} \right),
for gamma parameters, \gamma. The Henery model corresponds to the Harville model where all the gammas equal 1.

Weights in weighted estimation apply to each summand. The weight for the last place participant in a group is irrelevant. The weighted log likelihood under the Harville model is
\sum_i w_i \left( \eta_i - \log \sum_{j \ge i} \mu_j \right).
One should think of the weights as applying to the outcome, not the participant.
The log likelihood. If deleta is given, we add an attribute to the scalar number, called gradient, giving the derivative. For the Henery model we also include an attribute called gradgamma, which is the gradient of the log likelihood with respect to the gamma vector.
The likelihood function does not yet support ties.
To avoid incorrect inference when only the top performers are recorded, and all others are effectively tied, one should use weighting. Set the weights to zero for participants who are tied non-winners, and one for the rest. So, for example, if you observe the Gold, Silver, and Bronze medal winners of an Olympic event that had a starting field of 12 participants, set weights to 1 for the medal winners, and 0 for the others. Note that the weights do not attach to the participants; they attach to the place they took.
Steven E. Pav [email protected]
Harville, D. A. "Assigning probabilities to the outcomes of multi-entry competitions." Journal of the American Statistical Association 68, no. 342 (1973): 312-316. http://dx.doi.org/10.1080/01621459.1973.10482425
Henery, R. J. "Permutation probabilities as models for horse races." Journal of the Royal Statistical Society: Series B (Methodological) 43, no. 1 (1981): 86-91. http://dx.doi.org/10.1111/j.2517-6161.1981.tb01153.x
# a garbage example
set.seed(12345)
g <- as.integer(sort(ceiling(20 * runif(100))))
idx <- as.integer(rev(1:length(g)) - 1L)
eta <- rnorm(length(g))
foo <- harsmlik(g=g,idx=idx,eta=eta,deleta=NULL)

# an example with a real idx
nfeat <- 5
set.seed(1234)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
y <- rsm(eta,g)
idx <- order(g,y,decreasing=TRUE) - 1
foores <- harsmlik(g,idx,eta,deleta=X)

# now reweight them
wt <- runif(length(g))
wt <- wt / mean(wt)   # mean 1 is recommended
foores <- harsmlik(g,idx,eta,wt=wt)

# try hensmlik
foores <- hensmlik(g,idx,eta,gamma=c(0.9,0.8,1),wt=wt)

# check the value of the gradient by numerical approximation
nfeat <- 8
set.seed(321)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
y <- rsm(eta,g)
idx <- order(g,y,decreasing=TRUE) - 1

if (require(numDeriv)) {
  fastval <- attr(harsmlik(g,idx,eta,deleta=X),'gradient')
  numap <- grad(function(beta,g,idx,X) {
      eta <- X %*% beta
      as.numeric(harsmlik(g,idx,eta))
    },
    x=beta,g=g,idx=idx,X=X)
  rbind(fastval,numap)
}