Using the text from claims to predict losses
-
A recent paper in the Annals of Actuarial Science, "Extracting information from textual descriptions for actuarial applications", presents a framework for the automated prediction of insurance claim amounts from the textual descriptions of the claims. The method is a general, high-dimensional text analysis built around a three-step approach using modern statistical methods, with a Generalized Additive Model (GAM) at its core. It is designed to work well when the relationship between the response and the covariates is non-linear and the problem is high-dimensional.
A GAM is a flexible statistical model that extends the generalized linear model to allow non-linear relationships between the predictors and the response. Smooth functions model the effect of each predictor, giving a data-driven, nonparametric fit. GAMs can reveal non-linear trends and complex patterns while remaining interpretable.
The advantage of using a GAM rather than "black-box" machine learning algorithms such as random forests or neural networks is that each estimated smooth function can be plotted, giving insight into the factors affecting claim amounts, as the sketch below illustrates.
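As a concrete illustration of this kind of interpretability, here is a minimal sketch that fits a GAM on simulated data and plots each estimated smooth function with a confidence band. It uses the pygam package and made-up data; the paper's own implementation and software may well differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from pygam import LinearGAM, s

# Simulated data: two covariates with non-linear effects on the response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + rng.normal(0, 0.1, 500)

# One spline term per covariate.
gam = LinearGAM(s(0) + s(1)).fit(X, y)

# Plot each estimated smooth function with a 95% confidence band --
# the kind of inspection a black-box model does not offer.
for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
    plt.figure()
    plt.plot(XX[:, i], pdep)
    plt.plot(XX[:, i], confi, c="r", ls="--")
    plt.title(f"Estimated smooth function for covariate {i}")
plt.show()
```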
The text of each claim is converted into a numeric vector using word embeddings and cosine similarities, creating a high-dimensional design matrix (toy sketches of the featurization and of the full pipeline follow the list below). To handle the high dimensionality, a three-step approach is used:
1. Group lasso to reduce dimensions and get weights
2. Adaptive group lasso using weights from step 1 for further dimension reduction
3. Fit a Generalized Additive Model (GAM) on the remaining variables to estimate smooth functions
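To make the featurization step concrete, here is a minimal sketch of one plausible reading of it: each claim description is mapped to the average of its word embeddings, and the cosine similarities to a set of reference terms become the numeric columns of the design matrix. The tiny embedding vectors and reference terms below are invented for illustration; the paper's actual construction may differ.

```python
import numpy as np

# Toy 3-dimensional word embeddings, invented for illustration
# (in practice these would come from a pretrained model such as GloVe).
emb = {
    "roof":  np.array([0.9, 0.1, 0.0]),
    "water": np.array([0.1, 0.8, 0.3]),
    "wind":  np.array([0.7, 0.2, 0.5]),
    "flood": np.array([0.2, 0.9, 0.2]),
    "tree":  np.array([0.5, 0.3, 0.8]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical reference terms; each produces one column of the design matrix.
reference_terms = ["wind", "flood", "tree"]

def featurize(description):
    """Map a claim description to cosine similarities against reference terms."""
    vecs = [emb[w] for w in description.lower().split() if w in emb]
    if not vecs:
        return np.zeros(len(reference_terms))
    doc_vec = np.mean(vecs, axis=0)  # average word embedding of the description
    return np.array([cosine(doc_vec, emb[t]) for t in reference_terms])

claims = ["Wind tore the roof", "Water and flood damage"]
X = np.vstack([featurize(c) for c in claims])  # one row per claim
print(X.round(2))
```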
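And a compact sketch of the three-step pipeline itself. For simplicity it treats each feature as its own group, so the group-lasso steps reduce to an ordinary lasso and an adaptive (weighted) lasso implemented via the standard column-rescaling trick; a faithful implementation would use a dedicated group-lasso solver. The GAM in step 3 is fitted with pygam, and all data and tuning values are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso
from pygam import LinearGAM, s

# Placeholder data standing in for the text-derived design matrix.
rng = np.random.default_rng(1)
n, p = 400, 50
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 3] + rng.normal(0, 0.1, n)

# Step 1: a first lasso screen (group lasso with singleton groups);
# the fitted coefficient magnitudes supply the adaptive weights.
step1 = Lasso(alpha=0.05).fit(X, y)
weights = np.abs(step1.coef_)

# Step 2: adaptive lasso via rescaling -- scale each column by its
# weight, refit the lasso, and keep the surviving features.
step2 = Lasso(alpha=0.05).fit(X * weights, y)
selected = np.flatnonzero(step2.coef_)
print("selected features:", selected)

# Step 3: a GAM with one smooth (spline) term per surviving feature.
X_sel = X[:, selected]
terms = s(0)
for j in range(1, len(selected)):
    terms = terms + s(j)
gam = LinearGAM(terms).fit(X_sel, y)
gam.summary()
```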
The method is applied to a dataset of storm damage claim descriptions and amounts. The final model has 149 variables and shows good out-of-sample predictive performance.
Strengths
• Handles high dimensionality effectively while retaining interpretability
• Computationally efficient and scalable
• Provides a general framework applicable to various text analysis problems
• Strong theoretical justification for variable selection consistency and estimation consistency
Weaknesses
• Preprocessing/feature extraction from the text uses only a simple cosine-similarity representation
• Confidence bands for function estimates need more investigation
• Comparison with machine learning models would be useful
The paper demonstrates a practical and scalable automated method for incorporating textual claims information into the estimation of losses for actuarial reserving and ratemaking.