| Title: | Visualization and Analysis of Nominal Variable Distributions |
|---|---|
| Description: | Provides tools for visualizing and analyzing the shape of discrete nominal frequency distributions. The package introduces centered frequency plots, in which nominal categories are ordered from the most frequent category at the center toward less frequent categories on both sides, facilitating the detection of distributional patterns such as uniformity, dominance, symmetry, skewness, and long-tail behavior. In addition, the package supports Pareto charts for the study of dominance and cumulative frequency structure in nominal data. The package is designed for exploratory data analysis and statistical teaching, offering visualizations that emphasize distributional form rather than arbitrary category ordering. |
| Authors: | Norberto Asensio [aut, cre] (ORCID: <https://orcid.org/0000-0003-4536-5073>) |
| Maintainer: | Norberto Asensio <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.2 |
| Built: | 2026-05-20 09:43:45 UTC |
| Source: | https://github.com/norberello/nomishape |
Tokenized words from Alice's Adventures in Wonderland by Lewis Carroll. Each row represents a single word occurrence, allowing analysis of word frequency distributions and demonstration of Zipf-like behavior in natural language.
    alice
A data frame with one column:
Character. Tokenized word from the text.
Public domain text (Lewis Carroll, 1865)
A dataset of dummy nominal data inspired by characters/species from the Bikini Bottom universe (SpongeBob SquarePants). This dataset simulates a roughly uniform distribution across 11 species, with a total of 250 observations. It was intentionally designed to be uniform-like for testing nominal distribution visualization functions.
A simple dataset of categorical values used for examples.
    categories
A data frame with 250 rows and 1 variable:
Character. Species/animal names. 11 species inspired by Bikini Bottom.
A data frame with 1 column:
Factor with animal categories as letters
Generated for examples
    categories
    # Ranked bar plot of species frequencies
    ranked_barplot(categories, "animal")
    # Centered bar plot (most frequent in the center)
    centered_barplot(categories, "animal")
    # Centered dot plot with theoretical shape overlays
    shape_comp_plot(categories, "animal")
A dataset of dummy nominal data inspired by characters/species from the Bikini Bottom universe (SpongeBob SquarePants). This dataset simulates a roughly triangular distribution of frequencies.
    categories2
A data frame with 250 rows and 11 variables:
Character. Species/animal names.
Integer. Frequency of each species, forming a triangular pattern.
    ranked_barplot(categories2, "animal")
A dataset of dummy nominal data inspired by characters/species from the Bikini Bottom universe (SpongeBob SquarePants). This dataset simulates a highly skewed distribution where a few species dominate most of the frequency (long-tail / exponential pattern). It was intentionally designed for pedagogical purposes to demonstrate dominance and Pareto-like behavior in nominal data.
    categories3
A data frame with 250 rows and 1 variable:
Character. Species/animal names. 11 species inspired by Bikini Bottom.
    categories3
    # Centered dot plot showing exponential/long-tail pattern
    shape_comp_plot(categories3, "animal")
    # Pareto chart highlighting cumulative frequency and dominance
    pareto(categories3, "animal")
    # Optional: ranked or centered bar plots
    ranked_barplot(categories3, "animal")
    centered_barplot(categories3, "animal")
A dataset of dummy nominal data inspired by characters or species from the Bikini Bottom universe (SpongeBob SquarePants).
    categories4
A data frame with 250 rows and 1 variable:
Character or species name (11 categories inspired by Bikini Bottom).
The dataset represents a structured nominal distribution in which a limited number of categories dominate, followed by a gradual and approximately symmetric decline in frequencies. This pattern is consistent with a triangular or normal-like shape rather than a strongly long-tailed (Pareto/exponential) distribution.
The dataset was intentionally designed for pedagogical purposes to illustrate dominance, symmetry, and modal structure in nominal data, and to serve as a contrast with truly long-tailed distributions included elsewhere in the package.
    categories4
    # Centered dot plot showing a structured (normal-like) pattern
    shape_comp_plot(categories4, "animal")
    # Pareto chart showing dominance without a strong long tail
    pareto(categories4, "animal")
    # Ranked and centered bar plots
    ranked_barplot(categories4, "animal")
    centered_barplot(categories4, "animal")
Centered Frequency Bar Plot for Nominal Variables
Creates a centered bar plot for discrete nominal variables by placing the most frequent category at the center and progressively less frequent categories alternately to the left and right.
    centered_barplot(df, var, title = NULL, scale = c("count", "percent"))
| df | A data frame containing the nominal variable. |
|---|---|
| var | A character string giving the name of the nominal variable in df. |
| title | Optional character string specifying the plot title. |
| scale | Character string specifying the scale of the frequencies: "count" or "percent". |
A ggplot2 object.
    centered_barplot(categories, "animal")
    centered_barplot(categories, "animal", scale = "percent")
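The centered ordering described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name `center_order` and the left/right convention (even ranks to the left, odd ranks to the right) are assumptions made for the example.

```r
# Hypothetical helper (not part of nomiShape): arrange category counts so the
# most frequent category sits in the middle and the rest alternate outward.
center_order <- function(x) {
  tab <- table(x)
  counts <- setNames(as.vector(tab), names(tab))
  counts <- counts[order(counts, decreasing = TRUE)]  # rank 1 = most frequent
  ranks <- seq_along(counts)
  # Assumed convention: even ranks go left, odd ranks (after the first) go right
  left  <- rev(ranks[ranks %% 2 == 0])
  right <- ranks[ranks %% 2 == 1 & ranks > 1]
  counts[c(left, 1, right)]
}

x <- rep(c("a", "b", "c", "d", "e"), times = c(10, 7, 5, 3, 1))
centered <- center_order(x)
names(centered)  # categories in plotting order, peak in the middle
```

The returned vector can be passed to any plotting routine that respects element order, for example `barplot(centered)`.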
Creates a centered dot plot for a nominal variable, ordering categories from the most frequent at the center toward less frequent categories on both sides. Optionally connects points with a line and shades the area under the line.
    centered_dotplot(df, var, connect = FALSE, shade = FALSE, scale = c("count", "percent"))
| df | A data.frame or tibble containing the variable. |
|---|---|
| var | Character. Name of the nominal variable in df. |
| connect | Logical; if TRUE, connects points with a line. |
| shade | Logical; if TRUE, shades the area under the line (requires connect = TRUE). |
| scale | Character; either "count" or "percent". |
A ggplot2 object.
    centered_dotplot(categories, "animal")
    centered_dotplot(categories, "animal", connect = TRUE)
    centered_dotplot(categories, "animal", connect = TRUE, shade = TRUE)
    centered_dotplot(mpg, "manufacturer", scale = "percent")
Computes a measure of how concentrated counts are around the center of a nominal variable, based on the centered plotting order.
    central_concentration(df, var, top_k = 3, weighted = FALSE)
| df | A data.frame or tibble containing the variable. |
|---|---|
| var | Character. Name of the nominal variable in df. |
| top_k | Numeric. Number of central categories to consider (default: 3). |
| weighted | Logical. If TRUE, applies a weight that decreases with distance from the center. |
A numeric value between 0 and 1 representing the central concentration.
    central_concentration(categories, "animal")
    central_concentration(categories2, "animal", top_k = 5)
    central_concentration(categories3, "animal", weighted = TRUE)
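One plausible reading of the unweighted measure is the share of all observations that falls in the `top_k` most central categories. Because the centered order places the most frequent categories in the middle, these are also the `top_k` most frequent. The sketch below follows that reading only. The helper name `central_share` is hypothetical, and the package's exact definition and its weighting scheme are not given in this manual.

```r
# Hypothetical sketch, unweighted case only: share of observations that falls
# in the top_k most frequent (that is, most central) categories.
central_share <- function(x, top_k = 3) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  k <- min(top_k, length(counts))   # guard against top_k > number of categories
  sum(counts[seq_len(k)]) / sum(counts)
}

x <- rep(c("a", "b", "c", "d", "e"), times = c(10, 7, 5, 3, 1))
central_share(x)             # (10 + 7 + 5) / 26
central_share(x, top_k = 5)  # every category included, so the share is 1
```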
Computes dominance for a nominal variable using the Simpson index, quantifying the degree to which a few categories dominate the distribution.
    dominance_index(df, var)
| df | A data.frame or tibble containing the nominal variable. |
|---|---|
| var | Character. Name of the nominal variable in df. |
Dominance is calculated as:
$$D = \sum_{i} p_i^2$$
where $p_i$ is the relative frequency of category $i$.
Higher values indicate stronger dominance by fewer categories.
A numeric value representing dominance.
    dominance_index(categories, "animal")
    dominance_index(categories2, "animal")
    dominance_index(categories3, "animal")
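A Simpson-type dominance value can be computed by hand in base R. This sketch assumes the common form, the sum of squared relative frequencies. The helper name `simpson_dominance` is hypothetical and the code is not taken from the package.

```r
# Hypothetical helper: Simpson-type dominance as the sum of squared
# relative frequencies (assumed form of the index).
simpson_dominance <- function(x) {
  p <- as.vector(prop.table(table(x)))  # relative frequency per category
  sum(p^2)
}

even   <- rep(c("a", "b", "c", "d"), each = 5)
skewed <- rep(c("a", "b", "c", "d"), times = c(17, 1, 1, 1))

simpson_dominance(even)    # four equally frequent categories give 1/4
simpson_dominance(skewed)  # one dominant category pushes the value toward 1
```

Under this form the minimum is 1/S for S equally frequent categories, so values from variables with different numbers of categories are not directly comparable.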
Tokenized words from The Metamorphosis by Franz Kafka. Each row represents a single word occurrence, allowing analysis of word frequency distributions and comparison with Zipf's law.
    kafka
A data frame with one column:
Character. Tokenized word from the text.
Public domain text (Franz Kafka, 1915; English translation)
Car fuel economy data (from ggplot2) for examples.
    mpg
A data frame
ggplot2::mpg
Creates a Pareto chart for a nominal variable, displaying frequencies and cumulative percentages.
    pareto(df, var, show_table = TRUE)
| df | A data.frame or tibble containing the variable. |
|---|---|
| var | Character. Name of the variable in df. |
| show_table | Logical; if TRUE, prints the frequency table. Default is TRUE. |
A ggplot2 object representing the Pareto chart.
    pareto(categories, "animal")
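The table behind a Pareto chart pairs each category's count with a running cumulative percentage. A minimal base R sketch follows. The helper name `pareto_table` and its column names are assumptions for illustration and may differ from what `pareto()` prints.

```r
# Hypothetical helper: frequency table sorted from most to least frequent,
# with the cumulative percentage that a Pareto chart draws as a line.
pareto_table <- function(x) {
  tab <- table(x)
  ord <- order(as.vector(tab), decreasing = TRUE)
  n <- as.vector(tab)[ord]
  data.frame(
    category    = names(tab)[ord],
    n           = n,
    cum_percent = 100 * cumsum(n) / sum(n)
  )
}

x <- rep(c("a", "b", "c", "d", "e"), times = c(10, 7, 5, 3, 1))
pt <- pareto_table(x)
pt
```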
Computes Pielou's evenness index based on Shannon entropy for a nominal variable recorded as individual-level observations.
    pielou_evenness(df, var)
| df | A data.frame or tibble containing the nominal variable. |
|---|---|
| var | Character string giving the name of the nominal variable in df. |
Pielou's evenness is defined as:
$$J = \frac{H}{\ln S}$$
where $H$ is Shannon entropy and $S$ is the number of observed categories.
Values range from 0 (complete dominance by one category) to 1 (perfectly even distribution).
A numeric value representing Pielou's evenness.
    pielou_evenness(categories, "animal")
    pielou_evenness(categories2, "animal")
    pielou_evenness(categories3, "animal")
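Pielou's evenness can be reproduced by hand from individual-level observations. The sketch below uses the standard definition, Shannon entropy divided by the log of the number of observed categories. The helper name `pielou` is hypothetical, and returning `NA` when only one category is observed is a choice made here, not necessarily the package's behavior.

```r
# Hypothetical helper: Pielou's evenness J = H / log(S) with natural logs.
pielou <- function(x) {
  p <- as.vector(prop.table(table(x)))
  p <- p[p > 0]                  # drop unused factor levels (0 * log(0) is NaN)
  S <- length(p)
  if (S < 2) return(NA_real_)    # assumption: undefined for one category (log(1) = 0)
  H <- -sum(p * log(p))          # Shannon entropy
  H / log(S)
}

even   <- rep(c("a", "b", "c", "d"), each = 5)
skewed <- rep(c("a", "b", "c", "d"), times = c(17, 1, 1, 1))

pielou(even)    # equal frequencies give maximal evenness
pielou(skewed)  # dominance by one category lowers the value
```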
Creates a bar plot for a nominal variable, with categories ordered from most frequent to least frequent.
    ranked_barplot(df, var, scale = c("count", "percent"), title = NULL)
| df | A data.frame or tibble containing the variable. |
|---|---|
| var | Character string giving the name of the variable in df. |
| scale | Character; either "count" or "percent". |
| title | Optional character string specifying the plot title. |
A ggplot2 object representing the ranked bar plot.
    ranked_barplot(categories, "animal")
    ranked_barplot(categories, "animal", scale = "percent")
Creates a ranked dot plot for a nominal variable, displaying category frequencies or percentages from highest to lowest. Optionally connects points with a line and shades the area under the line.
    ranked_dotplot(df, var, connect = FALSE, shade = FALSE, scale = c("count", "percent"))
| df | A data.frame or tibble containing the variable. |
|---|---|
| var | Character. Name of the nominal variable in df. |
| connect | Logical; if TRUE, connects points with a line. |
| shade | Logical; if TRUE, shades the area under the line. Default is FALSE. |
| scale | Character; either "count" or "percent". |
A ggplot2 object.
    ranked_dotplot(categories, "animal")
    ranked_dotplot(categories, "animal", connect = TRUE)
    ranked_dotplot(categories, "animal", connect = TRUE, shade = TRUE)
    ranked_dotplot(mpg, "manufacturer", scale = "percent")
Generates a rarefaction curve showing the expected number of distinct categories discovered as sampling effort increases. The curve is estimated using Monte Carlo permutations of the observation order.
    rare_plot(df, var, reps = 1000, max_effort = NULL)
| df | A data frame containing the nominal variable. |
|---|---|
| var | Character string specifying the nominal variable column. |
| reps | Number of random permutations used to estimate the curve. The default is 1000. Smaller values can be used to reduce computation time when working with large datasets, at the cost of less precise confidence intervals. |
| max_effort | Maximum sampling effort to compute. If NULL (default), the full sample size is used. For very large datasets, this argument allows users to limit the rarefaction curve to a smaller number of observations in order to explore how quickly categories accumulate and to approximate the minimum sample size required to capture most of the category diversity. |
Invisibly returns a data frame containing:
effort: sampling effort
mean: expected number of categories
lowCI: lower confidence interval
highCI: upper confidence interval
    rare_plot(categories3, "animal")
    rare_plot(ufo, "shape", reps = 25, max_effort = 500)
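The permutation idea behind a rarefaction curve can be sketched in base R: shuffle the observations, count how many distinct categories have appeared after each draw, and summarize across shuffles. The helper name `rarefy_sketch` is hypothetical, and the 2.5% and 97.5% percentile bounds are an assumption. The manual does not say how `rare_plot()` builds its confidence intervals.

```r
# Hypothetical helper: Monte Carlo rarefaction by permuting observation order.
rarefy_sketch <- function(x, reps = 200, max_effort = length(x)) {
  curves <- vapply(seq_len(reps), function(i) {
    shuffled <- sample(x)[seq_len(max_effort)]
    cumsum(!duplicated(shuffled))   # distinct categories seen after each draw
  }, integer(max_effort))
  curves <- matrix(curves, nrow = max_effort)  # rows = effort, columns = permutations
  data.frame(
    effort = seq_len(max_effort),
    mean   = rowMeans(curves),
    lowCI  = apply(curves, 1, quantile, probs = 0.025),   # assumed percentile bounds
    highCI = apply(curves, 1, quantile, probs = 0.975)
  )
}

set.seed(1)
x <- rep(c("a", "b", "c", "d", "e"), times = c(10, 7, 5, 3, 1))
out <- rarefy_sketch(x, reps = 50)
head(out)
```

Every permutation starts at one category and ends at the full category count, so the first and last rows of `out` have no spread. The informative part of the curve is how quickly the mean rises in between.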
Computes the multinomial log-likelihood of observed counts against four theoretical distributions (uniform, triangular, normal-like, and exponential/Pareto-like) and returns AIC and DeltaAIC values.
    shape_aic(df, var, rate_exp = 0.7, eps = 1e-12)
| df | A data.frame or tibble containing the nominal variable. |
|---|---|
| var | Character string giving the name of the nominal variable in df. |
| rate_exp | Numeric. Default exponential rate. Used only if the tail is not clearly exponential. |
| eps | Small numeric value added to probabilities to avoid log(0). Default is 1e-12. |
A data.frame with columns: Shape, AIC, DeltaAIC.
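The comparison can be illustrated with a small base R sketch that scores observed counts against candidate probability vectors using the multinomial log-likelihood and AIC = 2k - 2 log L. Everything specific here is an assumption for illustration: the helper name `multinom_loglik`, the use of only two shapes defined over ranks, and k = 0 free parameters (shapes fixed in advance). The package's parameterization may differ.

```r
# Hypothetical helper: multinomial log-likelihood of counts under probabilities
# probs (the multinomial coefficient is constant across shapes and omitted).
multinom_loglik <- function(counts, probs, eps = 1e-12) {
  sum(counts * log(probs + eps))
}

counts <- c(40, 22, 12, 7, 4, 2)   # illustrative counts, already ranked
ranks  <- seq_along(counts)

# Two illustrative candidate shapes (assumed parameterization over ranks)
shapes <- list(
  uniform     = rep(1 / length(counts), length(counts)),
  exponential = exp(-0.7 * ranks) / sum(exp(-0.7 * ranks))
)

k <- 0  # assumption: shapes are fixed in advance, so no free parameters
aic <- sapply(shapes, function(p) 2 * k - 2 * multinom_loglik(counts, p))
result <- data.frame(
  Shape    = names(aic),
  AIC      = unname(aic),
  DeltaAIC = unname(aic - min(aic))
)
result[order(result$AIC), ]  # the best-supported shape has DeltaAIC = 0
```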
Plots a centered dotplot of a nominal variable and overlays four theoretical distributions: uniform, triangular, exponential (Pareto-like), and normal-like.
    shape_comp_plot(df, var, rate_exp = 0.7, scale = c("count", "percent"))
| df | A data.frame or tibble containing the nominal variable. |
|---|---|
| var | Character string giving the name of the nominal variable in df. |
| rate_exp | Numeric. Rate parameter for the exponential distribution (Pareto-like). Default is 0.7. |
| scale | Character. Whether to scale frequencies as counts ("count") or percentages ("percent"). Default is "count". |
The function orders categories from most frequent at the center outwards. Observed frequencies are plotted as points and lines, and each theoretical distribution is overlaid with a different color and line type.
A ggplot2 object.
    shape_comp_plot(categories, "animal")
    shape_comp_plot(categories2, "animal")
    shape_comp_plot(categories3, "animal")
Character info from Star Wars (from dplyr/ggplot2 examples)
    starwars
A data frame
dplyr::starwars
Computes the proportion of categories contributing to the lower part of the distribution. Useful to quantify long-tail structure in nominal distributions.
    tail_index(df, var, threshold = 0.8)
| df | A data.frame or tibble containing the variable. |
|---|---|
| var | Character. Name of the nominal variable in df. |
| threshold | Numeric. Cumulative proportion of counts defining the "dominant" categories (default 0.8). |
Numeric between 0 and 1 representing the tail proportion.
    tail_index(categories3, "animal")
    tail_index(categories2, "animal", threshold = 0.9)
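One way to read this measure: rank categories by frequency, find how many are needed to reach the cumulative `threshold`, and report the remaining categories as a fraction of all categories. The sketch below implements that reading. The helper name `tail_share` is hypothetical, and the package's exact handling of the threshold boundary is not documented here.

```r
# Hypothetical helper: fraction of categories that lie beyond the point where
# the cumulative proportion of counts first reaches `threshold`.
tail_share <- function(x, threshold = 0.8) {
  p <- sort(as.vector(prop.table(table(x))), decreasing = TRUE)
  n_dominant <- which(cumsum(p) >= threshold)[1]  # categories needed to reach threshold
  (length(p) - n_dominant) / length(p)
}

x <- rep(c("a", "b", "c", "d", "e"), times = c(10, 7, 5, 3, 1))
tail_share(x)                   # three categories cover 80%, leaving 2 of 5
tail_share(x, threshold = 0.9)  # four categories cover 90%, leaving 1 of 5
```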
Sample data of UFO sightings used for examples.
A large real-world dataset of UFO sighting reports collected by the National UFO Reporting Center (NUFORC), a non-profit organization dedicated to the collection and dissemination of objective UFO data.
    ufo
A data frame with 8 columns:
Character. City where the UFO was reported.
Character. Description or comments about the sighting.
Date. Date of the sighting.
Numeric. Duration of the sighting in seconds.
Numeric. Latitude of the sighting location.
Numeric. Longitude of the sighting location.
Factor. Shape of the UFO observed. This is the nominal variable of interest.
Character. State of the sighting location.
A data frame with 63,561 rows and 8 variables:
Character. Date of the UFO sighting (YYYY-MM-DD).
Numeric. Latitude of the sighting location.
Numeric. Longitude of the sighting location.
Character. City where the sighting occurred.
Character. State or region of the sighting.
Character. Reported shape of the UFO (nominal variable of interest).
Numeric. Duration of the sighting in seconds.
Character. Free-text comments describing the sighting.
The dataset contains over 63,000 reported sightings spanning several decades and includes information on sighting date, geographic location, duration, narrative comments, and, most importantly for nomiShape, the reported shape of the observed object.
The shape variable is a nominal variable with many categories
(e.g., "light", "circle", "triangle", "sphere"), exhibiting strong
dominance by a few common shapes followed by a gradual decline across
rarer categories. Despite the presence of a highly frequent leading
category ("light"), the overall frequency structure is better described
as triangular or normal-like rather than strictly exponential or Pareto.
This dataset is included as a realistic, large-sample example for exploring dominance, modality, and shape classification of nominal distributions using visual and information-theoretic tools.
Example dataset
National UFO Reporting Center (NUFORC), https://nuforc.org
    ufo
    # Centered bar plot highlighting dominance and symmetry
    centered_barplot(ufo, "shape")
    # Centered dot plot with connections and shading
    centered_dotplot(ufo, "shape", connect = TRUE, shade = TRUE)
    # Shape comparison plot
    shape_comp_plot(ufo, "shape")
    # AIC-based shape classification
    shape_aic(ufo, "shape")
Generates a rank-frequency plot comparing observed category frequencies with the expected Zipf distribution (inverse rank relationship).
    zipf_rank_plot(df, var, max_rank = NULL, top_prop = NULL, loglog = FALSE)
| df | A data frame containing the nominal variable. |
|---|---|
| var | Character string specifying the nominal variable column. |
| max_rank | Maximum number of ranks to display. If NULL (default), all ranks are shown. |
| top_prop | Proportion of total observations to retain (0–1). If set, only the most frequent categories accounting for this cumulative proportion are displayed. Overrides max_rank. |
| loglog | Logical. If TRUE, both axes are displayed on a log10 scale. |
Invisibly returns a data frame with rank-frequency information.
    zipf_rank_plot(kafka, "word")
    zipf_rank_plot(alice, "word", loglog = TRUE)
    zipf_rank_plot(alice, "word", max_rank = 250)
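The inverse-rank expectation can be tabulated in base R. Under Zipf's law the frequency at rank r is proportional to 1/r. The sketch below scales that curve so the expected counts sum to the observed total, which is one possible normalization and an assumption here. The helper name `zipf_table` is hypothetical.

```r
# Hypothetical helper: observed rank-frequency table next to the counts
# expected under an inverse-rank (Zipf) relationship.
zipf_table <- function(x) {
  observed <- sort(as.vector(table(x)), decreasing = TRUE)
  rank <- seq_along(observed)
  # Assumed normalization: expected counts sum to the observed total
  expected <- sum(observed) * (1 / rank) / sum(1 / rank)
  data.frame(rank, observed, expected)
}

# Toy counts chosen to follow 12 / rank exactly
words <- rep(c("the", "of", "and", "to"), times = c(12, 6, 4, 3))
zt <- zipf_table(words)
zt
```

On a log-log scale a Zipf-like distribution appears as a roughly straight line with slope near -1, which is why plotting both axes on a log10 scale is a common choice for word-frequency data.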