The paneldesc package
provides a comprehensive set of tools for analyzing panel (longitudinal)
data. It helps you explore the structure of your panel, examine missing
value patterns, decompose numeric variables into between‑ and
within‑entity components, and analyze transitions in categorical
variables. The package is designed to work seamlessly with data frames
that have been marked with panel structure using
make_panel(), reducing repetitive specification of entity
and time identifiers.
This vignette walks you through the basic workflow using the built‑in production dataset, a simulated unbalanced panel of firms over six years.
For a comprehensive guide with detailed examples, case studies, and extended tutorials, please visit the package web-book: https://dtereshch.github.io/paneldesc-guides/.
If you haven’t installed the package yet, you can install the stable version from CRAN.
Alternatively, you can install the development version from GitHub.
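The installation commands might look like the sketch below. The CRAN call uses the package name given in this vignette. The GitHub repository path is an assumption based on the web-book URL, so it is left commented out and should be checked before use.

```r
# Stable release from CRAN
install.packages("paneldesc")

# Development version from GitHub.
# ASSUMPTION: the repository path "dtereshch/paneldesc" is a guess, not taken
# from this vignette; verify it before running.
# install.packages("remotes")
# remotes::install_github("dtereshch/paneldesc")
```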
The package includes a simulated dataset called
production. It contains information on 30 firms over up to
6 years, with variables such as sales,
capital, labor, industry, and
ownership. Missing values are present in some variables to
mimic real‑world data.
To avoid repeatedly specifying the entity and time variables (firm
and year), we create a panel_data object using
make_panel(). This adds metadata that many subsequent
functions will automatically use.
The first group of functions is designed to analyze the structure of the panel.
describe_dimensions() returns the number of rows,
distinct entities, distinct time periods, and substantive variables.
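To make these four quantities concrete, here is a minimal base R sketch on a hypothetical toy panel (the data frame `toy` and its columns are made up for illustration). It is not the package's implementation, only a way to see what is being counted.

```r
# Toy unbalanced panel (hypothetical data, not the production dataset)
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 3),
  year  = c(1, 2, 3, 1, 2, 3),
  sales = c(10, 12, NA, 20, 22, 30)
)

dims <- c(
  rows      = nrow(toy),
  entities  = length(unique(toy$firm)),
  periods   = length(unique(toy$year)),
  variables = ncol(toy) - 2  # columns other than the two identifiers
)
dims
```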
describe_periods() shows, for each time period, how many
entities have non‑missing data in at least one substantive variable,
along with their share in the total number of entities.
describe_periods(panel)
#> year count share
#> 1 1 25 0.833
#> 2 2 28 0.933
#> 3 3 30 1.000
#> 4 4 29 0.967
#> 5 5 26 0.867
#> 6 6 19 0.633
describe_balance() provides summary statistics for the
distribution of entities per period and periods per entity.
describe_balance(panel)
#> dimension mean std min max
#> 1 entities 26.167 3.971 19 30
#> 2 periods 5.233 0.935 3 6
plot_periods() creates a histogram of the number of time
periods covered by each entity.
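The quantity behind such a histogram can be sketched in base R as follows. The toy data frame is hypothetical, and the plot uses base `hist()` rather than whatever plot_periods() does internally.

```r
# Toy unbalanced panel (hypothetical data)
toy <- data.frame(
  firm = c(1, 1, 1, 2, 2, 3),
  year = c(1, 2, 3, 1, 2, 3)
)

# Number of distinct time periods observed for each entity
periods_per_entity <- tapply(toy$year, toy$firm, function(y) length(unique(y)))
periods_per_entity

# One bar per possible number of periods
hist(
  periods_per_entity,
  breaks = seq(0.5, max(periods_per_entity) + 0.5, by = 1),
  main = "Periods covered by each entity",
  xlab = "Number of periods"
)
```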
describe_patterns() tabulates the distinct patterns of
presence/absence across time (e.g., which entities appear in which
years).
describe_patterns(panel)
#> pattern 1 2 3 4 5 6 count share
#> 1 1 1 1 1 1 1 1 16 0.533
#> 2 2 1 1 1 1 1 0 5 0.167
#> 3 3 1 1 1 1 0 0 3 0.100
#> 4 4 0 0 1 1 1 1 2 0.067
#> 5 5 0 1 1 1 1 0 2 0.067
#> 6 6 0 1 1 1 1 1 1 0.033
#> 7 7 1 1 1 0 0 0 1 0.033
You can also visualize these patterns with a heatmap using
plot_patterns().
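The idea of a presence pattern can be illustrated with a short base R sketch: build an entity-by-period matrix of 0/1 indicators, collapse each row into a string, and tabulate the strings. The toy data are hypothetical, and this is not how the package computes its table.

```r
# Toy unbalanced panel (hypothetical data)
toy <- data.frame(
  firm = c(1, 1, 1, 2, 2, 3, 4, 4),
  year = c(1, 2, 3, 1, 2, 3, 1, 2)
)

# Entity-by-period presence matrix: 1 if the firm is observed in that year
presence <- 1 * (unclass(table(toy$firm, toy$year)) > 0)

# Collapse each entity's row into a pattern string such as "110"
pattern <- apply(presence, 1, paste, collapse = "")

# Count how many entities share each pattern, most common first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```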
The second group of functions is aimed at analyzing missing values, taking into account the nature of panel data.
plot_missing() creates a heatmap showing the number of
missing values for each variable across all time periods. Darker cells
indicate more missing values.
summarize_missing() returns a table with overall missing
counts, shares, and the number of entities and periods affected per
variable.
summarize_missing(panel)
#> Analyzing all variables: sales, capital, labor, industry, ownership, region
#> variable na_count na_share entities periods
#> 1 sales 26 0.144 15 6
#> 2 capital 26 0.144 17 6
#> 3 labor 26 0.144 15 6
#> 4 industry 23 0.128 14 5
#> 5 ownership 23 0.128 14 5
#> 6 region 23 0.128 14 5
describe_incomplete() lists entities that have at least
one missing value, with details on which variables are incomplete.
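A rough base R equivalent of this idea is sketched below: count missing values per entity and variable, then keep the entities with at least one. The toy data and column names are hypothetical, and the layout of the package's actual output may differ.

```r
# Toy panel with some missing values (hypothetical data)
toy <- data.frame(
  firm  = c(1, 1, 2, 2, 3, 3),
  year  = c(1, 2, 1, 2, 1, 2),
  sales = c(10, NA, 20, 22, 30, 31),
  labor = c(5, 6, NA, NA, 7, 8)
)
vars <- c("sales", "labor")

# Number of missing values per entity and variable
na_by_firm <- aggregate(is.na(toy[vars]), by = list(firm = toy$firm), FUN = sum)

# Keep only entities with at least one missing value
incomplete <- na_by_firm[rowSums(na_by_firm[vars]) > 0, ]
incomplete
```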
The third group of functions is aimed at analyzing numeric variables, taking into account the nature of panel data.
summarize_numeric() calculates basic statistics (count,
mean, std, min, max) for numeric variables.
summarize_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#> variable count mean std min max
#> 1 sales 154 68.402 45.025 11.999 292.850
#> 2 capital 154 33.152 32.044 2.030 160.085
#> 3 labor 154 76.883 74.150 5.972 579.024
You can optionally group by another variable, which does not
necessarily have to be a panel identifier. Here we use
year.
summarize_numeric(panel, group = "year")
#> Analyzing all numeric variables: sales, capital, labor
#> year variable count mean std min max
#> 1 1 sales 25 72.304 43.040 11.999 192.591
#> 2 1 capital 25 40.154 39.004 3.206 148.942
#> 3 1 labor 24 74.349 69.766 18.414 333.377
#> 4 2 sales 28 67.557 35.056 20.375 150.742
#> 5 2 capital 28 30.210 30.596 5.032 122.332
#> 6 2 labor 28 81.758 66.571 17.666 327.581
#> 7 3 sales 29 72.089 64.525 17.320 292.850
#> 8 3 capital 28 30.111 30.372 2.030 121.740
#> 9 3 labor 29 82.256 106.869 8.172 579.024
#> 10 4 sales 27 56.859 31.707 12.000 136.687
#> 11 4 capital 28 34.889 32.046 7.146 148.994
#> 12 4 labor 29 55.533 46.448 5.972 207.572
#> 13 5 sales 26 62.159 41.237 19.773 189.069
#> 14 5 capital 26 28.076 21.871 4.637 78.244
#> 15 5 labor 25 74.056 71.255 15.879 266.829
#> 16 6 sales 19 83.835 45.564 25.986 191.585
#> 17 6 capital 19 37.145 39.400 3.690 160.085
#> 18 6 labor 19 101.003 67.261 18.669 259.851
plot_heterogeneity() visualizes the distribution of a
numeric variable across groups. We use select = "sales" to
look at sales, and the function automatically uses the
entity and time variables as groups because panel has panel
attributes.
decompose_numeric() splits the total variance of numeric
variables into between‑entity and within‑entity components.
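The decomposition can be sketched in base R under one common convention, which is assumed here and may not match the package's exact formulas: the between component is the set of entity means, and the within component is each observation's deviation from its entity mean, re-centred on the overall mean. The toy data are hypothetical.

```r
# Toy balanced panel (hypothetical data)
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2),
  sales = c(10, 12, 14, 20, 25, 30)
)

overall_mean <- mean(toy$sales)
entity_mean  <- ave(toy$sales, toy$firm)  # each row gets its firm's mean

# Between component: one value per entity (the entity mean)
between <- tapply(toy$sales, toy$firm, mean)

# Within component: deviation from the entity mean, shifted by the overall mean
within <- toy$sales - entity_mean + overall_mean

c(
  overall = sd(toy$sales),
  between = sd(between),
  within  = sd(within)
)
```

Because the within component is re-centred on the overall mean, its minimum can be negative even when the variable itself is strictly positive.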
decompose_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#> variable dimension mean std min max count
#> 1 sales overall 68.402 45.025 11.999 292.850 154.000
#> 2 sales between NA 29.060 34.263 166.364 30.000
#> 3 sales within NA 34.127 -12.444 234.297 5.133
#> 4 capital overall 33.152 32.044 2.030 160.085 154.000
#> 5 capital between NA 17.414 9.019 74.225 30.000
#> 6 capital within NA 27.072 -25.567 149.329 5.133
#> 7 labor overall 76.883 74.150 5.972 579.024 154.000
#> 8 labor between NA 41.068 31.021 190.645 30.000
#> 9 labor within NA 61.202 -58.040 483.217 5.133
The last group of functions is aimed at analyzing factor (categorical) variables, taking into account the nature of panel data.
decompose_factor() breaks down the overall frequency of
each category into between‑entity (how many entities ever have that
category) and within‑entity (average share of time an entity spends in
that category) components.
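These three components can be sketched in base R as follows. The sketch assumes that the within share is averaged only over entities that are ever observed in the category; this is an assumption about the definition, not a statement of how the package computes it. The toy data are hypothetical.

```r
# Toy panel with one categorical variable (hypothetical data)
toy <- data.frame(
  firm = c(1, 1, 1, 1, 2, 2, 3, 3),
  ownership = c("private", "private", "public", "public",
                "private", "private", "public", "public")
)

# Firm-by-category counts of observations
tab <- unclass(table(toy$firm, toy$ownership))

count_overall <- colSums(tab)      # observations in each category
count_between <- colSums(tab > 0)  # entities ever observed in each category

# Share of each entity's periods spent in each category
row_share <- tab / rowSums(tab)

# Average of those shares over entities ever in the category
# (entities never in the category contribute zero to the column sum)
share_within <- colSums(row_share) / count_between

rbind(count_overall, count_between, share_within)
```

A within share of 1 for every category means that no entity ever changes its category over time.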
decompose_factor(panel)
#> Analyzing all factor variables: industry, ownership, region
#> variable category count_overall share_overall count_between share_between
#> 1 industry Industry 1 63 0.401 13 0.433
#> 2 industry Industry 2 45 0.287 11 0.367
#> 3 industry Industry 3 49 0.312 10 0.333
#> 4 ownership private 80 0.510 17 0.567
#> 5 ownership public 36 0.229 9 0.300
#> 6 ownership mixed 41 0.261 10 0.333
#> 7 region west 38 0.242 7 0.233
#> 8 region east 40 0.255 8 0.267
#> 9 region north 36 0.229 7 0.233
#> 10 region south 43 0.274 8 0.267
#> share_within
#> 1 0.918
#> 2 0.809
#> 3 0.917
#> 4 0.894
#> 5 0.787
#> 6 0.772
#> 7 1.000
#> 8 1.000
#> 9 1.000
#> 10 1.000
summarize_transition() computes transition counts and
shares between states of a factor variable over consecutive time
periods. Here we analyze transitions in ownership.
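The logic of a transition table can be sketched in base R: sort by entity and time, pair each observation with the next one, and keep only pairs from the same entity in consecutive periods. The toy data are hypothetical, and the handling of gaps and missing values in the package may differ from this sketch.

```r
# Toy panel with a categorical variable (hypothetical data)
toy <- data.frame(
  firm = c(1, 1, 1, 2, 2, 2),
  year = c(1, 2, 3, 1, 2, 3),
  ownership = c("private", "private", "public",
                "public", "public", "private")
)
toy <- toy[order(toy$firm, toy$year), ]
n <- nrow(toy)

# Pair each row with the following row; keep pairs within the same firm
# whose years are exactly one period apart
same_firm   <- toy$firm[-n] == toy$firm[-1]
consecutive <- toy$year[-1] - toy$year[-n] == 1
keep <- same_firm & consecutive

# Transition counts (rows: state at t, columns: state at t + 1)
counts <- table(
  from = toy$ownership[-n][keep],
  to   = toy$ownership[-1][keep]
)

# Row shares: distribution of next-period states given the current state
shares <- prop.table(counts, margin = 1)
counts
shares
```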