Getting Started with paneldesc

The paneldesc package provides tools for descriptive analysis of panel (longitudinal) data. It helps you explore the structure of your panel, examine missing value patterns, decompose numeric variables into between‑ and within‑entity components, and analyze transitions in categorical variables. Its functions work with data frames that have been marked with panel structure using make_panel(), so you do not have to specify the entity and time identifiers in every call.

This vignette walks you through the basic workflow using the built‑in production dataset, a simulated unbalanced panel of firms over six years.

For detailed examples, case studies, and extended tutorials, see the package web book: https://dtereshch.github.io/paneldesc-guides/.

Installation

If you haven’t installed the package yet, you can get the stable version from CRAN.

install.packages("paneldesc")

Or you can install the development version from GitHub.

# install.packages("devtools")
devtools::install_github("dtereshch/paneldesc")

Loading the package

Attach the package before running the examples below.

library(paneldesc)

Data import

The package includes a simulated dataset called production. It contains information on 30 firms over up to 6 years, with variables such as sales, capital, labor, industry, and ownership. Missing values are present in some variables to mimic real‑world data.

data(production)

To avoid repeatedly specifying the entity and time variables (firm and year), we create a panel_data object using make_panel(). This adds metadata that many subsequent functions will automatically use.

panel <- make_panel(production, index = c("firm", "year"))

Panel data structure analysis

The first group of functions is designed to analyze the structure of the panel.

describe_dimensions() returns the number of rows, distinct entities, distinct time periods, and substantive variables (here, the six columns other than the firm and year identifiers).

describe_dimensions(panel)
#>   rows entities periods variables
#> 1  180       30       6         6

describe_periods() shows, for each time period, how many entities have non‑missing data in any substantive variable, along with their share in the total number of entities.

describe_periods(panel)
#>   year count share
#> 1    1    25 0.833
#> 2    2    28 0.933
#> 3    3    30 1.000
#> 4    4    29 0.967
#> 5    5    26 0.867
#> 6    6    19 0.633

describe_balance() provides summary statistics for the distribution of entities per period and periods per entity.

describe_balance(panel)
#>   dimension   mean   std min max
#> 1  entities 26.167 3.971  19  30
#> 2   periods  5.233 0.935   3   6
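
To make the two tables above concrete, here is a minimal base R sketch of the same counts on a hypothetical toy panel (the toy data frame and all object names are illustrative assumptions, not part of paneldesc). It treats a row as present when its single substantive variable is non-missing, and it does not claim to reproduce the package's internal implementation.

```r
# Hypothetical toy panel: 3 firms, 3 years, one substantive variable
toy <- data.frame(
  firm  = rep(1:3, each = 3),
  year  = rep(1:3, times = 3),
  sales = c(10, 12, 14, NA, 20, 22, 30, NA, NA)
)
present <- !is.na(toy$sales)

# Entities observed in each period, and their share of all entities
count <- tapply(present, toy$year, sum)
per_period <- data.frame(
  year  = as.integer(names(count)),
  count = as.vector(count),
  share = round(as.vector(count) / length(unique(toy$firm)), 3)
)

# Periods observed for each entity
per_entity <- tapply(present, toy$firm, sum)

# Summary of both distributions (sd() is the sample standard deviation)
balance <- data.frame(
  dimension = c("entities", "periods"),
  mean = c(mean(count), mean(per_entity)),
  std  = c(sd(count), sd(per_entity)),
  min  = c(min(count), min(per_entity)),
  max  = c(max(count), max(per_entity))
)
print(per_period)
print(balance)
```

The sample standard deviation is consistent with the std column shown above: applying sd() to the six period counts 25, 28, 30, 29, 26, 19 gives 3.971.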

plot_periods() creates a histogram of the number of time periods covered by each entity.

plot_periods(panel)

describe_patterns() tabulates the distinct patterns of presence/absence across time (e.g., which entities appear in which years).

describe_patterns(panel)
#>   pattern 1 2 3 4 5 6 count share
#> 1       1 1 1 1 1 1 1    16 0.533
#> 2       2 1 1 1 1 1 0     5 0.167
#> 3       3 1 1 1 1 0 0     3 0.100
#> 4       4 0 0 1 1 1 1     2 0.067
#> 5       5 0 1 1 1 1 0     2 0.067
#> 6       6 0 1 1 1 1 1     1 0.033
#> 7       7 1 1 1 0 0 0     1 0.033
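
The idea behind the pattern table can be sketched in base R: build an entity-by-period presence matrix, collapse each row into a string, and tabulate the strings. The toy data and object names below are hypothetical, and the sketch is only an illustration of the technique, not the package's code.

```r
# Hypothetical toy panel: 4 firms, 3 years
toy <- data.frame(
  firm  = rep(1:4, each = 3),
  year  = rep(1:3, times = 4),
  sales = c(10, 12, 14,  NA, 20, 22,  30, 31, 32,  40, 41, 42)
)

# Entity-by-period presence matrix (1 = observed, 0 = missing)
presence <- tapply(!is.na(toy$sales), list(toy$firm, toy$year),
                   function(x) as.integer(any(x)))

# Collapse each entity's row into a pattern string such as "011"
pattern <- apply(presence, 1, paste, collapse = "")

# Tabulate distinct patterns, most frequent first
tab <- sort(table(pattern), decreasing = TRUE)
patterns <- data.frame(
  pattern = names(tab),
  count   = as.vector(tab),
  share   = round(as.vector(tab) / nrow(presence), 3)
)
print(patterns)
```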

You can also visualize these patterns with a heatmap using plot_patterns().

plot_patterns(panel)

Missing values analysis

The second group of functions is aimed at analyzing missing values, taking into account the nature of panel data.

plot_missing() creates a heatmap showing the number of missing values for each variable in each time period. Darker cells indicate more missing values.

plot_missing(panel)
#> Analysing all variables: sales, capital, labor, industry, ownership, region

summarize_missing() returns a table with overall missing counts, shares, and the number of entities and periods affected per variable.

summarize_missing(panel)
#> Analyzing all variables: sales, capital, labor, industry, ownership, region
#>    variable na_count na_share entities periods
#> 1     sales       26    0.144       15       6
#> 2   capital       26    0.144       17       6
#> 3     labor       26    0.144       15       6
#> 4  industry       23    0.128       14       5
#> 5 ownership       23    0.128       14       5
#> 6    region       23    0.128       14       5
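
As a rough illustration of what each column in this table measures, the following base R sketch computes the same quantities on a hypothetical toy panel. The data and object names are assumptions made for the example; the package may compute its summary differently.

```r
# Hypothetical toy panel: 3 firms, 2 years, two substantive variables
toy <- data.frame(
  firm    = rep(1:3, each = 2),
  year    = rep(1:2, times = 3),
  sales   = c(10, NA, 20, 21, NA, NA),
  capital = c(5, 6, NA, 8, 9, 10)
)
vars <- c("sales", "capital")

missing_summary <- do.call(rbind, lapply(vars, function(v) {
  na <- is.na(toy[[v]])
  data.frame(
    variable = v,
    na_count = sum(na),                        # missing cells
    na_share = round(mean(na), 3),             # share of all rows
    entities = length(unique(toy$firm[na])),   # firms with any NA
    periods  = length(unique(toy$year[na]))    # years with any NA
  )
}))
print(missing_summary)
```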

describe_incomplete() lists entities that have at least one missing value, along with the number of missing values and the number of variables affected for each entity.

describe_incomplete(panel)
#>    firm na_count variables
#> 1    23       18         6
#> 2     6       13         6
#> 3     7       13         6
#> 4     1       12         6
#> 5     2       12         6
#> 6    12       12         6
#> 7    21       12         6
#> 8    26       12         6
#> 9    25        7         6
#> 10   30        7         6
#> 11    4        6         6
#> 12   13        6         6
#> 13   17        6         6
#> 14   29        6         6
#> 15   14        2         2
#> 16   10        1         1
#> 17   22        1         1
#> 18   27        1         1
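
A per-entity view like this one can be approximated in base R by counting missing cells per firm and the number of variables with at least one missing value. The sketch below uses hypothetical toy data and illustrative names; it is not taken from the package source.

```r
# Hypothetical toy panel: 3 firms, 2 years, two substantive variables
toy <- data.frame(
  firm    = rep(1:3, each = 2),
  year    = rep(1:2, times = 3),
  sales   = c(10, NA, 20, 21, NA, NA),
  capital = c(5, 6, 7, 8, NA, 10)
)
vars   <- c("sales", "capital")
na_mat <- is.na(toy[vars])   # logical matrix: rows by variables

# Missing cells per firm
na_count <- tapply(rowSums(na_mat), toy$firm, sum)

# Variables with at least one missing value, per firm
variables <- sapply(split(as.data.frame(na_mat), toy$firm),
                    function(d) sum(colSums(d) > 0))

incomplete <- data.frame(
  firm      = as.integer(names(na_count)),
  na_count  = as.vector(na_count),
  variables = as.vector(variables)
)
# Keep only incomplete firms, most missing values first
incomplete <- incomplete[incomplete$na_count > 0, ]
incomplete <- incomplete[order(-incomplete$na_count), ]
print(incomplete)
```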

Numeric variables analysis

The third group of functions is aimed at analyzing numeric variables, taking into account the nature of panel data.

summarize_numeric() calculates basic statistics (count, mean, std, min, max) for numeric variables.

summarize_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#>   variable count   mean    std    min     max
#> 1    sales   154 68.402 45.025 11.999 292.850
#> 2  capital   154 33.152 32.044  2.030 160.085
#> 3    labor   154 76.883 74.150  5.972 579.024

You can optionally group by another variable, which does not necessarily have to be a panel identifier. Here we use year.

summarize_numeric(panel, group = "year")
#> Analyzing all numeric variables: sales, capital, labor
#>    year variable count    mean     std    min     max
#> 1     1    sales    25  72.304  43.040 11.999 192.591
#> 2     1  capital    25  40.154  39.004  3.206 148.942
#> 3     1    labor    24  74.349  69.766 18.414 333.377
#> 4     2    sales    28  67.557  35.056 20.375 150.742
#> 5     2  capital    28  30.210  30.596  5.032 122.332
#> 6     2    labor    28  81.758  66.571 17.666 327.581
#> 7     3    sales    29  72.089  64.525 17.320 292.850
#> 8     3  capital    28  30.111  30.372  2.030 121.740
#> 9     3    labor    29  82.256 106.869  8.172 579.024
#> 10    4    sales    27  56.859  31.707 12.000 136.687
#> 11    4  capital    28  34.889  32.046  7.146 148.994
#> 12    4    labor    29  55.533  46.448  5.972 207.572
#> 13    5    sales    26  62.159  41.237 19.773 189.069
#> 14    5  capital    26  28.076  21.871  4.637  78.244
#> 15    5    labor    25  74.056  71.255 15.879 266.829
#> 16    6    sales    19  83.835  45.564 25.986 191.585
#> 17    6  capital    19  37.145  39.400  3.690 160.085
#> 18    6    labor    19 101.003  67.261 18.669 259.851

plot_heterogeneity() visualizes the distribution of a numeric variable across groups. We use select = "sales" to look at sales. Because panel carries the panel metadata added by make_panel(), the function automatically uses the entity and time variables as groups.

plot_heterogeneity(panel, select = "sales")

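decompose_numeric() decomposes the variation of numeric variables into between‑entity and within‑entity components and reports the standard deviation, minimum, and maximum of each. In the count column, the between row gives the number of entities and the within row gives the average number of observations per entity.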

decompose_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#>   variable dimension   mean    std     min     max   count
#> 1    sales   overall 68.402 45.025  11.999 292.850 154.000
#> 2    sales   between     NA 29.060  34.263 166.364  30.000
#> 3    sales    within     NA 34.127 -12.444 234.297   5.133
#> 4  capital   overall 33.152 32.044   2.030 160.085 154.000
#> 5  capital   between     NA 17.414   9.019  74.225  30.000
#> 6  capital    within     NA 27.072 -25.567 149.329   5.133
#> 7    labor   overall 76.883 74.150   5.972 579.024 154.000
#> 8    labor   between     NA 41.068  31.021 190.645  30.000
#> 9    labor    within     NA 61.202 -58.040 483.217   5.133
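
The negative within minimums above are consistent with the convention used by Stata's xtsum, where the within component is the deviation from the entity mean recentred on the grand mean. The base R sketch below applies that convention to a hypothetical toy panel; that paneldesc computes its components in exactly this way is an assumption, and all object names are illustrative.

```r
# Hypothetical toy panel: 3 firms, 3 years
toy <- data.frame(
  firm  = rep(1:3, each = 3),
  year  = rep(1:3, times = 3),
  sales = c(10, 12, 14, 20, 22, 24, 30, 33, 36)
)

grand_mean  <- mean(toy$sales)
entity_mean <- ave(toy$sales, toy$firm)   # firm mean repeated on each row

# Between: one value per entity (its mean over time)
between <- tapply(toy$sales, toy$firm, mean)

# Within: deviation from the entity mean, recentred on the grand mean
# (xtsum-style convention, assumed here)
within <- toy$sales - entity_mean + grand_mean

decomp <- data.frame(
  dimension = c("overall", "between", "within"),
  std = c(sd(toy$sales), sd(between), sd(within)),
  min = c(min(toy$sales), min(between), min(within)),
  max = c(max(toy$sales), max(between), max(within))
)
print(decomp)
```

Recentring on the grand mean keeps the within values on the same scale as the original variable, which is why a within minimum can fall below zero even when every observation is positive.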

Factor variables analysis

The last group of functions is aimed at analyzing factor (categorical) variables, taking into account the nature of panel data.

decompose_factor() breaks down the overall frequency of each category into between‑entity (how many entities ever have that category) and within‑entity (average share of time an entity spends in that category) components.

decompose_factor(panel)
#> Analyzing all factor variables: industry, ownership, region
#>     variable   category count_overall share_overall count_between share_between
#> 1   industry Industry 1            63         0.401            13         0.433
#> 2   industry Industry 2            45         0.287            11         0.367
#> 3   industry Industry 3            49         0.312            10         0.333
#> 4  ownership    private            80         0.510            17         0.567
#> 5  ownership     public            36         0.229             9         0.300
#> 6  ownership      mixed            41         0.261            10         0.333
#> 7     region       west            38         0.242             7         0.233
#> 8     region       east            40         0.255             8         0.267
#> 9     region      north            36         0.229             7         0.233
#> 10    region      south            43         0.274             8         0.267
#>    share_within
#> 1         0.918
#> 2         0.809
#> 3         0.917
#> 4         0.894
#> 5         0.787
#> 6         0.772
#> 7         1.000
#> 8         1.000
#> 9         1.000
#> 10        1.000
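
The three components can be sketched in base R as follows, using the definitions given above: overall counts observations, between counts entities that ever have the category, and within averages the share of time in the category over those entities. The toy data and names are hypothetical, and averaging only over entities that ever have the category is an assumption borrowed from Stata's xttab.

```r
# Hypothetical toy panel: 3 firms, 4 years
toy <- data.frame(
  firm      = rep(1:3, each = 4),
  year      = rep(1:4, times = 3),
  ownership = c("private", "private", "private", "public",
                "public",  "public",  "public",  "public",
                "private", "private", "private", "private")
)
n_entities <- length(unique(toy$firm))

factor_decomp <- do.call(rbind, lapply(unique(toy$ownership), function(k) {
  is_k <- toy$ownership == k
  share_by_firm <- tapply(is_k, toy$firm, mean)  # time share in k, per firm
  ever <- share_by_firm > 0                      # firms ever in k
  data.frame(
    category      = k,
    count_overall = sum(is_k),
    share_overall = round(mean(is_k), 3),
    count_between = sum(ever),
    share_between = round(sum(ever) / n_entities, 3),
    share_within  = round(mean(share_by_firm[ever]), 3)
  )
}))
print(factor_decomp)
```

A within share of 1, as for region in the output above, means the category never changes within an entity.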

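summarize_transition() computes transition counts and shares between states of a factor variable over consecutive time periods. In the output below, each row shows the shares of transitions from the state named in from_to to each state in the next period. Here we analyze transitions in ownership.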

summarize_transition(panel, select = "ownership")
#> 23 rows with NA values in 'ownership' removed.
#>   from_to private public mixed
#> 1 private   0.939  0.030  0.03
#> 2  public   0.016  0.984  0.00
#> 3   mixed   0.033  0.067  0.90
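
A transition table of this kind can be built in base R by pairing each observation with the same entity's observation in the next period and cross-tabulating the two states. The sketch below uses hypothetical toy data and counts only transitions between consecutive years; how paneldesc treats gaps in the time index is not covered here.

```r
# Hypothetical toy panel: 2 firms, 4 years
toy <- data.frame(
  firm      = rep(1:2, each = 4),
  year      = rep(1:4, times = 2),
  ownership = c("private", "private", "public", "public",
                "mixed",   "mixed",   "mixed",  "private")
)
toy <- toy[order(toy$firm, toy$year), ]

# Pair each row with the next row; keep pairs from the same firm
# that are exactly one year apart
idx  <- seq_len(nrow(toy) - 1)
same <- toy$firm[idx] == toy$firm[idx + 1] &
  toy$year[idx + 1] == toy$year[idx] + 1
from <- toy$ownership[idx][same]
to   <- toy$ownership[idx + 1][same]

lev    <- c("private", "public", "mixed")
counts <- table(from = factor(from, lev), to = factor(to, lev))
shares <- round(prop.table(counts, margin = 1), 3)  # each row sums to 1
print(counts)
print(shares)
```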