Combines a multi-gene expression matrix with sample metadata into a long-format, analysis-ready data frame. Each row represents one gene-sample observation, with human-readable gene names resolved from the annotation data frame.

Usage

build.analysis.df(expr.matrix, phenotype, genes, group.col = "disease.state")

Arguments

expr.matrix

Numeric matrix of gene expression values as returned by get.gene.expression(). Rows are probe IDs, columns are sample IDs.

phenotype

Data frame of sample metadata as returned by extract.expression()$phenotype. Row names must correspond to the column names of expr.matrix.

genes

Data frame of gene annotations as returned by extract.expression()$gene. Row indices must correspond to the probe IDs (row names) of expr.matrix. Used to resolve probe IDs to the human-readable gene names stored in the "Gene symbol" column.

group.col

Character. Name of the column in phenotype to use as the grouping variable. Default is "disease.state".
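
The alignment requirements above can be checked before calling the function. The following is a minimal base R sketch on toy inputs, not part of the package. The values are hypothetical, and it assumes the annotation data frame carries probe IDs as row names.

```r
# Toy inputs (hypothetical values, shaped like the documented arguments).
expr.matrix <- matrix(
  c(1.2, 3.4, 2.0, 2.5),
  nrow = 2,
  dimnames = list(c("p1", "p2"), c("s1", "s2"))
)
phenotype <- data.frame(
  disease.state = c("control", "case"),
  row.names = c("s1", "s2")
)
# Assumption: probe IDs are stored as row names of the annotation data frame.
genes <- data.frame(
  "Gene symbol" = c("MUC20", "ADH1A"),
  row.names = c("p1", "p2"),
  check.names = FALSE
)
group.col <- "disease.state"

# Every sample column must have a metadata row.
samples.ok <- all(colnames(expr.matrix) %in% rownames(phenotype))
# Every probe row must have an annotation row.
probes.ok <- all(rownames(expr.matrix) %in% rownames(genes))
# The grouping column must exist in the metadata.
group.ok <- group.col %in% names(phenotype)

stopifnot(samples.ok, probes.ok, group.ok)
```

If any check fails, stopifnot() raises an error naming the failed condition, which points at the mismatched input more directly than an empty or partially filled result would.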

Value

A long-format data frame with one row per gene-sample pair and three columns:

gene

Character. Human-readable gene name resolved from the gene annotation data frame.

expression

Numeric. Expression value for that gene-sample pair.

group

Character or factor. Group label for each sample, renamed from the group.col column in phenotype for consistency with downstream functions.

Rows with NaN or NA expression values are removed.

Details

The function pivots expr.matrix from wide format (probes x samples) to long format, merges sample metadata by sample ID, resolves probe IDs to gene names using the annotation data frame, and selects the three columns needed for analysis. The output is the standard input for analyze.gene(), gene.analysis.plot(), and fit.lasso().
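
The steps above can be sketched in base R on toy data. This illustrates the described transformation and is not the package's implementation, which uses tidyr for the pivot. All input values and the intermediate names long and result are hypothetical.

```r
# Toy inputs (hypothetical values, not taken from GDS3268).
expr.matrix <- matrix(
  c(1.2, 3.4, NA,
    2.0, 2.5, 4.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("p1", "p2"), c("s1", "s2", "s3"))
)
phenotype <- data.frame(
  disease.state = c("control", "case", "case"),
  row.names = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)
# Assumption: probe IDs are stored as row names of the annotation data frame.
genes <- data.frame(
  "Gene symbol" = c("MUC20", "ADH1A"),
  row.names = c("p1", "p2"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
group.col <- "disease.state"

# Pivot wide -> long: one row per probe-sample pair.
# as.vector() walks the matrix column by column, so probes cycle fastest.
long <- data.frame(
  probe = rep(rownames(expr.matrix), times = ncol(expr.matrix)),
  sample = rep(colnames(expr.matrix), each = nrow(expr.matrix)),
  expression = as.vector(expr.matrix),
  stringsAsFactors = FALSE
)

# Attach the group label by sample ID and the gene name by probe ID.
long$group <- phenotype[long$sample, group.col]
long$gene <- genes[long$probe, "Gene symbol"]

# Drop NA/NaN expression values and keep the three analysis columns.
result <- long[!is.na(long$expression), c("gene", "expression", "group")]
rownames(result) <- NULL
```

Here is.na() returns TRUE for both NA and NaN, so a single filter covers both kinds of missing value.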

The pivot step requires the tidyr package, which must be listed under Imports in the package's DESCRIPTION file.

Examples

# \donttest{
geo <- extract.expression(load.geo.soft(accession = "GDS3268", log.transform = TRUE))
#> GDS3268 not found locally, downloading from NCBI GEO...
#> Using locally cached version of GDS3268 found here:
#> /tmp/RtmpxRZSjV/GDS3268.soft.gz 
#> Warning: NaNs produced
#> Using locally cached version of GPL1708 found here:
#> /tmp/RtmpxRZSjV/GPL1708.annot.gz 
probe <- find.probe.by.gene(geo$gene, c("MUC20", "ADH1A"))
expr <- get.gene.expression(geo$expression, probe)
df <- build.analysis.df(expr, geo$phenotype, geo$gene)
head(df)
#> [1] gene       expression group     
#> <0 rows> (or 0-length row.names)
# }