Embedding UMAP

A UMAP of model embeddings over 111,329 labeled 100 bp genomic windows: coding (CDS), 5′/3′ UTRs, lncRNA, promoters and enhancers (ENCODE cCREs), and background. It asks whether a model's representations segregate functional elements without supervision. Ported from GPN-Star (Ye, Benegas et al., bioRxiv 2025, Fig 4A/4B); the window set is their published songlab/gpn-star-umap-regions, so these plots are directly comparable to the paper's. See #246 for the method.

Each window's embedding is the model's last-layer hidden state, mean-pooled over the central 100 bp of a context-sized window and averaged across the forward and reverse-complement strands (for our causal models the strand average also counteracts the left-context bias). Every window is embedded, none dropped, so the point set is identical across models of different context sizes. Left: colored by annotated region. Right: colored by conservation (75th-percentile phastCons). Points are rasterized; the legends carry the color key.
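
The pooling described above can be sketched in a few lines. This is a minimal, framework-free illustration and not the code from #246. The function names (`center_mean`, `pool_window`) are hypothetical, and it assumes one hidden-state vector per base pair and that the reverse-complement pass returns its states in reverse-strand order. A real pipeline would operate on batched tensors.

```python
from statistics import fmean


def center_mean(hidden, center):
    """Mean-pool per-position vectors over the central `center` positions.

    `hidden` is a list of equal-length vectors, one per base pair
    (an assumption; tokenizers with multi-bp tokens need a different mapping).
    """
    if len(hidden) < center:
        raise ValueError("context window is shorter than the pooled region")
    start = (len(hidden) - center) // 2
    rows = hidden[start:start + center]
    # Average each embedding dimension across the selected positions.
    return [fmean(col) for col in zip(*rows)]


def pool_window(h_fwd, h_rc, center=100):
    """Average the centrally pooled forward and reverse-complement embeddings.

    `h_rc` is assumed to be in reverse-strand order, so it is flipped back to
    forward coordinates first. This keeps the pooled bases identical on both
    strands even when the flank lengths are unequal.
    """
    fwd = center_mean(h_fwd, center)
    rc = center_mean(h_rc[::-1], center)
    return [(a + b) / 2 for a, b in zip(fwd, rc)]


if __name__ == "__main__":
    # Toy input: 8 positions, 2 dimensions, pooling the central 4 positions.
    h_fwd = [[2.0 * i, 2.0 * i + 1.0] for i in range(8)]
    h_rc = h_fwd[::-1]
    print(pool_window(h_fwd, h_rc, center=4))
```

Flipping the reverse-complement states before slicing is the one design choice worth noting: slicing first would shift the pooled region by one base whenever the context length minus 100 is odd.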