An open-source toolkit for large-scale genomic analysis




About Glow


Glow is an open-source toolkit for working with genomic data at biobank scale and beyond. The toolkit is built natively on Apache Spark, the leading unified engine for big data processing and machine learning, so genomics workflows can scale with cloud infrastructure.


Built to scale

Glow makes genomic data work with Spark SQL, the leading engine for working with large structured datasets. It fits natively into the ecosystem of tools that have enabled thousands of organizations to scale their workflows to petabytes of data.

Flexible

Glow works with datasets in common file formats like VCF or BGEN as well as common big data standards. You can write queries using the native Spark SQL APIs in Python, SQL, R, Java, and Scala. The same APIs allow you to bring your genomic data together with other datasets like electronic health records, real-world evidence, and medical images. Glow also makes it easy to parallelize existing tools and libraries that are implemented as command-line tools or pandas functions.
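As a rough illustration of what parallelizing an existing command-line tool involves, the sketch below splits a list of text records into partitions and streams each partition through a subprocess. It uses only the Python standard library, not Glow or Spark, and is not Glow's API. The names `TOOL`, `pipe_partition`, and `pipe_all` are hypothetical, and the "tool" is a stand-in that simply upper-cases its input.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an existing command-line tool:
# it reads lines on stdin and writes them upper-cased to stdout.
TOOL = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]


def pipe_partition(lines):
    """Send one partition of text lines through the tool; return its output lines."""
    result = subprocess.run(
        TOOL,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with a non-zero status
    )
    return result.stdout.splitlines()


def pipe_all(lines, num_partitions=4):
    """Split lines into contiguous partitions and pipe them through the tool concurrently."""
    size = max(1, -(-len(lines) // num_partitions))  # ceiling division
    partitions = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # pool.map preserves partition order, so output order matches input order.
        outputs = list(pool.map(pipe_partition, partitions))
    return [line for out in outputs for line in out]


if __name__ == "__main__":
    records = ["chr1\t100\ta\tg", "chr1\t200\tc\tt", "chr2\t300\tg\ta"]
    for row in pipe_all(records, num_partitions=2):
        print(row)
```

A distributed engine applies the same pattern across the partitions of a DataFrame on many machines; the threads here only mimic that on a single host.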

Easy to get started

If you’ve used Spark before, you don’t need to learn any new APIs to get started with Glow. The toolkit includes the building blocks that you need to perform the most common analyses right away:

  • Datasources for loading VCF and BGEN files into Spark DataFrames
  • Functions for performing quality control and data manipulation
  • Variant normalization and liftover
  • Regression functions
  • Integration with Spark ML libraries for population stratification
  • Utilities for piping DataFrames through command-line tools
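To make the variant normalization item above more concrete, here is a minimal plain-Python sketch of the commonly described left-align-and-trim procedure for a single biallelic variant. It is not Glow's implementation and does not use Glow's API. The function name `normalize`, its argument layout (1-based `pos`, an in-memory `reference` string for one contig), and the example sequence are all assumptions made for illustration.

```python
def normalize(pos, ref, alt, reference):
    """Left-align and trim one biallelic variant.

    pos is 1-based; reference is the contig sequence as a string.
    Returns the normalized (pos, ref, alt).
    """
    if ref == alt:
        raise ValueError("ref and alt alleles are identical")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                # Contig start is not handled in this sketch.
                raise ValueError("cannot left-extend past the contig start")
            pos -= 1
            base = reference[pos - 1]  # convert 1-based pos to 0-based index
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    contig = "GGGCACACACAGGG"
    # A "CA" deletion written in the middle of the repeat is shifted left.
    print(normalize(8, "CAC", "C", contig))
```

The same deletion can be written at several positions inside a repeat, so normalizing variants to one canonical representation before comparing or joining datasets keeps equivalent variants from being counted as different.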

Contributors