Glow is an open-source toolkit for working with genomic data at biobank scale and beyond. The toolkit is built natively on Apache Spark, the leading unified engine for big data processing and machine learning, so genomics workflows can run at cloud scale.
Glow makes genomic data work with Spark SQL, the leading engine for working with large structured datasets. It fits natively into the ecosystem of tools that have enabled thousands of organizations to scale their workflows to petabytes of data.
Glow works with datasets in common file formats like VCF or BGEN as well as common big data standards. You can write queries using the native Spark SQL APIs in Python, SQL, R, Java, and Scala. The same APIs allow you to bring your genomic data together with other datasets like electronic health records, real-world evidence, and medical images. Glow also makes it easy to parallelize existing tools and libraries, whether they are implemented as command-line tools or pandas functions.
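As a rough illustration of querying a VCF through the regular Spark APIs, here is a minimal sketch in Python. It assumes a Spark session with Glow available on the classpath, so that the `vcf` data source and its `genotypes.calls` column (with missing calls encoded as `-1`) can be used. The helper names `alt_allele_frequency` and `add_frequency_column` are hypothetical and are not part of Glow.

```python
from typing import Iterable, List, Optional


def alt_allele_frequency(calls_per_sample: Iterable[List[int]]) -> Optional[float]:
    """Fraction of called alleles that are non-reference.

    Each inner list holds one sample's allele calls; negative values are
    treated as missing and ignored. Returns None when nothing was called.
    """
    called = [c for calls in calls_per_sample for c in calls if c >= 0]
    if not called:
        return None
    return sum(1 for c in called if c > 0) / len(called)


def add_frequency_column(spark, vcf_path: str):
    """Read a VCF into a DataFrame and add a per-variant `alt_freq` column.

    Hypothetical helper: assumes Glow's `vcf` data source is installed.
    """
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    freq_udf = F.udf(alt_allele_frequency, DoubleType())
    variants = spark.read.format("vcf").load(vcf_path)
    # `genotypes.calls` selects the calls field from every genotype struct,
    # giving one list of allele calls per sample.
    return variants.select(
        "contigName",
        "start",
        "referenceAllele",
        "alternateAlleles",
        freq_udf(F.col("genotypes.calls")).alias("alt_freq"),
    )
```

Because the result is an ordinary Spark DataFrame, it could then be joined with other tables (for example, phenotype data) using the usual DataFrame or SQL syntax. A Python UDF is used here only to keep the sketch short; it is not necessarily the most efficient way to compute such a statistic.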
If you’ve used Spark before, you don’t need to learn any new APIs to get started with Glow. The toolkit includes the building blocks that you need to perform the most common analyses right away: