Apache Spark / Hadoop

Distributed big data processing for genomics and large-scale analytics

Version: 3.4+ / 3.3+
License: Apache 2.0
Languages: Python, Scala, Java, R
Platform: KENET HPC Cluster / Cloud


Overview

Apache Spark and Hadoop are deployed on the KENET HPC cluster at the University of Nairobi for large-scale data processing tasks. Spark is used for genomics variant calling pipelines, large NLP corpus processing, and satellite imagery batch analysis. PySpark provides the primary Python interface for DASCLAB researchers.

Key Features

PySpark: Python API for distributed data processing
Spark SQL: structured data queries at scale
MLlib: distributed machine learning algorithms
Spark Streaming: near-real-time stream processing in micro-batches (Structured Streaming is the recommended API in Spark 3.x)
Hadoop HDFS: distributed file system for large datasets
KENET HPC integration: 128-core cluster access for DASCLAB researchers
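
As an illustration of how PySpark, Spark SQL, and HDFS fit together, the sketch below summarises already-called variants per chromosome. It is a minimal example under stated assumptions, not the DASCLAB pipeline: the function names, the `variants` view name, the quality threshold, and the simplified VCF parsing (only the CHROM, POS, and QUAL columns are read) are all hypothetical.

```python
def parse_vcf_line(line):
    """Return (chrom, pos, qual) from one tab-separated VCF data line.

    Header lines (starting with '#') and malformed lines yield None.
    A missing QUAL value ('.') is returned as None.
    """
    if not line or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return None
    qual = None if fields[5] == "." else float(fields[5])
    return (fields[0], int(fields[1]), qual)


def variants_per_chromosome(spark, vcf_path, min_qual=30.0):
    """Count variants with QUAL >= min_qual per chromosome using Spark SQL.

    `spark` is an existing SparkSession; `vcf_path` is a hypothetical path
    to uncompressed VCF text, for example on HDFS.
    """
    records = (
        spark.sparkContext.textFile(vcf_path)
        .map(parse_vcf_line)
        .filter(lambda rec: rec is not None)
    )
    # An explicit DDL schema avoids type inference problems with missing QUAL.
    df = spark.createDataFrame(records, "chrom string, pos int, qual double")
    df.createOrReplaceTempView("variants")
    return spark.sql(
        "SELECT chrom, COUNT(*) AS n_variants "
        "FROM variants "
        f"WHERE qual >= {float(min_qual)} "
        "GROUP BY chrom ORDER BY chrom"
    )
```

To try it, create a session with `SparkSession.builder.appName("vcf-summary").getOrCreate()` (imported from `pyspark.sql`), pass it to `variants_per_chromosome` together with a VCF path, and call `.show()` on the result. Rows with a missing QUAL are dropped by the `WHERE` clause because comparisons with NULL are not true in SQL.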

Use Cases at DASCLAB

Genomics variant calling and population genetics pipelines
Large-scale NLP corpus preprocessing (50M+ token datasets)
Sentinel-2 satellite image batch processing for AgroML
Distributed cross-validation for large ML models
Student big data coursework on the KENET cluster
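
The corpus preprocessing use case above can be sketched as a distributed token count. This is a hedged illustration only: the tokenisation rule, the function names, and the corpus path are assumptions made for the example, and a real preprocessing job would likely use a language-aware tokenizer.

```python
import re
from operator import add

# Assumed tokenisation rule: runs of lowercase letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(line):
    """Lowercase a line of text and split it into word tokens."""
    return TOKEN_RE.findall(line.lower())


def top_tokens(spark, corpus_path, n=20):
    """Return the n most frequent (token, count) pairs in a text corpus.

    `spark` is an existing SparkSession; `corpus_path` is a hypothetical
    location such as a directory of plain-text files on HDFS.
    """
    lines = spark.sparkContext.textFile(corpus_path)
    counts = (
        lines.flatMap(tokenize)          # one record per token
        .map(lambda tok: (tok, 1))       # pair each token with a count of 1
        .reduceByKey(add)                # sum counts per token across the cluster
    )
    # Negating the count makes takeOrdered return the largest counts first.
    return counts.takeOrdered(n, key=lambda kv: -kv[1])
```

Because `tokenize` is a plain Python function, it can be unit-tested locally before the job is submitted to the cluster; only `top_tokens` needs a running Spark session.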

Quick Links

Official Website
Documentation
Download / Manual
Source Code


Questions about this tool?

Contact the DASCLAB team for support, training, or collaboration enquiries.
