Big Data Infrastructure

Apache Spark / Hadoop

Distributed big data processing for genomics and large-scale analytics

Version: 3.4+ / 3.3+ | License: Apache 2.0 | Languages: Python, Scala, Java, R | Platform: KENET HPC Cluster / Cloud

Overview

Apache Spark and Hadoop are deployed on the KENET HPC cluster at the University of Nairobi for large-scale data processing tasks. Spark is used for genomics variant calling pipelines, large NLP corpus processing, and satellite imagery batch analysis. PySpark provides the primary Python interface for DASCLAB researchers.

Key Features

  • PySpark: Python API for distributed data processing
  • Spark SQL: structured data queries at scale
  • MLlib: distributed machine learning algorithms
  • Spark Streaming: near-real-time processing of data streams
  • Hadoop HDFS: distributed file system for large datasets
  • KENET HPC integration: 128-core cluster access for DASCLAB researchers
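
As a rough illustration of how the PySpark and Spark SQL features listed above fit together, the sketch below counts variants per chromosome from tab-separated, VCF-like lines. Everything named here (`Variant`, `parse_variant`, `count_variants_per_chromosome`, the column layout, and the `local[*]` master) is a hypothetical choice for the example, not the cluster's actual pipeline or configuration. For brevity the lines are parsed on the driver; a real job would more likely load files through `spark.read`.

```python
from collections import namedtuple

# Hypothetical record type for one variant call (illustrative only).
Variant = namedtuple("Variant", ["chrom", "pos", "ref", "alt"])


def parse_variant(line):
    """Parse one tab-separated, VCF-like data line into a Variant.

    Returns None for blank lines and header lines starting with '#'.
    Assumes the first five columns are CHROM, POS, ID, REF, ALT.
    """
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    chrom, pos, _id, ref, alt = line.split("\t")[:5]
    return Variant(chrom, int(pos), ref, alt)


def count_variants_per_chromosome(lines):
    """Count variants per chromosome using a DataFrame and a Spark SQL query."""
    # Imported here so parse_variant stays usable without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: local mode, not the cluster master
        .appName("variant-count-sketch")
        .getOrCreate()
    )
    try:
        # Parse on the driver and drop headers/blank lines.
        variants = [v for v in map(parse_variant, lines) if v is not None]
        # Column names are inferred from the namedtuple fields.
        df = spark.createDataFrame(variants)
        df.createOrReplaceTempView("variants")
        rows = spark.sql(
            "SELECT chrom, COUNT(*) AS n FROM variants "
            "GROUP BY chrom ORDER BY chrom"
        ).collect()
        return {row["chrom"]: row["n"] for row in rows}
    finally:
        spark.stop()
```

Calling `count_variants_per_chromosome` with a small, non-empty list of lines returns a plain dictionary mapping each chromosome name to its variant count. Keeping the parsing step as an ordinary Python function makes it easy to check separately from the Spark job.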