Overview
Apache Spark and Hadoop are deployed on the KENET HPC cluster at the University of Nairobi for large-scale data processing tasks. Spark is used for genomics variant calling pipelines, large NLP corpus processing, and satellite imagery batch analysis. PySpark provides the primary Python interface for DASCLAB researchers.
Key Features
- PySpark: Python API for distributed data processing
- Spark SQL: structured data queries at scale
- MLlib: distributed machine learning algorithms
- Spark Streaming: near-real-time processing of data streams in micro-batches
- Hadoop HDFS: distributed file system for large datasets
- KENET HPC integration: 128-core cluster access for DASCLAB researchers