Big Data Analytics with Python at AIMS Ghana


Course Information

Introduction

As part of the capacity development pillar of the Big Data for Development project, AIMS-NEI designed the Big Data for Development Short Course Program (BD4D-SCP). Training sessions are being delivered across the AIMS-NEI network: in Rwanda, Senegal, Cameroon and now Ghana.

The course targets people based in Accra, Ghana, with a passion for Data Science in general and Big Data Analytics in particular. Applicants should have at least a four-year undergraduate degree or a minimum of 2 to 3 years of professional work experience in Statistics, Information Technology or any other Data Science related discipline.

A number of other short courses are in the pipeline to achieve our BD4D objectives: to increase the number of data scientists in Africa and to provide a platform for all practitioners to interact.

Overview

As the world's population and everyday devices become increasingly connected, datasets are growing so large that traditional data processing software and techniques can no longer handle them. Specialized frameworks and tools such as Apache Spark are needed to work with data at this scale. This course teaches the essentials of processing large-scale datasets using Python. It also teaches how to perform common data science tasks such as data wrangling and building machine learning models in Python. The course takes a practical approach to equip participants with the most essential tools in the shortest possible time. It emphasizes learning by doing, so there are a lot of exercises built into the course to give participants ample time to practice.

Summarized Objectives & Outcomes

  1. Understand intermediate to advanced concepts of the Python language: data structures, functions, classes and the Python package ecosystem.
  2. Perform data science tasks using Python: data ingestion, processing, visualization, web scraping, etc.
  3. Handle large-scale datasets (20 GB+) using Apache Spark: big data basics, the Hadoop ecosystem, cloud computing platforms and big data processing with Apache Spark.
  4. Be familiar with essential machine learning (ML) theory: the learning problem, types of learning, loss functions, linear models, deep learning and more.
  5. Be able to build and evaluate machine learning models: use scikit-learn and TensorFlow to build and evaluate models in Python.
  6. Appreciate real-world ML and big data use cases: object detection on Android devices and analysis of large-scale GPS data for a human mobility use case.

Outline
Day 1: Advanced Concepts in Python: on this first day, the course will focus on the Python language to build a strong foundation for the rest of the course material. Participants will be introduced to intermediate to advanced practical techniques such as writing functions and classes, error handling, packaging Python code and more.
Day 2: Python for Data Science: during the second day, the focus is on performing common data science tasks using Python. We will go through how to do data ingestion, processing, analysis, visualization, web scraping and more using Python and along the way introduce essential packages (e.g., pandas, geopandas, numpy, matplotlib etc.) for doing these tasks.
Day 3: Big Data Processing: on the third day, the course focuses on how to handle large datasets using Python. The following topics will be covered: introduction to big data, multiprocessing in Python, Apache Spark, how to use common cloud platforms and more.
Day 4: Machine Learning (ML) in Python: on this day, the course will first provide an introductory lecture on machine learning. The rest of the day will focus on how to perform various ML tasks (e.g., data preparation, model building, evaluation and interpretation) using the scikit-learn package in Python.
Day 5: Putting it All Together: during the final day, we will focus on using the skills gained in this course to solve real-life data science problems by looking at case studies. Potential case studies include: how to process night lights satellite images (geospatial), how to process massive call records from cellphones (mobile data) and how to build ML models to impute missing sensor data (sensor data).
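
To give a flavour of the kind of exercise the outline describes, here is a minimal sketch of chunked processing, one common way to aggregate records that do not fit in memory. It is not taken from the course material: the column names (caller_id, duration_s), the function names and the sample data are all hypothetical, and it uses only the Python standard library rather than Apache Spark.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def calls_per_user(csv_file, chunk_size=2):
    """Count calls per caller without holding the whole file in memory."""
    totals = Counter()
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        # Aggregate each chunk separately, then merge the partial counts.
        totals.update(Counter(row["caller_id"] for row in chunk))
    return totals


# Tiny made-up sample standing in for a large call-record file.
SAMPLE = io.StringIO(
    "caller_id,duration_s\n"
    "A,30\n"
    "B,12\n"
    "A,45\n"
    "C,5\n"
    "A,60\n"
)

counts = calls_per_user(SAMPLE)
print(counts.most_common(1))  # [('A', 3)]
```

The same aggregate-then-merge pattern is what frameworks such as Apache Spark apply across many machines; here it runs in a single process on a toy input.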

Pre-requisites
• Programming: ability to write a simple program in Python (basic Python level)
• Math and Statistics: a background in statistics, data science, or any quantitative sciences.

Hardware Requirements

For an optimal student experience, we recommend the following hardware configuration:
1. OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of OS X
2. Processor: Intel Core i5 or equivalent
3. Memory: 8 GB RAM preferred
4. Storage: at least 100 GB available space
5. Computer should preferably have access to the internet

Software Requirements

You’ll also need the following software installed in advance:
1. Browser: latest version of Google Chrome or Mozilla Firefox
2. Text editor: Atom or Sublime Text as an IDE (optional, as you can practice everything using Jupyter Notebook in your browser)
3. Anaconda: with Python 3
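
After installing the software, a short script along the following lines can help confirm that the environment is ready. This is only a sketch: the package list is an assumption based on the tools named in this announcement, and some of them (for example pyspark) may not ship with a default Anaconda installation, so a MISSING line does not necessarily indicate a problem.

```python
import importlib.util
import sys

# Assumed package list, based on the tools mentioned in this announcement.
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn", "pyspark"]


def check_environment(packages=REQUIRED):
    """Return a dict mapping each package name to True if it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name:12s} {'OK' if found else 'MISSING'}")
```

The script only checks whether each package can be located; it does not import the packages or verify their versions.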

Primary Keywords
Python; Big Data Analytics; Apache Spark; Machine Learning; Data Science.

Course Instructor & Interested Applicants

Practical Information

This training will take place from 14 to 18 April 2020 in Accra, Ghana. Attendance is limited to 50 participants and is free of charge. The program will provide lunches and coffee breaks during the training. Selected candidates will be required to cover their own transport costs to attend the training. Female applicants are strongly encouraged to apply.

Instructor Profile

Dr. Dunstan Matekenya is a consummate Data Scientist with over 10 years’ experience in both traditional statistics and modern machine learning methods. Currently, he works as a Data Scientist at the World Bank Group (WBG) Headquarters in Washington DC. Prior to joining the WBG, Dunstan completed his PhD at the University of Tokyo in 2016. His PhD research focused on the use of machine learning methods to extract insights from mobile phone data. Before re-orienting his career towards Data Science, Dunstan worked as a Statistician at the National Statistical Office in Malawi from 2007 until 2017. In Malawi, he actively contributed to flagship projects such as the 2008 Malawi Population and Housing Census and led the GIS unit. His passions include contributing to the modernization of official statistics in developing countries through the use of alternative data sources such as mobile phone data, as well as improving capacity in Data Science.

Application Selection Process

All candidates interested in applying for the Big Data Analytics with Python short course must use the AIMS-NEI online application system to complete and submit their application with all supporting documents by the deadline indicated below. We will contact shortlisted candidates to collect any additional information needed to finalize their applications. Within a week of the deadline, we will inform successful applicants. Applicants who do not receive any feedback from AIMS within a week of the deadline should consider their applications unsuccessful.

Deadline for applications: March 29th, 2020 – 11:59 PM (GMT). 

Any inquiries about this short course should be sent to aii@nexteinstein.org.