// Data Scientist · Analyst · Engineer

From raw signals
to real insights.

Data scientist who builds the infrastructure behind the analysis, from raw data to production-ready models and systems.

Alben Antappan

I'm a data scientist pursuing my MS in Data Science at Indiana University Bloomington. I currently do research at the IU School of Medicine, building the pipelines that turn raw neuroimaging data into structured, analysis-ready inputs for large-scale cohort modeling.

Before grad school, I spent three years at Capgemini as a Senior Analyst and Software Engineer, building financial data systems across 7 business units and 25+ global regions. My work spanned predictive forecasting models, Power BI dashboards tracking global P&L, Flask APIs, optimized SQL pipelines, and an ETL migration to Microsoft Fabric, and I owned the full stack from data to insight.

That's still true today. I don't just build models; I build what makes them possible.

Indiana University
School of Medicine
Nov 2025 - Present
Bloomington, IN
Graduate Research Assistant
  • Deployed an LLM inference pipeline on H100 GPU partitions using GPT and LLaMA models to extract 199 NACC-defined variables from neuropathology reports, implementing ground truth comparison for output validation across multiple runs.
  • Architected a fault-tolerant neuroimaging data ingestion pipeline on an HPC cluster using Slurm array jobs and Apptainer containers, processing MRI/fMRI/DTI datasets for 300+ subjects with full subject isolation.
  • Automated end-to-end neuroimaging preprocessing (DICOM → BIDS → fMRIPrep/QSIPrep/QSIRecon) via Bash scripting with robust error handling and logging, guaranteeing reliable parallel execution across distributed compute nodes.
  • Mapped structural and functional brain connectivity networks across multiple atlases, resolving data grouping anomalies and standardizing hierarchical structures for accurate regional statistical analysis.
Python · LLM · HPC · SLURM · Apptainer · Neuroimaging · FSL · FreeSurfer
Indiana University
Kelley School of Business
May - Aug 2025
Bloomington, IN
Graduate Research Assistant
  • Built Python pipelines to scrape, clean, and transform large-scale FCC broadband and U.S. Census data across 10,000+ school districts, resolving schema inconsistencies and enabling the district-level joins that formed the backbone of a published education equity study.
  • Built reproducible R Markdown ETL workflows to standardize and harmonize 5 quarters of nurse engagement survey data across 70+ hospital units, producing correlation matrix visualizations across 13 survey dimensions for a healthcare operations study.
Python · R · Data Wrangling · Statistical Analysis · Data Visualization
Capgemini
Jul 2021 - Jul 2024
Mumbai, India
Senior Analyst / Software Engineer
  • Engineered scalable Python and SQL-based ETL pipelines for a global financial platform across 7 business units and 25+ regions, encoding complex business logic to replace manual Excel workflows and improve reporting accuracy by 15%.
  • Architected a self-service web portal backend using Flask and RESTful APIs, enabling automated real-time report generation and powering interactive Power BI dashboards tracking global P&L and financial KPIs.
  • Led end-to-end POC for enterprise migration of legacy ETL workflows to Microsoft Fabric, rebuilding the full system architecture and data transformation layer to reduce querying latency and improve scalability.
  • Optimized time-series forecasting simulations incorporating ARIMA models, achieving 25% faster execution through Python refactoring and MySQL query tuning while maintaining statistical validity.
Python · SQL · Flask · REST API · Power BI · ETL · Microsoft Fabric · MySQL · ARIMA
Indiana University Bloomington
Bloomington, IN, USA
Master of Science - Data Science
Aug 2024 - May 2026
GPA 3.73 / 4.0
University of Mumbai
Mumbai, India
Bachelor of Engineering - Information Technology
Aug 2017 - Jun 2021
GPA 8.54 / 10.0
01 /
Live Demo · GitHub
FinSight AI: SEC Filing Analysis Platform
  • Architected a full-stack app with a React/Vercel frontend and containerized FastAPI backend on HuggingFace Spaces, with async background jobs, SSE token streaming, and a FAISS index persistence layer backed by HuggingFace Datasets.
  • Built a production RAG pipeline using FAISS vector search, BGE embeddings, and MMR retrieval over SEC 10-K/10-Q filings with Llama 3.1 8B, generating structured financial summaries and streaming multi-turn Q&A responses.
  • Engineered live yfinance market data injection into LLM context, per-session conversation memory, GitHub Actions CI/CD, LangSmith observability, and automatic FAISS index invalidation on new SEC filings.
Python · FastAPI · Docker · FAISS · LangChain · RAG · LangSmith · React · GitHub Actions
02 /
GitHub
U.S. Traffic Safety Intelligence Pipeline
  • Engineered a fully automated event-driven pipeline using S3 triggers, Lambda, and AWS Systems Manager to orchestrate distributed PySpark processing on EC2, handling ingestion, validation, and feature engineering across 500,000+ accident records.
  • Integrated SageMaker Autopilot for automated model training and endpoint deployment, with SNS-based status alerting and Power BI connected for downstream visual analytics, completing a serverless end-to-end workflow from raw upload to model output.
AWS · PySpark · Lambda · EC2 · SageMaker · SNS · ETL · Power BI
03 /
Public Transit On-Time Performance Dashboard
  • Designed a star-schema data model within Power BI integrating 1.9M+ GPS, APC, and schedule adherence records, handling all transformation and standardization through Power Query.
  • Built custom DAX measures for OTP classification and KPI tracking, delivering a multi-page public-facing dashboard with ArcGIS stop-level mapping and iterative redesigns based on direct stakeholder feedback from the transit agency.
Power BI · Power Query · DAX · Data Modeling · ArcGIS · Data Visualization
04 /
Lie Detection Using Facial Analysis, Electrodermal Activity, Pulse and Temperature
  • Developed a real-time classification pipeline integrating ML models, facial recognition, and IoT biosensor data.
  • Integrated multi-modal biological signals and facial features into a unified predictive model.
Python · Scikit-Learn · Machine Learning · OpenCV · IoT
Core
Python · SQL · R · Java · C/C++ · Statistical Modeling · Feature Engineering · Data Wrangling · Automation
Machine Learning
Scikit-Learn · PyTorch · Pandas · NumPy · AWS SageMaker
AI
LLM · RAG · LangChain · LangSmith · FAISS · HuggingFace
Data Engineering
PySpark · Spark · Kafka · Airflow · dbt · ETL · Distributed Processing · Relational Data Modeling
Cloud & Platforms
AWS · S3 · EC2 · Lambda · SageMaker · GCP · BigQuery · Microsoft Azure · Microsoft Fabric · Snowflake
Visualization & BI
Power BI · Tableau · Looker · Grafana · Plotly · Matplotlib · Seaborn · DAX · Power Query
Backend & APIs
Flask · FastAPI · RESTful API · Streamlit
DevOps & Tools
Git · GitHub Actions · CI/CD · Docker · SLURM · Apptainer · Bash · Shell
Certifications
01 / JETIR
Lie Detection Using Facial Analysis, Electrodermal Activity, Pulse and Temperature
Journal of Emerging Technologies and Innovative Research
02 / IJARESM
Research on Lie Detector Using Facial Analysis and Heartbeat Sensing
International Journal of All Research Education and Scientific Methods
03 / IEJRD
Alexa Based Smart Home Monitoring System
International Engineering Journal for Research and Development

I worked with Alben as a research assistant in the Faculty Assistance with Data Science program, where he focused on data analysis - both in Python and R. Alben handled cleaning and merging complex datasets, writing clear and reproducible code, and producing useful summary tables and figures. He communicated well, met deadlines reliably, and asked smart questions that moved the project forward. I would happily work with him again and recommend him for any role involving applied data analysis!

RJ Niewoehner
Assistant Professor of Operations
Kelley School of Business, Indiana University

I had the pleasure of managing Alben for over three years, and he stands out as one of the most capable and reliable professionals I've worked with. He brings a strong combination of finance operations expertise and technical skills, including Python, web application development, SQL, Rest APIs, ML and Power BI. He has a natural ability to quickly grasp complex problems and translate them into effective, scalable solutions. Many times, the quality and impact of his work reflected a level of maturity well beyond his experience. Alben takes full ownership of his work and consistently delivers with minimal to no supervision. Once aligned on the objective, he executes with precision, attention to detail, and a strong sense of accountability. His work is thorough, reliable, and rarely requires rework. Beyond his technical strengths, he is highly dependable, committed, and someone you can trust with critical responsibilities. He would be a valuable asset to any organization, and I would gladly work with him again.

Akhil Raghupatruni
Senior Consultant
Capgemini
The signal is clear.
Let's build something real.
+1 (812) 778-6696 · alben.antappan@gmail.com · linkedin.com/in/alben-antappan · github.com/AlbenZap
Bloomington, IN
© 2026 Alben Antappan - All Rights Reserved