// Data Scientist · Analyst · Engineer

From raw signals
to real insights.

Data scientist who builds the infrastructure behind the analysis, from raw data to production-ready models and systems.

Alben Antappan

Data scientist with three years of industry experience and an MS in Data Science from Indiana University Bloomington. Currently doing research at the IU School of Medicine, building the pipelines that turn raw clinical data into structured, analysis-ready inputs for large-scale cohort modeling.

Before grad school, I spent three years at Capgemini as a Senior Analyst and Software Engineer building financial data systems for 7 business units across 25+ global regions: Power BI dashboards, ETL pipelines, Flask APIs, and a full migration to Microsoft Fabric. I owned the full stack, from data to insight.

That's still true today. I don't just build models; I build what makes them possible.

Indiana University
School of Medicine
Nov 2025 - Present
Bloomington, IN, USA
Graduate Research Assistant
  • Deployed a clinical NLP pipeline on H100 GPU partitions (HPC) using GPT and LLaMA transformer models to extract structured variables from unstructured clinical reports, evaluating output accuracy against expert-labeled ground truth.
  • Built Python signal denoising and feature extraction modules for multi-modal neuroimaging data (MRI, fMRI, DTI) across 300+ subjects, producing analysis-ready datasets for cohort-level statistical modeling.
  • Automated neuroimaging preprocessing workflows (fMRIPrep, QSIPrep/QSIRecon) on an HPC cluster, standardizing raw DICOM data to BIDS format to ensure consistent, high-quality inputs across the full subject cohort.
  • Mapped structural and functional brain connectivity networks across multiple neurological atlases, resolving data grouping anomalies and standardizing hierarchical relational structures to enable accurate downstream statistical analysis and cohort-level modeling.
Python · LLM · NLP · HPC · SLURM · Apptainer · Neuroimaging · FSL · FreeSurfer
Indiana University
Kelley School of Business
May - Aug 2025
Bloomington, IN, USA
Graduate Research Assistant
  • Engineered Python ETL pipelines to ingest, clean, and transform large-scale FCC broadband and U.S. Census data across 10,000+ school districts, resolving schema inconsistencies and building the district-level data model.
  • Built R-based data wrangling pipelines to harmonize 5 quarters of nurse engagement survey data across 70+ hospital units, producing statistical correlation matrix visualizations across 13 dimensions for a healthcare operations study.
Python · R · ETL · Data Wrangling · Statistical Analysis · Data Visualization
Capgemini
Jul 2021 - Jul 2024
Mumbai, MH, India
Senior Analyst / Software Engineer
  • Engineered SQL and Python-based ETL pipelines to automate financial reporting across 7 business units and 25+ global regions, replacing error-prone Excel workflows and improving reporting accuracy by 15%, enabling leadership to make faster, data-driven decisions at scale.
  • Developed interactive Power BI dashboards tracking global P&L and financial KPIs for senior leadership, backed by Flask APIs and optimized SQL pipelines, cutting ad-hoc reporting requests and delivering real-time business intelligence.
  • Led end-to-end POC for migrating legacy ETL workflows and analytics dashboards to Microsoft Fabric, redesigning the data transformation layer and validating the full system architecture to reduce query latency and improve scalability.
  • Tuned ARIMA model parameters and engineered input features for time-series forecasting simulations, validating model performance against holdout data and optimizing the execution pipeline to reduce runtime by 25%.
Python · SQL · Flask · REST API · Power BI · ETL · Microsoft Fabric · ARIMA
Indiana University Bloomington
Bloomington, IN, USA
Master of Science - Data Science
Aug 2024 - May 2026
GPA 3.76 / 4.0
University of Mumbai
Mumbai, MH, India
Bachelor of Engineering - Information Technology
Aug 2017 - Jun 2021
GPA 8.54 / 10.0
01 /
FinSight AI: SEC Filing Analysis Platform
  • Architected a full-stack app with a React/Vercel frontend and containerized FastAPI backend on HuggingFace Spaces, with async background jobs, SSE token streaming, and a FAISS index persistence layer backed by HuggingFace Datasets.
  • Built a production RAG pipeline using FAISS vector search, transformer-based BGE embeddings, and MMR retrieval over SEC 10-K/10-Q filings, serving Llama 3.1 8B to generate structured financial summaries and stream multi-turn conversational Q&A responses.
  • Engineered live market data enrichment via yfinance injection into LLM context, implemented per-session conversation memory for stateful multi-turn interactions, and integrated LangSmith for end-to-end observability, with GitHub Actions CI/CD and automatic FAISS index invalidation on new SEC filings.
Python · FastAPI · LangChain · RAG · FAISS · HuggingFace · LangSmith · Docker · React · GitHub Actions
02 /
U.S. Traffic Safety Intelligence Pipeline
  • Architected an event-driven pipeline on AWS where CSV uploads to S3 automatically trigger a Lambda function that invokes EC2 via SSM, running distributed PySpark jobs to clean, validate, and feature-engineer 500,000+ US accident records.
  • Configured SageMaker Autopilot to train, deploy, and serve accident severity prediction models via a live inference endpoint, with SNS alerts on pipeline completion and Power BI connected for downstream business intelligence reporting, completing a serverless end-to-end workflow from raw upload to model output.
AWS · S3 · Lambda · EC2 · SSM · SageMaker · SNS · PySpark · ETL · Power BI
03 /
Public Transit On-Time Performance Dashboard
  • Processed and modeled 1.9M+ GPS, APC, and schedule adherence transit records entirely within Power BI using Power Query, building a star-schema data model with fact and dimension tables to support high-granularity on-time performance analysis.
  • Developed custom DAX measures to define OTP KPIs and built a multi-page interactive dashboard with ArcGIS geospatial mapping, peak vs. non-peak comparative analysis, and route-level breakdowns, refined through direct stakeholder feedback from the transit agency.
Power BI · Power Query · DAX · Data Modeling · ArcGIS · Data Visualization
04 /
Lie Detection Using Facial Analysis, Electrodermal Activity, Pulse and Temperature
  • Developed a real-time classification pipeline integrating ML models, facial recognition, and IoT biosensor data.
  • Integrated multi-modal biological signals and facial features into a unified predictive model.
Python · Scikit-Learn · Machine Learning · OpenCV · IoT
Core
Python · SQL · R · Java · C/C++ · Statistical Modeling · Machine Learning · Predictive Modeling · Feature Engineering · Hypothesis Testing · Model Deployment · Model Validation · Data Wrangling · Automation
Machine Learning
PyTorch · Scikit-Learn · XGBoost · Pandas · NumPy · Deep Learning · Computer Vision · OpenCV · Model Evaluation
Generative AI & LLM
LLM · NLP · RAG · Transformers · Embeddings · Vector Search · LangChain · FAISS · HuggingFace · LangSmith · Arize
Data Engineering
ETL · PySpark · Spark · Kafka · Airflow · dbt · Hadoop · Distributed Processing · Relational Data Modeling
Databases
MySQL · PostgreSQL · MSSQL · Oracle DB · SQLite · NoSQL · MongoDB · Neo4J
Cloud & Platforms
AWS · S3 · EC2 · Lambda · SageMaker · SNS · SSM · GCP · BigQuery · Microsoft Azure · Microsoft Fabric · Snowflake
Visualization & BI
Power BI · ArcGIS · Tableau · Looker · Grafana · DAX · Power Query · Plotly · Matplotlib · Seaborn
Backend & APIs
FastAPI · Flask · RESTful API · Streamlit
DevOps & Tools
Git · GitHub Actions · CI/CD · Docker · SLURM · Apptainer · Bash Scripting
Domain Knowledge
Clinical NLP · OMOP Common Data Model · SNOMED · ICD
Certifications
01 / JETIR
Lie Detection Using Facial Analysis, Electrodermal Activity, Pulse and Temperature
Journal of Emerging Technologies and Innovative Research
02 / IJARESM
Research on Lie Detector Using Facial Analysis and Heartbeat Sensing
International Journal of All Research Education and Scientific Methods
03 / IEJRD
Alexa Based Smart Home Monitoring System
International Engineering Journal for Research and Development

I worked with Alben as a research assistant in the Faculty Assistance with Data Science program, where he focused on data analysis - both in Python and R. Alben handled cleaning and merging complex datasets, writing clear and reproducible code, and producing useful summary tables and figures. He communicated well, met deadlines reliably, and asked smart questions that moved the project forward. I would happily work with him again and recommend him for any role involving applied data analysis!

RJ Niewoehner
Assistant Professor of Operations
Kelley School of Business, Indiana University

I had the pleasure of managing Alben for over three years, and he stands out as one of the most capable and reliable professionals I've worked with. He brings a strong combination of finance operations expertise and technical skills, including Python, web application development, SQL, REST APIs, ML and Power BI. He has a natural ability to quickly grasp complex problems and translate them into effective, scalable solutions. Many times, the quality and impact of his work reflected a level of maturity well beyond his experience. Alben takes full ownership of his work and consistently delivers with minimal to no supervision. Once aligned on the objective, he executes with precision, attention to detail, and a strong sense of accountability. His work is thorough, reliable, and rarely requires rework. Beyond his technical strengths, he is highly dependable, committed, and someone you can trust with critical responsibilities. He would be a valuable asset to any organization, and I would gladly work with him again.

Akhil Raghupatruni
Senior Consultant
Capgemini
The signal is clear.
Let's build something real.
+1 (812) 778-6696 · alben.antappan@gmail.com · linkedin.com/in/alben-antappan · github.com/AlbenZap
Bloomington, IN, USA
© 2026 Alben Antappan - All Rights Reserved