Data Science & Visualization
Statistics and storytelling. Distributions, dashboards, charts that communicate, and the analysis discipline behind defensible product decisions.
Mastering Probability Distributions for Machine Learning
Probability distributions are maps of data behavior. Understanding them allows you to select better models, engineer features effectively, and quantify uncertainty in production pipelines.
Why R-Squared Misleads and How to Properly Evaluate Regression
R-squared measures explained variance but ignores model complexity and outliers. To truly understand model performance, you must use a suite of metrics—MAE, MSE, RMSE, and Adjusted R-squared—to identify where your model fails and why.
Improving Uncertainty Estimation for Classifier Performance
Standard confidence interval methods often fail for small datasets or high-performance models; using Agresti-Coull, Wilson, or regularized bootstrap methods significantly improves accuracy.
Mapping Data Science: A Periodic Table Approach
Data science can be decoded by organizing its concepts into a periodic table where rows represent data maturity (from raw to insights) and columns represent analytical activities (from acquisition to evaluation).
IBM Technology: 6 Habits That Elevate Data Science Projects Beyond Model Selection
Exceptional data science outcomes depend less on complex algorithms and more on disciplined fundamentals like data auditing, version control, and rigorous documentation.
Why Accuracy Metrics Hide ML Model Failures
High accuracy scores in automated systems like résumé classifiers often mask systemic biases and data quality issues that lead to unfair rejection patterns.
Spatial Graph Neural Networks for Urban Function Inference
A practical pipeline for urban function inference using city2graph, OSMnx, and PyTorch Geometric to classify POIs based on spatial relationships and graph topology.
Building 3D Medical Segmentation Pipelines with MONAI
This tutorial demonstrates an end-to-end 3D spleen segmentation pipeline using MONAI and a 3D UNet, covering data preprocessing, patch-based training, and sliding-window inference.
CrowdMath: A New Dataset for Mathematical Research Reasoning
CrowdMath is a new dataset derived from crowdsourced mathematical research discussions, designed to improve AI reasoning capabilities in complex, multi-step mathematical domains.
Building a Semantic Search and Classifier for ResearchMath-14k
This tutorial demonstrates how to build a semantic search engine and status classifier for the ResearchMath-14k dataset using sentence embeddings, TF-IDF, and logistic regression.
Why Singular Value Decomposition Outperforms Eigen Decomposition
While eigenvectors identify stable directions in square matrices, Singular Value Decomposition (SVD) provides a more robust, universal framework for analyzing the rectangular matrices found in modern neural networks.
Essential NumPy Concepts for Practical Data Science
Mastering eight core NumPy concepts—from vectorization to broadcasting—provides the foundation for 80% of daily data science tasks in Python.
Predicting US Recessions with DTW and Boosted Trees
A framework for predicting economic cycles by using Dynamic Time Warping to align yield curve data, followed by boosted tree modeling and AWS containerized deployment.
Demystifying ML Math: From Vectors to Eigenvalues
Machine learning math is often obscured by intimidating terminology. Practitioners view these concepts as tools for structuring data, measuring change, and quantifying uncertainty in decision-making.
Mastering Step Plots in Matplotlib
Step plots are superior to standard line plots for visualizing incremental state changes, such as inventory levels, interest rates, or discrete signals, where transitions are abrupt rather than gradual.
Practical Advanced Feature Engineering for Machine Learning
Feature engineering is the primary driver of model performance. By systematically handling missing data, outliers, skewed distributions, and categorical encoding, you can transform raw data into high-signal features that models can actually learn from.
skfolio: Build & Tune Portfolio Optimizers in Python
skfolio's scikit-learn API lets you construct, validate, and compare 18+ portfolio strategies—from baselines to HRP, Black-Litterman, factors, and tuned models—on S&P 500 returns with walk-forward CV and GridSearchCV.
Reproduce 2011 Sentiment Word Vectors in Python
Build sentiment-aware word embeddings from IMDb reviews via semantic learning with star ratings and linear SVM classification, reproducing Maas et al. (2011); the simple method rivals modern LLMs.
Scanpy Pipeline for PBMC scRNA-seq Clustering & Trajectories
Process PBMC-3k data with Scanpy: filter cells (min 200 genes, <2500 genes, <5% mt), remove Scrublet doublets, select HVGs (min_mean=0.0125, max_mean=3, min_disp=0.5), Leiden cluster at res=0.5, annotate via markers, infer PAGA/DPT trajectories, score IFN response.
NMI Bias Favors Complex Clusters Over Insight
Normalized Mutual Information (NMI) rewards over-segmentation and complexity in clustering, inflating scores for intuitively poor algorithms and distorting AI evaluations.
Balance Linear Simplicity and Nonlinear Flexibility to Avoid Fit Failures
Linear models underfit nonlinear data with rigid straight boundaries; nonlinear models overfit by memorizing noise with wiggly curves. Balance the two through the bias-variance tradeoff to generalize well.
Time Series Fundamentals Before Modeling
Time series data depends on order—avoid shuffling or random splits. Decompose into trend, seasonality, cycles, noise; ensure stationarity (constant mean/variance/autocovariance) via differencing, logs, detrending; diagnose with ACF/PACF for AR/MA patterns.
Triple YOLO Recall with Adaptive Post-Processing
In crowded scenes, set YOLO confidence to 0.05, then filter dynamically by frame score distribution, box size (lower threshold for <5% height boxes), and pose keypoints (nose + shoulders) to detect 3x more people without retraining.
Synthetic Data Exposes Hidden ML Bias Before Production
Real training data hides bias via underrepresentation (e.g., rural at 9%), proxies, and skewed labels; generate synthetic data with controlled segments (e.g., rural at 25%) to reveal it through disaggregated AUC drops (0.791 to 0.768) and disparate impact <0.8, then retrain on mixed data to fix.
Momentum Dampens GD Zigzags via Gradient Averaging
On anisotropic loss surfaces (condition number 100), vanilla GD zigzags and takes 185 steps to converge (loss <0.001); momentum with β=0.9 converges in 159 steps by canceling steep-direction oscillations while accelerating flat directions—but β=0.99 diverges.
Track One User-Feature Pair to Catch ML Pipeline Bugs
A recommendation model with 0.91 AUC failed in production after 4 days because its user_30d_purchases features were 21 hours stale. Track user U-9842 and this feature through every pipeline layer to expose and prevent such mismatches.
Production ML Pipelines with ZenML: Custom Materializers & HPO
ZenML enables end-to-end ML pipelines with custom DatasetBundle materializers for metadata-rich serialization, fan-out over 4 hyperparameter configs for RandomForest/GradientBoosting/LogisticRegression, fan-in best-model selection by ROC AUC, full artifact tracking, and cache-driven reproducibility on the breast cancer dataset.
Stream Parse TaskTrove Dataset for AI Task Insights
Stream the multi-GB TaskTrove dataset without a full download; parse gzip-compressed tar/zip/JSON binaries to analyze sources, compressed sizes (p50 in KB), and filenames, and to detect verifiers for RL-ready tasks via multi-signal heuristics.
Build Queryable Options IV DB from Live API Polls
Capture SpiderRock LiveImpliedQuote snapshots for TSLA every 10s into SQLite: append full history for audits (12k+ rows in 2min), upsert latest view per option_key. Query to reconstruct vol smiles and track ATM IV/skew changes over time.
Data Science Splits: Engineer Pipelines or Lead Decisions
Data scientist roles are dividing into technical data engineering (SQL up 18%, ETL up 18%) and strategic decision-making; AI automates mid-level generalist tasks, squeezing the middle—specialize in one side now.