#data-science
Every summary, chronological. Filter by category, tag, or source from the rail.
skfolio: Build & Tune Portfolio Optimizers in Python
skfolio's scikit-learn API lets you construct, validate, and compare 18+ portfolio strategies—from baselines to HRP, Black-Litterman, factors, and tuned models—on S&P 500 returns with walk-forward CV and GridSearchCV.
2026 Vector DBs: Match Scale, Cost, Stack for RAG Success
Leverage existing Postgres/Mongo with pgvector (millions vectors, free) or Atlas ($30/mo max Flex) to avoid sprawl; self-host Qdrant ($30-50/mo for 50M vectors) for perf; Pinecone ($20/mo) or Milvus (100B+) for managed scale.
Scanpy Pipeline for PBMC scRNA-seq Clustering & Trajectories
Process PBMC-3k data with Scanpy: filter cells (min 200 genes, <2500 genes, <5% mt), remove Scrublet doublets, select HVGs (min_mean=0.0125, max_mean=3, min_disp=0.5), Leiden cluster at res=0.5, annotate via markers, infer PAGA/DPT trajectories, score IFN response.
NMI Bias Favors Complex Clusters Over Insight
Normalized Mutual Information (NMI) rewards over-segmentation and complexity in clustering, inflating scores for intuitively poor algorithms and distorting AI evaluations.
Balance Linear Simplicity and Nonlinear Flexibility to Avoid Fit Failures
Linear models underfit nonlinear data with rigid straight boundaries; nonlinear models overfit by memorizing noise with wiggly curves. Fix via bias-variance tradeoff for optimal generalization.
Time Series Fundamentals Before Modeling
Time series data depends on order—avoid shuffling or random splits. Decompose into trend, seasonality, cycles, noise; ensure stationarity (constant mean/variance/autocovariance) via differencing, logs, detrending; diagnose with ACF/PACF for AR/MA patterns.
Test Campaign Boosts Profit but Needs Funnel Fixes
Test campaign delivers higher revenue ($781,850 vs $758,050) and profit ($704,958 vs $691,232) with stat sig (p~0), higher CTR (10.2% vs 5.1%), but lower ROI (9.3 vs 10.6) and CAC ($4.92 vs $4.41). Scale it while targeting mid-funnel drop-offs.
Synthetic Data Exposes Hidden ML Bias Before Production
Real training data hides bias via underrepresentation (e.g., rural at 9%), proxies, and skewed labels; generate synthetic data with controlled segments (e.g., rural at 25%) to reveal it through disaggregated AUC drops (0.791 to 0.768) and disparate impact <0.8, then retrain on mixed data to fix.
Track One User-Feature Pair to Catch ML Pipeline Bugs
A rec model's 0.91 AUC failed in prod after 4 days due to 21-hour stale user_30d_purchases features. Track user U-9842 and this feature through every pipeline layer to expose and prevent such mismatches.
Production ML Pipelines with ZenML: Custom Materializers & HPO
ZenML enables end-to-end ML pipelines with custom DatasetBundle materializers for metadata-rich serialization, fan-out over 4 hyperparameter configs for RandomForest/GradientBoosting/LogisticRegression, fan-in best-model selection by ROC AUC, full artifact tracking, and cache-driven reproducibility on breast cancer dataset.
Scale GenAI to Billions of Rows in BigQuery at 94% Less Cost
BigQuery's optimized mode distills LLMs into lightweight models using embeddings, slashing token use by 94% (55M to 3M) and query time from 16min to 2min on 34k images or 50k voice commands, scaling to billions of rows.
Stream Parse TaskTrove Dataset for AI Task Insights
Stream multi-GB TaskTrove dataset without full download; parse gzip-compressed tar/zip/JSON binaries to analyze sources, sizes (median p50 KB compressed), filenames, and detect verifiers for RL-ready tasks via multi-signal heuristics.
Build Queryable Options IV DB from Live API Polls
Capture SpiderRock LiveImpliedQuote snapshots for TSLA every 10s into SQLite: append full history for audits (12k+ rows in 2min), upsert latest view per option_key. Query to reconstruct vol smiles and track ATM IV/skew changes over time.
Parse, Analyze, Visualize Hermes Agent Traces for Fine-Tuning
Extract thoughts/tool calls from Hermes agent dataset with regex parsers; compute stats like avg turns per trajectory, tool frequencies, error rates; visualize patterns; tokenize with assistant-only labels for SFT on Qwen models.
Data Science Splits: Engineer Pipelines or Lead Decisions
Data scientist roles are dividing into technical data engineering (SQL up 18%, ETL up 18%) and strategic decision-making; AI automates mid-level generalist tasks, squeezing the middle—specialize in one side now.
Autodata: Agents Create Superior Synthetic Training Data
Meta's Autodata deploys AI agents as data scientists to iteratively generate high-quality QA pairs from CS papers, outperforming CoT Self-Instruct by expanding weak-strong solver gaps from 1.9 to 34 points and boosting downstream model training.
Flink Treats Batch as Streaming for Unified Low-Latency Processing
Apache Flink processes unbounded streams and bounded batches with one engine using operators, state, windows, and exactly-once guarantees, eliminating dual codebases for real-time apps like recommendation engines handling millions of events.
Data And Beyond Grows to 49K Views, AI Topics Dominate
April 2026 stats: 49K views, 14.8K reads, +90 followers to 2K. Top stories cover Spark optimization, Claude AI leaks, clustering pitfalls, and RAG vs MCP.
Decompose Signals into Frequencies for Easier Analysis
Fourier transform breaks time-domain signals into frequency components, exposing periodic patterns buried in noise for filtering, compression, and fault detection—reversible and efficient via FFT.
Bigtable Scales Petabytes for Real-Time NoSQL Workloads
Bigtable auto-scales to hundreds of petabytes and millions of ops/sec with low latency, powering Google Search/YouTube/Maps; ideal for time series, ML features, and streaming via Flink/Kafka integrations.
Google Cloud TechETL Pipeline Turns Messy HR Data into Star Schema Insights
Build a scalable ETL pipeline to restructure flat HR data into a star schema fact/dimension tables, enabling analysis of manager performance, diversity (60% White, 56.6% female), recruitment channels, and 71% accurate attrition prediction where tenure drives 47% of decisions.
Automate Weekly PDF Reports with Python ETL Pipeline
Load/merge e-commerce datasets, compute revenue/profit/AOV/growth metrics, generate PDF with matplotlib/ReportLab charts and rule-based insights, email via smtplib, schedule weekly via GitHub Actions cron.
AI Amplifies Bad Data—Fix It First
AI doesn't fix poor data quality; it scales the errors, leading to wrong decisions like approving bad loans or prioritizing wrong customers. 85% of AI failures stem from bad data, so clean data before adopting AI.
Preprocessing Swings CNN Accuracy from 65% to 87% on CIFAR-10
Raw CIFAR-10 pixels yield 65% test accuracy; normalization/standardization lift to 69%; geometric augmentation maintains ~67%; photometric brightness/contrast crashes to 20%; combined pipeline with deeper CNN hits 87%.
Launch Data Governance via Pilot Projects, Not Big Plans
Start data governance with a narrow pilot project as a starting line to prove value quickly, then scale incrementally while building self-sustaining mechanisms like legislation, judiciary, and enforcement.
TabPFN Beats Tree Models on Tabular Accuracy with Zero Training
On a 5k-sample tabular dataset, TabPFN hits 98.8% accuracy vs CatBoost's 96.7% and Random Forest's 95.5%, with 0.47s setup but 2.21s inference due to in-context learning at predict time.
Data And Beyond Doubles Followers to 2K in 10 Months
Medium data/AI publication grew from 1,000 to 2,000 followers in ~10 months, fueled by practical guides on AI agents, ML models, data tools, and analysis techniques—top post on vector databases.
Cohort Analysis Exposes Donor Retention Risks
Rising aggregate retention (27% to 42%) hides leaky bathtub: 75% of 2025 revenue from 2024-2025 cohorts, with older cohorts contributing <2% each, risking collapse without long-term base.
Cleveland's Enduring Impact on Data Viz and Science
William Cleveland pioneered data visualization as a rigorous discipline via graphical perception studies and books like The Elements of Graphing Data, while outlining data science's foundations in 2001, shaping tools data workers use today.
AI SQL: Strengths, 4 Pitfalls, and Fix Checklist
AI reliably generates simple aggregations and boilerplate SQL but fails on fanout joins, wrong window frames, NULL mishandling, and dialect mismatches. Use a detailed prompt template and 6-point review checklist to catch errors fast.
Showing 30 of 60