№ 02 / SUMMARIES

#data-science

Every summary, chronological.

DAY 01 · May 12, 2026 · 1 SUMMARY
MarkTechPost · Data Science & Visualization

skfolio: Build & Tune Portfolio Optimizers in Python

skfolio's scikit-learn API lets you construct, validate, and compare 18+ portfolio strategies—from baselines to HRP, Black-Litterman, factors, and tuned models—on S&P 500 returns with walk-forward CV and GridSearchCV.

DAY 02 · Sunday, May 10, 2026 · 1 SUMMARY
MarkTechPost

2026 Vector DBs: Match Scale, Cost, Stack for RAG Success

Reuse an existing Postgres or MongoDB deployment to avoid sprawl: pgvector (millions of vectors, free) or Atlas (Flex tier, $30/mo max). Self-host Qdrant ($30-50/mo for 50M vectors) for performance; choose Pinecone ($20/mo) or Milvus (100B+) for managed scale.

DAY 03 · Friday, May 8, 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Scanpy Pipeline for PBMC scRNA-seq Clustering & Trajectories

Process PBMC-3k data with Scanpy: filter cells (min 200 genes, <2500 genes, <5% mt), remove Scrublet doublets, select HVGs (min_mean=0.0125, max_mean=3, min_disp=0.5), Leiden cluster at res=0.5, annotate via markers, infer PAGA/DPT trajectories, score IFN response.

AI Simplified in Plain English · Data Science & Visualization

NMI Bias Favors Complex Clusters Over Insight

Normalized Mutual Information (NMI) rewards over-segmentation and complexity in clustering, inflating scores for intuitively poor algorithms and distorting AI evaluations.
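A toy calculation can make the over-segmentation effect concrete. The sketch below implements NMI by hand with arithmetic-mean normalization, which is one common variant (the summary does not say which variant the article uses), and the labels are made up for illustration.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """NMI with arithmetic-mean normalization: 2 * I(U;V) / (H(U) + H(V))."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    mutual_info = sum(
        (c / n) * log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = entropy(truth) + entropy(pred)
    return 2 * mutual_info / denom if denom > 0 else 0.0


# Hypothetical ground truth: two balanced classes of 50 points each.
truth = [0] * 50 + [1] * 50

one_cluster = [0] * 100        # lumps every point together
singletons = list(range(100))  # every point is its own cluster

print(f"one cluster: {nmi(truth, one_cluster):.3f}")
print(f"singletons:  {nmi(truth, singletons):.3f}")
```

Under these assumptions the singleton clustering scores about 0.26 while the single-cluster one scores 0, although neither says anything useful about the data.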

DAY 04 · Thursday, May 7, 2026 · 2 SUMMARIES
Data and Beyond · Data Science & Visualization

Balance Linear Simplicity and Nonlinear Flexibility to Avoid Fit Failures

Linear models underfit nonlinear data with rigid straight boundaries; nonlinear models overfit by memorizing noise with wiggly curves. Balance the two via the bias-variance tradeoff to generalize well.

Towards AI · Data Science & Visualization

Time Series Fundamentals Before Modeling

Time series data depends on order—avoid shuffling or random splits. Decompose into trend, seasonality, cycles, noise; ensure stationarity (constant mean/variance/autocovariance) via differencing, logs, detrending; diagnose with ACF/PACF for AR/MA patterns.
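As a minimal sketch of two of these points, the snippet below splits a series by position rather than at random and applies differencing. The series, the period of 4, and the helper names are hypothetical, chosen only so the effect is easy to see.

```python
def chronological_split(series, train_fraction=0.8):
    """Split by position so the test set lies strictly after the training set."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def difference(series, lag=1):
    """Lagged differencing: y[t] - y[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical series: a linear trend plus a seasonal pattern of period 4.
seasonal = [0, 2, 0, -2]
series = [3 * t + seasonal[t % 4] for t in range(24)]

train, test = chronological_split(series)

# Differencing at the seasonal lag removes both the seasonal pattern and
# the linear trend here, leaving a constant series.
deseasoned = difference(series, lag=4)

print(len(train), len(test))
print(sorted(set(deseasoned)))
```

On real data the differenced series will not be exactly constant; the ACF/PACF diagnostics mentioned above are what indicate whether enough differencing has been applied.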

DAY 05 · May 6, 2026 · 2 SUMMARIES
Learning Data · Marketing & Growth

Test Campaign Boosts Profit but Needs Funnel Fixes

The test campaign delivers higher revenue ($781,850 vs $758,050) and profit ($704,958 vs $691,232) with statistical significance (p ≈ 0) and a higher CTR (10.2% vs 5.1%), but a lower ROI (9.3 vs 10.6) and a higher CAC ($4.92 vs $4.41). Scale it while targeting mid-funnel drop-offs.

Towards AI · Data Science & Visualization

Synthetic Data Exposes Hidden ML Bias Before Production

Real training data hides bias via underrepresentation (e.g., rural at 9%), proxies, and skewed labels; generate synthetic data with controlled segments (e.g., rural at 25%) to reveal it through disaggregated AUC drops (0.791 to 0.768) and disparate impact <0.8, then retrain on mixed data to fix.
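The disparate impact check can be sketched in a few lines. This assumes the usual definition (the ratio of positive-prediction rates between a protected group and a reference group) and uses invented predictions; the group sizes and rates are not from the article.

```python
def selection_rate(predictions):
    """Fraction of positive (1) predictions in a group."""
    return sum(predictions) / len(predictions)


def disparate_impact(protected_preds, reference_preds):
    """Ratio of selection rates; values below 0.8 are commonly flagged."""
    return selection_rate(protected_preds) / selection_rate(reference_preds)


# Hypothetical model outputs (1 = approved) for two segments.
rural_preds = [1] * 30 + [0] * 70   # 30% approved
urban_preds = [1] * 50 + [0] * 50   # 50% approved

ratio = disparate_impact(rural_preds, urban_preds)
flagged = ratio < 0.8

print(f"disparate impact: {ratio:.2f}, flagged: {flagged}")
```

The same per-segment grouping applies to the disaggregated AUC: compute the metric separately for each segment instead of once over the pooled data.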

DAY 06 · May 5, 2026 · 1 SUMMARY
Towards AI · Data Science & Visualization

Track One User-Feature Pair to Catch ML Pipeline Bugs

A recommendation model with 0.91 AUC failed in production after 4 days because its user_30d_purchases features were 21 hours stale. Track user U-9842 and this feature through every pipeline layer to expose and prevent such mismatches.

DAY 07 · May 4, 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Production ML Pipelines with ZenML: Custom Materializers & HPO

ZenML enables end-to-end ML pipelines with custom DatasetBundle materializers for metadata-rich serialization, fan-out over 4 hyperparameter configs for RandomForest/GradientBoosting/LogisticRegression, fan-in best-model selection by ROC AUC, full artifact tracking, and cache-driven reproducibility on the breast cancer dataset.

Google Cloud Tech · AI & LLMs

Scale GenAI to Billions of Rows in BigQuery at 94% Less Cost

BigQuery's optimized mode distills LLMs into lightweight models using embeddings, slashing token use by 94% (55M to 3M) and query time from 16min to 2min on 34k images or 50k voice commands, scaling to billions of rows.

DAY 08 · May 3, 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Stream Parse TaskTrove Dataset for AI Task Insights

Stream the multi-GB TaskTrove dataset without a full download; parse gzip-compressed tar/zip/JSON binaries to analyze sources, sizes (median p50 KB compressed), and filenames, and detect verifiers for RL-ready tasks via multi-signal heuristics.

Data Driven Investor · Data Science & Visualization

Build Queryable Options IV DB from Live API Polls

Capture SpiderRock LiveImpliedQuote snapshots for TSLA every 10s into SQLite: append full history for audits (12k+ rows in 2min), upsert latest view per option_key. Query to reconstruct vol smiles and track ATM IV/skew changes over time.
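The append-plus-upsert storage pattern might look roughly like the sketch below. The table names, columns, option_key format, and quote values are hypothetical, and the API polling itself is left out; only the SQLite side is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quote_history (
    polled_at   TEXT NOT NULL,
    option_key  TEXT NOT NULL,
    implied_vol REAL NOT NULL
);
CREATE TABLE quote_latest (
    option_key  TEXT PRIMARY KEY,
    polled_at   TEXT NOT NULL,
    implied_vol REAL NOT NULL
);
""")


def record_snapshot(conn, polled_at, option_key, implied_vol):
    """Append to the audit history and upsert the latest view in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO quote_history VALUES (?, ?, ?)",
            (polled_at, option_key, implied_vol),
        )
        # ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer.
        conn.execute(
            """
            INSERT INTO quote_latest (option_key, polled_at, implied_vol)
            VALUES (?, ?, ?)
            ON CONFLICT(option_key) DO UPDATE SET
                polled_at = excluded.polled_at,
                implied_vol = excluded.implied_vol
            """,
            (option_key, polled_at, implied_vol),
        )


# Hypothetical snapshots from two polls, 10 seconds apart.
record_snapshot(conn, "2026-05-03T14:30:00", "TSLA-C-300", 0.52)
record_snapshot(conn, "2026-05-03T14:30:00", "TSLA-P-300", 0.58)
record_snapshot(conn, "2026-05-03T14:30:10", "TSLA-C-300", 0.55)

history_rows = conn.execute("SELECT COUNT(*) FROM quote_history").fetchone()[0]
latest_rows = conn.execute("SELECT COUNT(*) FROM quote_latest").fetchone()[0]
latest_call_iv = conn.execute(
    "SELECT implied_vol FROM quote_latest WHERE option_key = ?", ("TSLA-C-300",)
).fetchone()[0]
print(history_rows, latest_rows, latest_call_iv)
```

Keeping both tables means time-series queries read the full history while smile reconstruction reads one row per option from the latest view.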

DAY 09 · May 2, 2026 · 2 SUMMARIES
MarkTechPost · AI & LLMs

Parse, Analyze, Visualize Hermes Agent Traces for Fine-Tuning

Extract thoughts/tool calls from Hermes agent dataset with regex parsers; compute stats like avg turns per trajectory, tool frequencies, error rates; visualize patterns; tokenize with assistant-only labels for SFT on Qwen models.

Data and Beyond · Data Science & Visualization

Data Science Splits: Engineer Pipelines or Lead Decisions

Data scientist roles are dividing into technical data engineering (SQL up 18%, ETL up 18%) and strategic decision-making; AI automates mid-level generalist tasks, squeezing the middle—specialize in one side now.

DAY 10 · May 1, 2026 · 4 SUMMARIES
MarkTechPost · AI & LLMs

Autodata: Agents Create Superior Synthetic Training Data

Meta's Autodata deploys AI agents as data scientists to iteratively generate high-quality QA pairs from CS papers, outperforming CoT Self-Instruct by expanding weak-strong solver gaps from 1.9 to 34 points and boosting downstream model training.

Level Up Coding · Software Engineering

Flink Treats Batch as Streaming for Unified Low-Latency Processing

Apache Flink processes unbounded streams and bounded batches with one engine using operators, state, windows, and exactly-once guarantees, eliminating dual codebases for real-time apps like recommendation engines handling millions of events.

Data and Beyond · Data Science & Visualization

Data and Beyond Grows to 49K Views, AI Topics Dominate

April 2026 stats: 49K views, 14.8K reads, +90 followers to 2K. Top stories cover Spark optimization, Claude AI leaks, clustering pitfalls, and RAG vs MCP.

Data and Beyond · Data Science & Visualization

Decompose Signals into Frequencies for Easier Analysis

Fourier transform breaks time-domain signals into frequency components, exposing periodic patterns buried in noise for filtering, compression, and fault detection—reversible and efficient via FFT.
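To illustrate the decomposition and its reversibility, here is a minimal sketch using a naive O(n²) discrete Fourier transform written from the definition. An FFT computes the same result faster; the direct form is used here only to keep the example dependency-free. The two-tone test signal is invented.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, straight from the definition."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform; returns the real part of the reconstructed signal."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
        for t in range(n)
    ]


# Hypothetical signal: two sinusoids at 5 and 12 cycles per window.
n = 64
signal = [
    math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]

spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum[: n // 2]]

# The two strongest bins recover the two frequencies in the signal.
dominant = sorted(range(n // 2), key=lambda k: magnitudes[k], reverse=True)[:2]
print(dominant)

# Reversibility: the inverse transform reconstructs the original samples.
reconstructed = inverse_dft(spectrum)
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
print(max_error < 1e-9)
```

Filtering follows the same pattern: zero out unwanted bins in the spectrum before applying the inverse transform.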

DAY 11 · April 30, 2026 · 1 SUMMARY
Google Cloud Tech · DevOps & Cloud

Bigtable Scales Petabytes for Real-Time NoSQL Workloads

Bigtable auto-scales to hundreds of petabytes and millions of ops/sec with low latency, powering Google Search/YouTube/Maps; ideal for time series, ML features, and streaming via Flink/Kafka integrations.

DAY 12 · April 29, 2026 · 1 SUMMARY
Learning Data · Data Science & Visualization

ETL Pipeline Turns Messy HR Data into Star Schema Insights

Build a scalable ETL pipeline that restructures flat HR data into star-schema fact and dimension tables, enabling analysis of manager performance, diversity (60% White, 56.6% female), recruitment channels, and 71% accurate attrition prediction where tenure drives 47% of decisions.

DAY 13 · April 21, 2026 · 1 SUMMARY
Learning Data · Data Science & Visualization

Automate Weekly PDF Reports with Python ETL Pipeline

Load/merge e-commerce datasets, compute revenue/profit/AOV/growth metrics, generate PDF with matplotlib/ReportLab charts and rule-based insights, email via smtplib, schedule weekly via GitHub Actions cron.

DAY 14 · April 20, 2026 · 3 SUMMARIES
Data Driven Investor

AI Amplifies Bad Data—Fix It First

AI doesn't fix poor data quality; it scales the errors, leading to wrong decisions like approving bad loans or prioritizing wrong customers. 85% of AI failures stem from bad data, so clean data before adopting AI.

Level Up Coding · Data Science & Visualization

Preprocessing Swings CNN Accuracy from 65% to 87% on CIFAR-10

Raw CIFAR-10 pixels yield 65% test accuracy; normalization/standardization lifts it to 69%; geometric augmentation holds at ~67%; photometric brightness/contrast augmentation crashes it to 20%; the combined pipeline with a deeper CNN hits 87%.

Data and Beyond · Data Science & Visualization

Launch Data Governance via Pilot Projects, Not Big Plans

Start data governance with a narrow pilot project that proves value quickly, then scale incrementally while building self-sustaining mechanisms like legislation, judiciary, and enforcement.

DAY 15 · April 19, 2026 · 1 SUMMARY
MarkTechPost · Data Science & Visualization

TabPFN Beats Tree Models on Tabular Accuracy with Zero Training

On a 5k-sample tabular dataset, TabPFN hits 98.8% accuracy vs CatBoost's 96.7% and Random Forest's 95.5%, with 0.47s setup but 2.21s inference due to in-context learning at predict time.

DAY 16 · April 18, 2026 · 1 SUMMARY
Data and Beyond · Marketing & Growth

Data and Beyond Doubles Followers to 2K in 10 Months

A Medium data/AI publication grew from 1,000 to 2,000 followers in ~10 months, fueled by practical guides on AI agents, ML models, data tools, and analysis techniques; the top post covered vector databases.

DAY 17 · April 16, 2026 · 1 SUMMARY
Data and Beyond · Data Science & Visualization

Cohort Analysis Exposes Donor Retention Risks

Rising aggregate retention (27% to 42%) hides a leaky bathtub: 75% of 2025 revenue comes from the 2024-2025 cohorts, while older cohorts contribute <2% each, risking collapse without a long-term base.

DAY 18 · April 14, 2026 · 2 SUMMARIES
FlowingData · Data Science & Visualization

Cleveland's Enduring Impact on Data Viz and Science

William Cleveland pioneered data visualization as a rigorous discipline via graphical perception studies and books like The Elements of Graphing Data, while outlining data science's foundations in 2001, shaping tools data workers use today.

Towards AI · AI & LLMs

AI SQL: Strengths, 4 Pitfalls, and Fix Checklist

AI reliably generates simple aggregations and boilerplate SQL but fails on fanout joins, wrong window frames, NULL mishandling, and dialect mismatches. Use a detailed prompt template and 6-point review checklist to catch errors fast.
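The fanout-join pitfall is easy to reproduce. The sketch below uses an invented two-table schema in SQLite to show how joining a parent table to a one-to-many child before aggregating inflates a sum, along with one possible way to avoid it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO order_items VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# Fanout: order 1 has three items, so its amount is counted three times.
inflated_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
""").fetchone()[0]

# One fix: test for matching items with EXISTS instead of joining to them.
correct_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM order_items i WHERE i.order_id = o.order_id)
""").fetchone()[0]

print(inflated_total, correct_total)
```

Comparing an aggregate before and after each join is a quick review step for catching this class of error in generated SQL.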
