About me

Welcome! I’m an NYC-based recent Data Science graduate from UC Berkeley’s College of Computing, Data Science, and Society (CDSS), originally from Guatemala City, Guatemala.

I’m proficient in Python, SQL, statistical modeling, A/B testing, and building data products end-to-end, from cleaning and feature engineering to analysis, dashboards, and stakeholder reporting. My work focuses on making messy data usable and communicating results clearly to both technical and non-technical audiences.

I work across statistics, data engineering, machine learning, and data analytics. As a Data Science Intern at UC Berkeley Financial Aid & Scholarships, I designed Python/SQL data pipelines and monitoring tools, cleaned and integrated multi-year student and aid datasets into relational schemas, ran A/B tests on outreach emails and student portals to boost FAFSA completion and responses to key financial-aid forms, and built AWS/GCP dashboards.

At the U.S. Department of Health and Human Services (HHS), I developed regression and optimization models and ran A/B tests on national CIL funding and utilization data, helped automate ETL pipelines with Python and APIs, and built BigQuery and Tableau dashboards to surface funding and usage trends for non-technical decision-makers.

As a Lead Researcher in Berkeley’s CDSS Discovery Research Program, I led a team analyzing federal reports for 350+ Centers for Independent Living, cleaning and standardizing multi-year CSVs into a state-level dataset, merging them with U.S. Census disability statistics, and engineering coverage and funding-per-disabled-resident metrics to capture “good” CIL performance. I used Python-based NLP (NLTK/VADER), correlation analysis, and visualizations (pandas, matplotlib, seaborn, Plotly) to relate funding, coverage, and reported achievements/challenges, and presented our results through Sigma/Tableau dashboards to HHS partners at Mathematica and at the 2024 CDSS Data Discovery Spring Symposium.

Besides programming and math, I also love traveling, reading, and trying new foods, especially from other parts of the world!

Feel free to browse my work and contact me through LinkedIn or email, or leave a comment on the "Contact" tab!

Resume

Download Resume

  1. On this page you'll find the extended version of my resume (all projects, 'Clubs & Leadership', and past education), or feel free to download the 1-page version!

    Main Resume

Education

  1. University of California, Berkeley (Berkeley, California)

    2021 — 2025

    B.A. Data Science with Domain Emphasis in Industrial Analytics (CDSS college)
    Relevant Coursework: Data Engineering, Data Mining & Analytics, Probability for Data Science, Data Inference & Decisions, Principles & Techniques of Data Science, Data Structures & Algorithms.
    Awards & Scholarships: Generation Change Scholar, James Hjul Scholar, Allmond Scholar, Albert Job Scholar.

  2. IB Diploma at Centro Escolar El Roble (Guatemala City, Guatemala)

    2019 — 2020

    Summa Cum Laude
    Activities and Societies: El Roble soccer team, volunteering at Guatemala's neurological institute (Instituto Neurológico de Guatemala) and the American Red Cross.

Experience

  1. Data Science Intern

    University of California Berkeley, Financial Aid Office (FAO)

    Berkeley, California

    Mar 2022 — Present

    – Cleaned and integrated multi-year student and aid datasets into relational schemas, optimizing SQL queries for large-scale reporting.
    – Ran A/B tests on outreach emails and student portals to boost FAFSA completion and critical financial-aid form response rates.
    – Built AWS/GCP dashboards on aid usage, unmet need, retention, and first-gen/low-income outcomes to support

  2. Data Science Intern

    U.S. Department of Health and Human Services (HHS)

    Washington DC (Remote)

    Jun 2024 - Sep 2024

    – Developed regression and optimization models and A/B tests on CIL funding and utilization data (2019 vs. 2022) to quantify post-pandemic underperformance.
    – Helped automate ETL pipelines using Python and APIs, reducing data processing time by 30%.
    – Built BigQuery and Tableau dashboards summarizing funding and usage trends for non-technical decision-makers.
    – Synthesized pre-pandemic and post-pandemic funding and utilization findings with executive directors, contributing statistical analyses and visualizations to a federal report scheduled for release in 2026.

  3. Lead Researcher @ CDSS Data Discovery Program

    UC Berkeley College of Computing, Data Science, and Society

    Washington DC (Remote)

    Jan 2024 - May 2024

    -Led a research team working with ACL, a branch of the U.S. Department of Health and Human Services (HHS), designing and implementing ETL pipelines to clean and structure unorganized datasets for 350 centers nationwide.
    -Developed SQL and PostgreSQL-based data models, optimizing storage solutions to ensure efficient query performance and scalability.
    -Engineered comprehensive end-to-end workflows to extract, transform, and load (ETL) data, providing actionable insights that directly influenced federal policy decisions.

Projects

  1. Endorsements & Outcomes: Causal & Predictive Analysis of 2022 U.S. Primaries

    Mar 2025 – May 2025

    -Asked two things: do endorsements cause wins, and can we predict results?
    -Cleaned/merged FiveThirtyEight data; engineered features (endorsements, incumbency, fundraising, state/party).
    -Causal: drew a DAG to pick confounders; estimated propensity scores (logit); nearest-neighbor matching + IPW; checked balance (SMD); estimated ATE/ATT with bootstrap CIs.
    -Predictive: stratified splits; L1/L2 logistic regression and class-weighted Random Forest; threshold tuning.
    -Validation: 5-fold CV, ROC-AUC/precision/recall/F1, calibration; model explanation via permutation importance/SHAP.
    -Result: endorsements—especially Trump’s—and incumbency were most predictive; best RF reached ~0.78 F1.
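A minimal sketch of the IPW step mentioned above, assuming the propensity scores have already been estimated. The function name `ipw_ate` and the toy data are made up for illustration; this is not the project's actual code or data.

```python
def ipw_ate(treated, outcome, propensity):
    """Inverse-probability-weighted (Horvitz-Thompson) estimate of the ATE.

    treated:    1 if the unit received the treatment, else 0
    outcome:    observed outcome for each unit
    propensity: estimated P(treated = 1 | covariates), strictly between 0 and 1
    """
    n = len(treated)
    # Weight treated outcomes by 1/e and control outcomes by 1/(1 - e).
    treated_mean = sum(t * y / e for t, y, e in zip(treated, outcome, propensity)) / n
    control_mean = sum((1 - t) * y / (1 - e) for t, y, e in zip(treated, outcome, propensity)) / n
    return treated_mean - control_mean


# Hypothetical toy data: two treated units, two controls, constant propensity.
treated = [1, 1, 0, 0]
outcome = [1, 1, 1, 0]
propensity = [0.5, 0.5, 0.5, 0.5]

ate = ipw_ate(treated, outcome, propensity)  # 1.0 - 0.5 = 0.5
```

With a constant propensity of 0.5 this reduces to a plain difference in group means; in practice the scores would vary by unit, and extreme values near 0 or 1 are usually trimmed before weighting.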

  2. Streaming Analysis (ML)

    Nov 2024 – Dec 2024

    -Engineered a churn label from Netflix user subscription data using join/last-payment dates and computed overall churn rates.
    -Built churn prediction models in Python with pandas and scikit-learn, including MLP neural networks, decision trees, and tuned random forests with GridSearchCV and feature importance.
    -Reduced feature space with PCA and performed customer segmentation using K-Means clustering.
    -Used IsolationForest to flag high-risk churn users and visualized churn patterns by country, age, device, and cluster with matplotlib/seaborn.

  3. Yelp Insights and NoSQL Data Processing with MongoDB

    Nov 2024 – Dec 2024

    -Modeled Yelp users, businesses, and reviews across MongoDB and PostgreSQL to compare document vs relational schemas and join strategies.
    -Wrote MongoDB aggregation pipelines (via PyMongo) for geospatial filters, city/state rollups, and keyword-based “to_avoid” review flags using text search.
    -Built SQL queries and joins in PostgreSQL to mirror MongoDB lookups and analyze performance differences with EXPLAIN ANALYZE.
    -Sampled data into pandas to study missingness, explode categories, and construct cleaner business feature tables for downstream analysis.

  4. Campus Sensor Data Cleaning and Time-Series Interpolation in PostgreSQL

    Oct 2024 – Nov 2024

    -Cleaned and standardized campus HVAC and energy sensor data in PostgreSQL, resolving inconsistent units and messy building/location metadata.
    -Used SQL (including JSON extraction) to explore schema, validate key relationships, and produce analysis-ready sensor tables.
    -Implemented robust outlier detection and winsorization with median/MAD to stabilize noisy sensor readings.
    -Built regular 15-minute time-series grids using GENERATE_SERIES and window functions with forward/backward fills and linear interpolation for missing data.
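A rough Python sketch of the median/MAD winsorization idea mentioned above. The project itself did this in PostgreSQL; the function name `winsorize_mad`, the cutoff `k=3.0`, and the sample readings are illustrative assumptions rather than the project's actual values.

```python
from statistics import median


def winsorize_mad(values, k=3.0):
    """Clip each value to the range median +/- k * scaled MAD.

    k is an assumed cutoff; 1.4826 rescales the MAD so it is comparable
    to a standard deviation when the data are roughly normal.
    """
    med = median(values)
    mad = 1.4826 * median(abs(v - med) for v in values)
    lo, hi = med - k * mad, med + k * mad
    # Values inside [lo, hi] pass through unchanged; outliers are pulled to the edge.
    return [min(max(v, lo), hi) for v in values]


# Made-up sensor readings with one obvious spike.
readings = [20.1, 20.4, 19.8, 20.0, 95.0, 20.3]
cleaned = winsorize_mad(readings)
```

Because the median and MAD are barely affected by the spike, the clipping bounds stay close to the typical readings, which is why this approach is more robust than clipping at mean plus or minus a few standard deviations.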

  5. Neural Network for Survival Rates in Titanic

    Sep 2024 – Oct 2024

    -Built an end-to-end survival prediction model on the Titanic Kaggle dataset.
    -Engineered features and handled missing data.
    -Used scikit-learn pipelines for preprocessing plus a regularized neural network with dropout and batch norm.
    -Evaluated model performance and generated a Kaggle-ready CSV submission of survival predictions.

  6. BYOW (build your own world)

    Nov 2023 - Dec 2023

    -Generated a 2D avatar world built on several data structures, seeded by the user so worlds are repeatable and deterministic, writing 3,000+ lines of quality Java code.
    -Guaranteed a connected map and smooth avatar movement; used BFS for connectivity and a wall/floor collision map with W/A/S/D loop.
    -Implemented save/load and testing; serialized game state (:Q/L), parsed inputs, and wrote a range of JUnit tests.
    -Employed LinkedLists, ArrayLists and HashSets to provide rapid data retrieval and management.

  7. NGordNet

    Oct 2023 – Nov 2023

    -Browser-based tool for exploring the history of word usage in English texts using large datasets (50,000 words).
    -Ingested Google N-grams and WordNet; implemented TimeSeries/NGramMap with HashMap/TreeMap, normalized frequencies, and efficient year-range aggregation.
    -Modeled WordNet as a synset DAG; computed hyponyms via BFS/DFS + set unions; cached results to avoid repeated traversals.
    -Exposed endpoints with lightweight Java web handlers (Spark/Jetty scaffold), returning JSON for frequency-filtered hyponyms across time windows.
    -Wrote JUnit tests; profiled asymptotics; reduced I/O via memoization and precomputed indices.

  8. Spam and 'Ham'

    Nov 2023 – Nov 2023

    -Loaded and cleaned labeled spam/ham emails in pandas (lowercasing, missing-text imputation).
    -Explored keyword patterns in spam vs ham using custom binary word indicators and seaborn visualizations.
    -Engineered a feature matrix with a words_in_texts function capturing presence of selected terms.
    -Trained and evaluated a scikit-learn logistic regression spam classifier vs a naive baseline using accuracy, precision, recall, and false-positive rate.
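A minimal pure-Python sketch of what a `words_in_texts` feature function like the one above might look like. The signature and the example emails are assumptions for illustration; the original presumably worked on pandas/NumPy data rather than plain lists.

```python
def words_in_texts(words, texts):
    """Return one row per text, with a 1/0 flag for each word's presence.

    Matching is by substring, so "free" also matches "freedom".
    """
    return [[int(word in text) for word in words] for text in texts]


# Made-up, already-lowercased example emails.
emails = ["free money now", "meeting at noon", "claim your free prize"]
features = words_in_texts(["free", "prize"], emails)
# features == [[1, 0], [0, 0], [1, 1]]
```

Each row can then be fed to a classifier such as logistic regression, with one column per selected term.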

Clubs & Leadership

  1. Computer Science Intern

    Open Project @ Berkeley

    Berkeley, California

    Jan 2023 - Jun 2023

    -Participated in one of Berkeley's biggest CS clubs.
    -Aimed to address a wide range of student problems; that semester, my team and I worked on an improved version of the University's official student portal.

  2. Forum Committee Vice President

    Latin American Leadership Society @ Berkeley

    Berkeley, California

    Jan 2022 - Jun 2022

    -Organized and executed large-scale events for 100+ attendees, leveraging data-driven tools to streamline scheduling and participation.
    -Utilized technology platforms to optimize communication and solve logistical challenges, enhancing overall event execution.
    -Promoted awareness of complex societal issues affecting Latin American communities both in the U.S. and abroad, fostering meaningful discussions and solutions.

Portfolio

Academia

Download Diploma

  1. On this page you'll find a more detailed description of the coursework I took at Berkeley, along with my bachelor's diploma.

    B.A. Data Science, University of California Berkeley

Academia

  1. Principles and Techniques of Data Science (DATA100)

    Pandas/NumPy, SQL, EDA/visualization; feature engineering; linear/logistic regression

  2. Data Engineering (DATA101)

    SQL, schema design, indexing, transactions; ETL pipelines with Pandas/Spark; files/partitioning; cloud storage; basic Airflow-style orchestration.

  3. Data Inference and Decisions (DATA102)

    Hypothesis tests, A/B testing, causal inference, logistic/linear models, Bayesian statistics; NumPy/Pandas/Statsmodels/Scikit-learn.

  4. Data Mining & Analytics (DATA144)

    Feature engineering; clustering/classification/regression, PCA; trees/ensembles, model selection; Scikit-learn, PyTorch.

  5. Probability for Data Science (DATA140)

    Probability, random variables, Bayes, LLN/CLT, Markov chains; MLE/MAP; Python with NumPy/SciPy.

  6. Structure and Interpretation of Computer Programs (CS61A)

    Python fundamentals, higher-order functions, recursion, OOP, Scheme, SQL.

  7. Data Structures and Algorithms (CS61B)

    Java data structures (lists, trees, graphs, heaps, hash tables), algorithms, complexity, JUnit/Git, discrete math.

  8. Foundations of Data Science (DATA8)

    Python, NumPy/Pandas, visualization, sampling, basic inference/regression.

  9. The Beauty and Joy of Programming (CS10)

    Computational thinking, abstraction, algorithms; Snap!/Python projects.

  10. Introduction to Biomedicine for Engineers (BIOENG10)

    Physiology/biomed basics; quantitative modeling and analysis (MATLAB/Python).

  11. Linear Algebra and Differential Equations (MATH54)

    Linear algebra, differential equations for modeling, systems, and optimization

  12. Math for Scientists and Engineers (MATH1A)

    Single-variable calculus (limits, derivatives, integrals).

  13. Math for Scientists and Engineers II (MATH1B)

    Integral techniques, series, differential equations, double integration, partial derivatives.

  14. Physics for Scientists and Engineers (PHYSICS7A)

    Mechanics/energy/oscillations; lab data analysis and error propagation.

Contact

Contact Form