This Week I Learned
Learning is a fundamental part of our daily lives as software engineers. I've grown so used to it that I often don't even notice when I learn something new. That's why I've created the "This Week I Learned" journal. Check it out below - or better yet, start your own!
Week of February 10, 2025: spurious correlations
Week of February 3, 2025: stringsAsFactors
Statisticians use the term “factors” to describe categorical variables, or enums. They are so essential that, for most of R’s history, character strings were coerced to factors by default when reading in data (R 4.0.0 changed the default to stringsAsFactors = FALSE).
Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels is expanded into 4 separate columns in your model matrix. There’s no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting ‘stringsAsFactors = TRUE’ when reading in tabular data makes total sense: if the data is just going to go into a regression model, then R is doing the right thing.
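Here is a minimal R sketch of that expansion, using a made-up toy data frame (df with a response y and a 5-level factor group): model.matrix() builds the design matrix that lm() and glm() construct internally, so you can see the 5 levels become an intercept plus 4 dummy columns.
# Toy data: a numeric response and a 5-level categorical predictor.
df <- data.frame(
  y     = rnorm(10),
  group = factor(rep(letters[1:5], times = 2))
)
# model.matrix() builds the design matrix used by lm()/glm():
# one intercept column plus 4 dummy columns (groupb, groupc, groupd, groupe).
head(model.matrix(y ~ group, data = df))
# The factor class is what tells R that "a".."e" are levels of one
# categorical variable to expand, rather than five unrelated strings.
Fitting lm(y ~ group, data = df) then produces one coefficient per dummy column, with the first level serving as the baseline absorbed into the intercept.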
Week of January 27, 2025: ANTLR
https://www.antlr.org/ (plus a ton of community grammars)
Week of January 20, 2025: Common Table Expressions (CTEs)
Complex SQL queries can be broken down into smaller parts using Common Table Expressions (CTEs):
WITH FilteredOrders AS (
    -- Keep only orders with a total above 100
    SELECT order_id, customer_id, total_amount
    FROM orders
    WHERE total_amount > 100
),
TopCustomers AS (
    -- Customers with more than 3 such orders
    SELECT customer_id, COUNT(*) AS order_count
    FROM FilteredOrders
    GROUP BY customer_id
    HAVING COUNT(*) > 3
)
SELECT customer_id, order_count
FROM TopCustomers;
Common Table Expressions have been part of the ANSI SQL standard since SQL:1999. Beware that the optimizer may materialize a CTE into a temporary table instead of merging it into the outer query (MySQL does this when the CTE contains aggregation, for example), which can introduce performance issues.
Week of January 13, 2025: Big Data is Dead
In 2004, when the Google MapReduce paper was written, it would have been very common for a data workload to not fit on a single commodity machine. […] Today, however, a standard instance on AWS uses a physical server with 64 cores and 256 GB of RAM. That’s two orders of magnitude more RAM. […]
One definition of “Big Data” is “whatever doesn’t fit on a single machine.” By that definition, the number of workloads that qualify has been decreasing every year.
On a separate note, it’s a lot of fun to debug memory leaks on machines with 256 GB of RAM.
Week of January 6, 2025: Nemawashi
To ensure you have everyone’s support, it’s helpful to spend time on a consensus-building practice called nemawashi, a process of seeking approval from each significant person on a proposed project before committing to a group decision.