Profile

Hi, I'm Kate

I’m a Full Stack Developer and Engineering Mentor, obsessed with regular expressions, books, and web technologies. In my work, I mix old with new, soft with hard, cats with dogs. When it’s not a disaster, it’s pure magic!

Latest content

Book reviews

This week I learned

Week of February 3, 2025: stringsAsFactors

Statisticians use the term “factors” to describe categorical variables, or enums. They are so essential that R coerces all character strings to be factors by default.

Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to treat expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There’s no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting ‘stringsAsFactors = TRUE’ when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.

Previous weeks are available in the archive.