What do stacking and unstacking mean from a mathematical point of view?
First one has to understand what it means to represent data in a data frame.
I like to think of a column of a data frame as a function.
I am going to assume that all the columns are of the same type.
In my notation, I will assume that the “innermost index” is last.
A data frame thus represents a function of type
\[
(\ldots,S_1, S_0) \to (\ldots,X_1, X_0) \to Y
\]
This means that the row index is given by \(\ldots,S_1,S_0\) and
the column index by \(\ldots,X_1,X_0\).
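This view can be made concrete in pandas. A minimal sketch, with illustrative labels of my choosing: the row index plays the role of \((S_1, S_0)\), the column index of \((X_1, X_0)\), and looking up a cell is exactly applying the function to both tuples.

```python
import numpy as np
import pandas as pd

# Row index (S1, S0) and column index (X1, X0); the labels are illustrative.
rows = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["S1", "S0"])
cols = pd.MultiIndex.from_product([["p", "q"], ["x", "y"]], names=["X1", "X0"])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Evaluating the function at ((s1, s0), (x1, x0)) is a .loc lookup:
df.loc[("a", 0), ("p", "y")]  # -> 1
```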
The operation of stacking is the transformation to the type:
\[
(\ldots,S_1, S_0, X_0) \to (\ldots,X_1) \to Y
\]
This operation makes the data frame taller.
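In pandas this is `DataFrame.stack`. A self-contained sketch (labels illustrative): stacking moves the innermost column level \(X_0\) into the row index, so \((S_1, S_0) \to (X_1, X_0) \to Y\) becomes \((S_1, S_0, X_0) \to (X_1) \to Y\).

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["S1", "S0"])
cols = pd.MultiIndex.from_product([["p", "q"], ["x", "y"]], names=["X1", "X0"])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Stack the innermost column level X0 into the row index.
taller = df.stack("X0")
print(taller.index.names)  # ['S1', 'S0', 'X0']
print(taller.shape)        # (8, 2): twice as tall, half as wide
```

The value at any cell is preserved; only where it lives changes: `taller.loc[("a", 0, "y"), "p"]` equals `df.loc[("a", 0), ("p", "y")]`.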
The operation of unstacking is the reverse transformation, namely:
\[
(\ldots,S_1) \to (\ldots, X_1, X_0, S_0) \to Y
\]
This operation makes the data frame wider.
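Symmetrically, `DataFrame.unstack` moves the innermost row level \(S_0\) into the column index, turning \((S_1, S_0) \to (X_1, X_0) \to Y\) into \((S_1) \to (X_1, X_0, S_0) \to Y\). A self-contained sketch with the same illustrative labels:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["S1", "S0"])
cols = pd.MultiIndex.from_product([["p", "q"], ["x", "y"]], names=["X1", "X0"])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Unstack the innermost row level S0 into the column index.
wider = df.unstack("S0")
print(wider.columns.names)  # ['X1', 'X0', 'S0']
print(wider.shape)          # (2, 8): half as tall, twice as wide
```

Note that `unstack` appends the moved level as the *innermost* column level, matching the position of \(S_0\) in the type above.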
There are a few outstanding and highly influential books on core machine learning and Bayesian statistics. (Many more cover related topics, such as machine learning from a classical statistics perspective, or practical subjects such as deep learning.)
All are available as free PDFs (click on the book covers).

The go-to reference for fundamental machine learning concepts. Covers regression, kernel methods, graphical models, and variational inference with clear explanations and practical examples.
It is never earth-shattering, but it quickly gives you the essence of those methods.
A unique masterpiece by one of the brightest minds in the field. More advanced than Bishop, it offers profound insights into statistics, entropy, and information theory. Don’t miss the exceptional proof of the noisy channel theorem in §10.3.
Murphy’s books and survey articles are comprehensive references covering nearly all machine learning algorithms, though less groundbreaking than the others.