Fix explanation #47
Code Review
This pull request refactors the DataFrame creation in woodtapper/example_sampling/base.py by using np.column_stack to combine features and targets instead of concatenating separate DataFrames. Feedback highlights that using np.column_stack forces a homogeneous NumPy dtype, potentially causing unwanted upcasting (e.g., converting integer targets to floats). Additionally, the hardcoded 'target' column name could cause a shape mismatch for multi-output targets. A code suggestion was provided to assign target columns dynamically to preserve dtypes and handle multi-output targets correctly.
+ feature_names = [f"feature_{j}" for j in range(self.train_X.shape[1])] + [
+     "target"
+ ]
  for i in range(most_similar_idx.shape[0]):
-     df_covariates = pd.DataFrame(
-         self.train_X[most_similar_idx[i]],
-         columns=[f"feature_{j}" for j in range(self.train_X.shape[1])],
+     combined = np.column_stack(
+         (
+             self.train_X[most_similar_idx[i]],
+             self.train_y[most_similar_idx[i]],
+         )
      )
-     df_target = pd.DataFrame(
-         self.train_y[most_similar_idx[i]], columns=["target"]
-     )
-     df = pd.concat([df_covariates, df_target], axis=1)
+     df = pd.DataFrame(combined, columns=feature_names)
      list_.append(df)
Using np.column_stack to combine self.train_X and self.train_y before creating the DataFrame forces a single homogeneous NumPy dtype. If self.train_X contains floats and self.train_y contains integers (e.g., class labels) or objects, they will be upcast to a common dtype (such as float64 or object). This results in losing the original dtypes in the returned DataFrame (e.g., integer targets becoming floats).
Additionally, if self.train_y is multi-output (2D), the hardcoded "target" column name in feature_names will cause a shape mismatch error when creating the DataFrame.
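Both problems can be reproduced in isolation. The snippet below is a minimal sketch, not the project's code: `train_X`, `train_y`, and `train_y_2d` are small made-up arrays standing in for `self.train_X` and `self.train_y`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for self.train_X and self.train_y.
train_X = np.array([[0.5, 1.5], [2.5, 3.5]])  # float features
train_y = np.array([0, 1])                    # integer class labels

# Stacking forces one common dtype for the whole array.
combined = np.column_stack((train_X, train_y))
feature_names = ["feature_0", "feature_1", "target"]
df = pd.DataFrame(combined, columns=feature_names)
target_dtype = df["target"].dtype  # the labels are now floats

# Multi-output target: two target columns but only one "target" name.
train_y_2d = np.array([[0, 1], [1, 0]])
combined_2d = np.column_stack((train_X, train_y_2d))  # shape (2, 4)
try:
    pd.DataFrame(combined_2d, columns=feature_names)  # 3 names, 4 columns
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```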
We can avoid both issues and improve performance by creating the DataFrame from self.train_X first, and then assigning the target column(s) directly.
feature_names = [f"feature_{j}" for j in range(self.train_X.shape[1])]
for i in range(most_similar_idx.shape[0]):
    df = pd.DataFrame(self.train_X[most_similar_idx[i]], columns=feature_names)
    if self.train_y.ndim == 1:
        df["target"] = self.train_y[most_similar_idx[i]]
    elif self.train_y.shape[1] == 1:
        df["target"] = self.train_y[most_similar_idx[i]][:, 0]
    else:
        for j in range(self.train_y.shape[1]):
            df[f"target_{j}"] = self.train_y[most_similar_idx[i]][:, j]
    list_.append(df)
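To check the suggested logic outside the class, it can be wrapped in a standalone function. This is a sketch under assumptions: `build_sample_frame` is a hypothetical helper name, and `X`, `y_labels`, `y_multi`, and `idx` are made-up inputs standing in for `self.train_X`, `self.train_y`, and one row of `most_similar_idx`.

```python
import numpy as np
import pandas as pd


def build_sample_frame(X, y, idx):
    """Hypothetical helper: build one DataFrame for a single set of indices."""
    feature_names = [f"feature_{j}" for j in range(X.shape[1])]
    df = pd.DataFrame(X[idx], columns=feature_names)
    if y.ndim == 1:
        # 1D target: assigned as its own column, so its dtype is kept.
        df["target"] = y[idx]
    elif y.shape[1] == 1:
        # 2D target with one column: flatten to a single "target" column.
        df["target"] = y[idx][:, 0]
    else:
        # Multi-output target: one column per output.
        for j in range(y.shape[1]):
            df[f"target_{j}"] = y[idx][:, j]
    return df


# Made-up data for illustration only.
X = np.array([[0.5, 1.5], [2.5, 3.5], [4.5, 5.5]])
y_labels = np.array([0, 1, 1])
y_multi = np.array([[0, 10], [1, 11], [1, 12]])
idx = np.array([2, 0])

df_single = build_sample_frame(X, y_labels, idx)
df_multi = build_sample_frame(X, y_multi, idx)
```

With these inputs, the single-output frame keeps an integer `target` column, and the multi-output frame gets `target_0` and `target_1` instead of failing on a column count mismatch.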
No description provided.