Fix type of the a batch returned by make_batch_reader when TransformSpec's function returns column with all values being None#750
Conversation
|
|
||
| @pytest.mark.parametrize('null_column_dtype', [np.float64, np.unicode_]) | ||
| @pytest.mark.parametrize('reader_factory', _D) | ||
| def test_transform_spec_returns_all_none_values(scalar_dataset, null_column_dtype, reader_factory): |
There was a problem hiding this comment.
These test cases assume the transform_spec func is creating the null values. In the more common case, there are missing values in fields unedited by the transform_spec. I believe this solution already addresses both cases but it would be good to demonstrate this in the tests either with an additional test case or additional non-edited fields in this test case.
| @pytest.mark.parametrize('reader_factory', _D) | ||
| def test_transform_spec_returns_all_none_values_in_a_list_field(scalar_dataset, reader_factory): | ||
| def fill_id_with_nones(x): | ||
| return pd.DataFrame({'int_fixed_size_list': [[None for _ in range(3)]] * len(x)}) |
There was a problem hiding this comment.
We also ran into the NoneType issue with lists of strings. Consider adding string types to the test as well.
The NoneType problem also occurs when only some of the values in the list are None, e.g. ['a', 'b', None]. What about a test_transform_spec_returns_some_none_values_in_a_list_field?
…rSpec's function sets an entire column to None. Resolves uber#744 We implemented a Unischema->Pyarrow-schema conversion and explicitly set the pyarrow schema when converting a pandas dataframe returned by transform spec function into a pyarrow table. This way, pyarrow does not have to guess the type of data from the data itself (which it obviously could not do before, since all values were None).
The test tests properly columns with scalars only. Will need to verify correct behavior with columns that are lists separately. Will extend the tests in the following commits.
…all elements being None. Change the type of numpy dtype to np.object instead of np.unicode_
5d28818 to
1fcf22f
Compare
|
Yevgeni Litvin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Resolves #744
We implemented a Unischema->Pyarrow-schema conversion and explicitly
set the pyarrow schema when converting a pandas dataframe returned by
transform spec function into a pyarrow table. This way, pyarrow does not
have to guess the type of data from the data itself (which it obviously
could not do before, since all values were None).