[df] Allow reading a char column into a numpy array#22572

Open
vepadulano wants to merge 1 commit into root-project:master from vepadulano:gh-22554

@vepadulano

Member

In the AsNumpy operation, values of the dataset are read into a ROOT::RVec collection of the corresponding column type. Subsequently, the raw data is accessed from the RVec and used to generate the array interface for a numpy array view on the collected data.

When the column is of type char, and thus RDF would read values into a ROOT::RVec<char>, the raw data is accessed as a 'char *'. The Python bindings automatically convert 'char *' and 'const char *' to Python strings for full compatibility with existing functions (e.g. otherwise TObject::GetName would not return a string in Python). Thus, the array interface cannot be generated.
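For readers unfamiliar with this kind of automatic conversion, here is a minimal sketch of the same effect using the standard-library `ctypes` module. It is only an analogy: it does not use ROOT or its Python bindings, and the buffer contents are made up for illustration.

```python
import ctypes

# Hypothetical raw column data: five char values, one of which is 0.
# create_string_buffer appends a trailing NUL, so the buffer holds 6 bytes.
buf = ctypes.create_string_buffer(b"AB\x00CD")

# Accessed as 'char *', ctypes hands back a Python bytes string that stops
# at the first NUL byte, so the numeric content of the buffer is lost.
as_char_p = ctypes.cast(buf, ctypes.c_char_p)
string_value = as_char_p.value

# Accessed as 'unsigned char *', the same memory yields plain integers.
as_ubyte = ctypes.cast(buf, ctypes.POINTER(ctypes.c_ubyte))
byte_values = [as_ubyte[i] for i in range(5)]

print(string_value)  # b'AB'
print(byte_values)   # [65, 66, 0, 67, 68]
```

The string view is convenient for names and titles, but it is not a usable handle on an array of numbers, which is the situation described above.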

This commit proposes to introduce a special behaviour in AsNumpy to automatically view the char column as an 'unsigned char' column. This in turn avoids the automatic conversion on the Python side. An array of 'unsigned char' is interpreted as a numpy array with dtype uint8.
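A consequence worth spelling out: if char is signed on the platform (an assumption; the C++ standard leaves this implementation-defined), negative values show up as values of 128 or above once the same bytes are viewed as 'unsigned char'. A minimal standard-library sketch with made-up values, not the AsNumpy implementation itself:

```python
import array

# Hypothetical column contents stored as signed 8-bit integers.
signed_values = array.array("b", [65, 66, -1, -128])

# Reinterpret the same bytes as unsigned 8-bit integers without copying.
unsigned_view = memoryview(signed_values).cast("B")
unsigned_values = list(unsigned_view)

print(unsigned_values)  # [65, 66, 255, 128]
```

On the numpy side, an int8 interpretation of the same bytes can be obtained from a uint8 array with `arr.view(numpy.int8)`, which reinterprets the data without copying.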

Since this is a decision which might be unexpected to some users, the commit also proposes to let the user know about it via a warning.

Fixes #22554

@guitargeek guitargeek left a comment

Contributor

Thanks for the PR! If we're already printing a warning, can we use this opportunity to suggest more appropriate column types?

@guitargeek guitargeek left a comment

Contributor

LGTM!

Development

Successfully merging this pull request may close these issues.

[RDataFrame] char type is not recognized in RDF.AsNumpy

2 participants