Recursive/Incremental file listing in HDFSInputFormat #302

Open
kygx-legend opened this issue Nov 27, 2017 · 1 comment

Comments

@kygx-legend
Member

kygx-legend commented Nov 27, 2017

In the current version, HDFSInputFormat reads only the top level of the given directory (path). For example, if the path is /data, it lists the contents of /data and reads each entry (which must be a file), such as /data/a and /data/b.

To be more flexible, it could support reading a structured path recursively (where all files sit in the leaf directories). For example, if the data is stored under a time-based layout like /data/year/month/dates/FILES, it would be preferable to scan everything under `/data` rather than having to give a concrete path such as `/data/year/month/dates`. Of course, we would need to set a maximum recursion depth to avoid reading an excessive number of files.
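
To make the idea concrete, here is a minimal sketch of depth-limited recursive listing. It is not HDFSInputFormat code: it walks the local filesystem with Python's `os.scandir` as a stand-in for an HDFS directory listing, and the names `list_files_recursive` and `max_depth` are hypothetical, chosen only for illustration.

```python
import os
from typing import Iterator


def list_files_recursive(root: str, max_depth: int) -> Iterator[str]:
    """Yield file paths under root, descending at most max_depth levels.

    max_depth=0 yields only the files directly inside root, which matches
    the single-level behaviour described above.
    NOTE: local filesystem stand-in; a real version would call the HDFS
    client's directory listing instead of os.scandir.
    """
    # Explicit stack of (directory, depth) pairs, so deep trees do not
    # hit Python's recursion limit.
    stack = [(root, 0)]
    while stack:
        directory, depth = stack.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_file():
                    yield entry.path
                elif entry.is_dir() and depth < max_depth:
                    # Only descend while we are above the depth limit.
                    stack.append((entry.path, depth + 1))
```

With a layout like /data/year/month/dates/FILES, calling `list_files_recursive("/data", 3)` would reach the files in the dates directories, while a smaller limit would stop before them. The order in which files are yielded is not specified.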

@ddmbr
Member

ddmbr commented Nov 27, 2017

It would also be better to avoid listing all the files at once, as there could be too many. We could list the files batch by batch.
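
As a rough illustration of batch-by-batch listing, the sketch below pulls paths lazily from a listing and hands them out in fixed-size groups, so only one batch is held in memory at a time. Again this uses the local filesystem via `os.scandir` as a stand-in for HDFS, and `list_files_lazily`, `in_batches` and `batch_size` are hypothetical names, not part of HDFSInputFormat.

```python
import os
from itertools import islice
from typing import Iterable, Iterator, List


def list_files_lazily(directory: str) -> Iterator[str]:
    """Yield file paths in directory one at a time (local stand-in for HDFS)."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


def in_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group any iterable of paths into lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(paths)
    while True:
        # islice consumes only the next batch_size items from the source.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A caller could then write `for batch in in_batches(list_files_lazily("/data"), 1000): ...` and process each batch before the next one is listed. The batch size of 1000 here is arbitrary.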

@ddmbr ddmbr changed the title HDFSInputFormat can be improved Recursive/Incremental file listing in HDFSInputFormat Nov 27, 2017