# README
## File Data Source
Turn files into a SQL-queryable data source.
Allows files (CSV, JSON, custom protobuf) in cloud storage (Google Storage, S3, etc.) to be queried with traditional SQL. These files can also use custom serialization, compression, and encryption.
## Design Goals
- Hackable Stores: it should be easy to add stores for Google Storage, local files, S3, etc.
- Hackable File Formats: Protobuf files, WAL files, MySQL binlogs, etc.
## Developing new stores or file formats
- FileStore: defines a file storage factory (S3, Google Storage, local files, SFTP, etc.).
- StoreReader: a configured, initialized instance of a specific FileStore. It is the file storage reader (writer) used for finding lists of files and opening files.
- FileHandler: FileHandlers handle processing files from a StoreReader. Developers register FileHandler implementations in the Registry. A FileHandler creates a FileScanner that iterates the rows of a file.
- FileScanner: file row reading, that is, how to transform the contents of a file into a qlbridge.Message for use in the query engine. Currently CSV and JSON types are supported.
## Example: Query CSV Files
We are going to create a CSV database of baseball data from http://seanlahman.com/baseball-archive/statistics/

    # download files to local /tmp
    mkdir -p /tmp/baseball
    cd /tmp/baseball
    curl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip > bball.zip
    unzip bball.zip
    mv baseball*/core/*.csv .
    rm bball.zip
    rm -rf baseballdatabank-*

    # run a docker container locally
    docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
      -v /tmp/baseball:/tmp/baseball \
      gcr.io/dataux-io/dataux:latest
In another console, open the MySQL client:

    # connect to the docker container you just started
    mysql -h 127.0.0.1 -P4000

    -- Now create a new Source
    CREATE source baseball WITH {
      "type":"cloudstore",
      "schema":"baseball",
      "settings" : {
         "type": "localfs",
         "format": "csv",
         "path": "baseball/",
         "localpath": "/tmp"
      }
    };

    show databases;
    use baseball;
    show tables;
    describe appearances;
    select count(*) from appearances;
    select * from appearances limit 10;
## TODO
- implement event-notification
- SFTP
- Change the store library? Currently https://github.com/lytics/cloudstorage but consider:
# Functions
Convert a cloudstorage object to a File.
FileStoreLoader defines the interface for loading files.
NewFilePager creates a new default FilePager.
NewFileSource provides a singleton manager for a particular Source Schema, and File-Handler to read/manage all files from a source such as gcs folder x, s3 folder y.
NewJsonHandler creates a JSON file handler for paging the newline-delimited rows of a JSON file.
RegisterFileHandler registers a FileHandler, making it available under the provided @scannerType.
RegisterFileStore registers a FileStore factory implementation in the global registry under the provided @storeType.
Finds the table name given the full path of a file and the path of the tables.
# Constants
SourceType is the registered Source name in the qlbridge source registry.
# Variables
Default size of the file queue buffered by the pager.
FileColumns are the default columns for the "file" table.
# Structs
FileInfo describes a single file. Say a folder of "./tables" is the "root path" specified, and it has folders for "table names" underneath a redundant "tables" folder:

    ./tables/
            /tables/
                   /appearances/appearances1.csv
                   /players/players1.csv

    Name = "tables/appearances/appearances1.csv"
    Table = "appearances"
    PartialPath = tables/appearances

FilePager acts like a partitioned data source Conn, wrapping the underlying FileSource and paging through the list of files, scanning only those that match this pager's partition. By default the partitionct is -1, which means no partitioning.
FileReader holds file info and access to the file, to supply to ScannerMakers.
FileSource is a Source for reading and scanning files, allowing their contents to be treated as a database, like doing a full table scan in MySQL.
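The FilePager description says a pager scans only the files that match its partition, with a partition count of -1 disabling partitioning. How files are assigned to partitions is not stated here, so the sketch below assumes one plausible scheme: an FNV-1a hash of the file name modulo the partition count. `partitionFor` and `ownsFile` are hypothetical names, and treating every count below 1 as "no partitioning" is also an assumption.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor is a hypothetical assignment rule: hash the file name
// with FNV-1a and reduce it modulo the partition count.
// partitionCt must be at least 1.
func partitionFor(name string, partitionCt int) int {
	h := fnv.New32a()
	h.Write([]byte(name)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(partitionCt))
}

// ownsFile reports whether the pager for the given partition should scan
// the file. A partitionCt of -1 (here: anything below 1) means no
// partitioning, so every pager scans every file.
func ownsFile(name string, partition, partitionCt int) bool {
	if partitionCt < 1 {
		return true
	}
	return partitionFor(name, partitionCt) == partition
}

func main() {
	files := []string{
		"tables/appearances/appearances1.csv",
		"tables/players/players1.csv",
	}
	const partitionCt = 4
	for _, f := range files {
		fmt.Printf("%s -> partition %d of %d\n", f, partitionFor(f, partitionCt), partitionCt)
	}
}
```

Because the assignment depends only on the file name, each file is scanned by exactly one partition, and pagers need no coordination to agree on who owns what.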
# Interfaces
FileHandler defines an interface for developers to build new file processing, allowing custom file types to be queried with SQL.
FileHandlerSchema - file handlers may optionally provide info about tables contained.
FileHandlerTables - file handlers may optionally provide info about tables contained in store.
FileReaderIterator defines a file source that can page through files getting next file from partition.
FileStore defines a handler for reading files, understanding folders, and creating scanners/formatters for files.
# Type aliases
FileStoreCreator defines a Factory type for creating FileStore.