ROAPI: Turn a Parquet File into a REST API in One Command

Hook

What if you could query a 50GB Parquet file sitting in S3 with GraphQL, connect to it with psql, and expose it as a REST API—all without writing a single line of code or spinning up a database?

Context

Data engineers live in a peculiar purgatory. You’ve got gigabytes of analytics data exported as Parquet files, CSV dumps from data pipelines, or JSON logs sitting in S3. Analysts want to query it with SQL. Frontend developers need a REST API. BI tools expect a Postgres connection. The traditional solution? Import everything into a database, write API endpoints, configure ORMs, manage schemas, handle migrations. Hours of plumbing work just to expose static data.

ROAPI exists because this dance is absurd for read-only datasets. Built on Apache Arrow and DataFusion, it’s a zero-code API generator that treats files as first-class queryable resources. Point it at a Parquet file, CSV, Delta Lake table, or even a Google Sheet, and instantly get REST, GraphQL, SQL, and Postgres wire protocol access. No ETL. No schema definitions. No server code. It’s the realization that most analytical data doesn’t need ACID transactions or write capabilities—it just needs to be queried efficiently through whatever interface your tools expect.

Technical Insight

ROAPI’s architecture is refreshingly simple: it’s a translation layer sitting atop DataFusion’s query engine. When you start ROAPI, you define tables via CLI flags or a config file, pointing to data sources. The tool automatically infers schemas using Arrow’s type system, loads data into columnar format, and exposes multiple query frontends that all funnel into the same DataFusion execution engine.

Here’s the simplest possible deployment—turning a Parquet file into a queryable API:

roapi --table "users=s3://my-bucket/users.parquet"

That’s it. ROAPI now serves a REST API at http://localhost:8080/api/tables/users with automatic filtering, sorting, and projection. You can query with URL parameters: /api/tables/users?filter[age]gt=25&columns=name,email&limit=100. For more complex queries, ROAPI exposes a SQL endpoint (/api/sql) that accepts POST requests with raw SQL in the body, executing them directly against your Parquet file through DataFusion.
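Client code can assemble those query parameters programmatically. A minimal Python sketch (the helper and its double-underscore filter convention are our own illustration, not part of ROAPI; only the columns/sort/limit/filter[column]op parameter names come from ROAPI’s REST frontend):

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080/api/tables"  # default local ROAPI address

def table_query_url(table, columns=None, sort=None, limit=None, **filters):
    """Build a ROAPI REST query URL.

    Filter keyword arguments use a column__op convention (purely
    illustrative), e.g. age__gt=25 becomes filter[age]gt=25.
    """
    params = []
    for key, value in filters.items():
        column, _, op = key.partition("__")
        params.append((f"filter[{column}]{op or 'eq'}", value))
    if columns:
        params.append(("columns", ",".join(columns)))
    if sort:
        params.append(("sort", sort))
    if limit is not None:
        params.append(("limit", limit))
    return f"{BASE}/{table}?{urlencode(params)}"

url = table_query_url("users", columns=["name", "email"], limit=100, age__gt=25)
```

Fetching the resulting URL with any HTTP client returns the matching rows as JSON.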

The real power emerges with the config-based approach. Create a roapi.yml:

tables:
  - name: analytics_events
    uri: "s3://data-lake/events/*.parquet"
  - name: user_profiles
    uri: "https://api.example.com/exports/users.csv"
  - name: product_catalog
    uri: "./local/products.json"
    option:
      format: "json"
      array_encoded: true

Now run roapi -c roapi.yml and you’ve got three tables queryable via multiple interfaces. The architecture handles format detection automatically—Parquet gets memory-mapped for zero-copy reads, CSV streams through Arrow’s parser, JSON converts to columnar representation. DataFusion’s query optimizer handles predicates, projections, and aggregations.
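With the server up, the SQL frontend is just an HTTP POST with the query as the request body. A stdlib-only sketch that builds (but does not send) such a request, assuming a local instance on the default port:

```python
import urllib.request

def sql_request(query, base="http://localhost:8080"):
    """Build a POST request for ROAPI's /api/sql endpoint; pass the
    result to urllib.request.urlopen() to actually execute it."""
    return urllib.request.Request(
        f"{base}/api/sql",
        data=query.encode("utf-8"),
        method="POST",
    )

req = sql_request("SELECT count(*) FROM analytics_events")
# urllib.request.urlopen(req).read() would return the result rows as JSON.
```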

The Postgres wire protocol support is particularly clever. ROAPI implements enough of the Postgres protocol that tools like psql, pgAdmin, or even ORMs can connect:

psql -h localhost -p 5433

Inside psql, you can run standard SQL against your files:

SELECT 
  date_trunc('day', timestamp) as day,
  count(*) as events,
  avg(duration) as avg_duration
FROM analytics_events
WHERE event_type = 'page_view'
GROUP BY day
ORDER BY day DESC
LIMIT 30;

Under the hood, ROAPI parses the Postgres wire protocol messages, translates the SQL into a DataFusion logical plan, executes it over the Arrow columnar data, and streams results back in Postgres format. Your client has no idea it’s talking to Parquet files instead of a real database.
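The same applies to programmatic clients: any standard Postgres driver should connect. A sketch using psycopg2 (a third-party driver; the connection parameters mirror the psql example above and are assumptions about your deployment):

```python
# Connection string mirroring the psql example (port 5433); adjust as needed.
DSN = "host=localhost port=5433 dbname=postgres"

QUERY = """
SELECT event_type, count(*) AS events
FROM analytics_events
GROUP BY event_type
ORDER BY events DESC
"""

if __name__ == "__main__":
    import psycopg2  # third-party: pip install psycopg2-binary

    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for event_type, events in cur.fetchall():
                print(event_type, events)
    finally:
        conn.close()
```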

The GraphQL interface adds another dimension. ROAPI auto-generates GraphQL schemas from table metadata, supporting filtering, sorting, and field selection:

query {
  user_profiles(filter: "country='US'", limit: 10) {
    name
    email
    created_at
  }
}
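Like the SQL frontend, GraphQL queries travel as a plain HTTP POST, served at /api/graphql. A stdlib-only sketch that builds such a request (the base URL and the simple limit-only query are illustrative):

```python
import urllib.request

GRAPHQL_QUERY = "query { user_profiles(limit: 10) { name email } }"

req = urllib.request.Request(
    "http://localhost:8080/api/graphql",
    data=GRAPHQL_QUERY.encode("utf-8"),
    method="POST",
)
# urllib.request.urlopen(req).read() would return the selected fields as JSON.
```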

All query frontends—REST, SQL, GraphQL, Postgres wire protocol—share the same DataFusion execution engine, which means optimizations apply universally. DataFusion’s query optimizer handles predicate pushdown, partition pruning (for partitioned Parquet files), and columnar batch processing. When you filter on a column, only that column gets decoded. When you aggregate, DataFusion uses SIMD-accelerated columnar operations.

For production deployments, ROAPI supports key-value lookups by designating specific columns as keys. This enables fast point queries without scanning entire datasets—critical for use cases like enrichment APIs or lookup tables. The tool also handles schema evolution gracefully through Arrow’s schema union capabilities, automatically adapting to new columns in appended Parquet files.
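Key-value stores are declared separately from tables in the config. A sketch of what this looks like in roapi.yml (the store name and columns here are hypothetical; the kvstores block maps one key column to one value column):

```yaml
kvstores:
  - name: user_email
    uri: "s3://data-lake/users.parquet"
    key: id
    value: email
```

A GET to /api/kv/user_email/<id> then returns that user’s email as a point lookup rather than a scan.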

Gotcha

ROAPI’s biggest limitation is baked into its design: it’s read-only, period. There’s no UPDATE, INSERT, or DELETE support. This isn’t an oversight—it’s a deliberate trade-off that enables the zero-code promise. But it means you can’t use ROAPI for applications that need even occasional writes. That “slowly moving datasets” descriptor in the tagline is doing heavy lifting. If your data updates more frequently than you can restart ROAPI or refresh from source, you’re fighting the tool’s purpose.

The dynamic table registration feature (the -d flag) sounds more flexible than it is. While it allows runtime table creation via API calls, you still need to configure initial tables, and there’s no persistence layer—restart ROAPI and your dynamically registered tables vanish. It’s marked experimental for good reason. For truly dynamic workloads, you’re better off with a real database.

Performance can also surprise you. DataFusion is impressive for an in-memory query engine, but it’s not tuned for operational database workloads. Complex joins across large tables, deeply nested aggregations, or queries that can’t leverage columnar batch processing will be slower than specialized OLAP databases like ClickHouse or DuckDB’s embedded mode. ROAPI shines for analytical queries over pre-aggregated data, not as a general-purpose query engine. If you’re hitting query timeouts, you’ve probably outgrown the tool and need something with more sophisticated indexing and optimization.

Verdict

Use ROAPI if you’re exposing analytical datasets for exploration, building data APIs over blob storage (S3/GCS), prototyping analytics products, or need multi-protocol access (SQL + GraphQL + REST) to static files without infrastructure overhead. It’s perfect for data engineers who want to skip the “import to Postgres” step, analysts who need SQL access to Parquet exports, or developers building read-only dashboards over data lake files. The tool excels when your data update frequency is measured in hours or days, not seconds.

Skip ROAPI if you need write capabilities of any kind, require real-time data freshness, have complex transactional workflows, or need production-scale query optimization with indexing strategies. Also skip it if your queries are computationally expensive—at that scale, purpose-built OLAP databases will outperform DataFusion significantly.

Think of ROAPI as curl for data files: phenomenal for quick access and prototyping, but you’ll graduate to specialized tools as requirements grow.