TL;DR

Researchers have developed an end-to-end pipeline to extract and analyze institutional affiliations from 5,356 ICLR 2026 accepted papers. The resulting dataset includes normalized affiliations and visualizations, offering insights into the landscape of AI research. The project aims to improve accuracy over author profiles and facilitate institutional impact analysis.

A new pipeline generates a PDF-derived institutional affiliation dataset for ICLR 2026 accepted papers, now available for research and visualization. Because affiliations are taken from the papers themselves rather than from author profiles, which can be unreliable, the dataset offers a more accurate picture of the institutions shaping AI research.

The pipeline processes 5,356 accepted papers from ICLR 2026, extracting author affiliations directly from PDF title blocks. This avoids profile drift, where an author's current employer is incorrectly attributed to a paper written at a previous institution. Institution names are normalized with approximately 250 rules so that variant spellings map to a single name across the dataset. The resulting data includes institution counts, country and region classifications, and detailed author affiliations, stored in multiple CSV formats for different analysis approaches.
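As a rough illustration of rule-based name normalization, here is a minimal Python sketch. The three rules, the canonical names, and the `normalize_affiliation` function are hypothetical and chosen for illustration only; the project's actual rule set (approximately 250 rules) and its matching logic are not described in this article and may work differently.

```python
import re

# Hypothetical rules: (compiled pattern, canonical institution name).
# These are illustrative assumptions, not the project's real rules.
RULES = [
    (re.compile(r"\b(mit|massachusetts institute of technology)\b", re.IGNORECASE), "MIT"),
    (re.compile(r"\b(cmu|carnegie mellon)\b", re.IGNORECASE), "Carnegie Mellon University"),
    (re.compile(r"\beth z[uü]rich\b", re.IGNORECASE), "ETH Zurich"),
]


def normalize_affiliation(raw: str) -> str:
    """Map a raw affiliation string to a canonical name using the first matching rule."""
    # Collapse runs of whitespace left over from PDF text extraction.
    cleaned = " ".join(raw.split())
    for pattern, canonical in RULES:
        if pattern.search(cleaned):
            return canonical
    # No rule matched: keep the cleaned string so nothing is silently dropped.
    return cleaned


if __name__ == "__main__":
    for raw in ["Massachusetts  Institute of Technology, Cambridge, MA", "ETH Zürich"]:
        print(raw, "->", normalize_affiliation(raw))
```

In this sketch the first matching rule wins, so rule order matters once patterns overlap. Unmatched strings pass through unchanged, which makes it easy to spot names that still need a rule.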

Key outputs include a publication-level dataset, a list of institutions ranked by unique affiliation counts, and visualizations such as treemaps of the research landscape. The project also compares three counting methods (per paper, first-author only, and fractional) to assess robustness and identify potential artifacts. PDF extraction covers approximately 96% of papers; for the remaining 4%, where PDF parsing fails, the pipeline falls back to author profile data.
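To make the difference between the three counting methods concrete, here is a small Python sketch. The input layout and the function names are hypothetical. The article does not define fractional counting, so the definition used here is an assumption: each paper carries one unit of credit, split equally among its authors and then equally among each author's institutions.

```python
from collections import Counter, defaultdict

# Hypothetical input: one list per paper, authors in order,
# each author given as a list of normalized institution names.
papers = [
    [["MIT"], ["MIT", "Google"], ["Stanford University"]],
    [["Stanford University"], ["Stanford University"]],
]


def count_per_paper(papers):
    """Each institution gets 1 for every paper on which it appears at least once."""
    counts = Counter()
    for authors in papers:
        counts.update({inst for author in authors for inst in author})
    return counts


def count_first_author(papers):
    """Only the first author's institutions are credited, 1 each per paper."""
    counts = Counter()
    for authors in papers:
        if authors:
            counts.update(set(authors[0]))
    return counts


def count_fractional(papers):
    """Assumed scheme: 1 credit per paper, split by author, then by institution."""
    counts = defaultdict(float)
    for authors in papers:
        if not authors:
            continue
        author_share = 1.0 / len(authors)
        for author in authors:
            if not author:
                continue  # an author with no known institution contributes nothing
            for inst in author:
                counts[inst] += author_share / len(author)
    return dict(counts)


if __name__ == "__main__":
    print(count_per_paper(papers))
    print(count_first_author(papers))
    print(count_fractional(papers))
```

On this toy input, Stanford University leads under per-paper and fractional counting but ties with MIT under first-author counting. Shifts of this kind are what a robustness comparison is meant to expose.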

Why It Matters

The dataset provides a more precise and reliable view of institutional contributions to AI research at ICLR 2026. It lets researchers, institutions, and policymakers better understand research trends, collaboration patterns, and the influence of industry versus academia. It also makes affiliation data more transparent and supports more accurate bibliometric analyses, which can inform funding, partnerships, and strategic decisions.

Background

Previous analyses relied heavily on author profiles from platforms like OpenReview, which are prone to inaccuracies due to profile drift. The new pipeline addresses this by extracting affiliations directly from PDF documents, which are the official source. This approach aligns with ongoing efforts in the research community to improve data quality and transparency in bibliometric studies. The pipeline builds on prior work in PDF parsing and normalization, applying these techniques specifically to ICLR 2026, one of the major AI conferences.

“This pipeline provides a more accurate, PDF-derived institutional affiliation dataset, avoiding the common pitfalls of author profile drift and enabling detailed analysis of research trends.”

— Dmytro Lopushanskyy, project lead

“The new affiliation dataset offers valuable insights into the global distribution of AI research and the evolving landscape of institutional contributions.”

— ICLR 2026 organizing committee

What Remains Unclear

It remains unclear how the dataset will be adopted by the broader research community or integrated into existing bibliometric tools. Additionally, the accuracy of PDF parsing for non-standard or poorly formatted papers may vary, and the long-term stability of the normalization rules has yet to be tested across future conferences.

What’s Next

Next steps include expanding the dataset to cover subsequent conferences, refining parsing algorithms for even higher accuracy, and encouraging community use to validate and improve the dataset. Researchers may also explore applying similar pipelines to other conferences or journals to build a comprehensive view of AI research trends.

Key Questions

How does this dataset improve over previous author profile-based analyses?

The dataset extracts affiliations directly from PDF title blocks, reducing errors caused by profile drift and providing a more accurate representation of institutional contributions.

Can this pipeline be used for other conferences or journals?

Yes, the pipeline is designed to be adaptable, and with some modifications, it can process PDFs from other conferences or journals to generate similar affiliation datasets.

What are the main limitations of this approach?

The primary limitations include potential parsing errors with non-standard papers and the need for ongoing normalization rule updates to maintain accuracy across diverse formats.

How can researchers access and use this dataset?

The dataset is publicly available via the project’s GitHub repository and can be integrated into bibliometric analyses or visualizations to better understand research trends.
