TL;DR

Researchers have developed an end-to-end pipeline to extract and analyze institutional affiliations from 5,356 ICLR 2026 accepted papers. The resulting dataset includes normalized affiliations and visualizations, offering insights into the landscape of AI research. The project aims to improve accuracy over author profiles and facilitate institutional impact analysis.

The pipeline produces a PDF-derived institutional affiliation dataset for ICLR 2026 accepted papers, now available for research and visualization. Because affiliations are read from the papers themselves rather than from author profiles, which can be out of date, the dataset gives a more accurate picture of the institutions shaping AI research.

The pipeline processes 5,356 accepted papers from ICLR 2026, extracting author affiliations directly from PDF title blocks to avoid issues like profile drift, where current jobs are incorrectly attributed to past papers. It normalizes institution names via approximately 250 rules, ensuring consistency across the dataset. The resulting data includes institution counts, country and region classifications, and detailed author affiliations, stored in multiple CSV formats for different analysis approaches.
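To make the normalization step concrete, here is a minimal sketch of rule-based name normalization. It is not the project's code: the two rules, the `RULES` table, and the `normalize_affiliation` function are hypothetical stand-ins for the roughly 250 rules described above, and the real pipeline may match names differently.

```python
import re

# Hypothetical rules for illustration only: (pattern, canonical name).
# The project's actual rule set is not reproduced here.
RULES = [
    (re.compile(r"\b(?:mit|massachusetts institute of technology)\b", re.I),
     "Massachusetts Institute of Technology"),
    (re.compile(r"\btsinghua univ(?:ersity|\.)?", re.I),
     "Tsinghua University"),
]


def normalize_affiliation(raw: str) -> str:
    """Map a raw title-block affiliation string to a canonical name.

    Whitespace is collapsed first, then the first matching rule wins.
    Strings that match no rule are returned cleaned but otherwise unchanged.
    """
    cleaned = re.sub(r"\s+", " ", raw).strip()
    for pattern, canonical in RULES:
        if pattern.search(cleaned):
            return canonical
    return cleaned
```

With rules like these, variants such as "MIT CSAIL" and "Massachusetts Institute of Technology" collapse to a single institution, so they are counted together rather than as separate entries.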

Key outputs include a publication-level dataset, a ranked list of institutions by unique affiliation counts, and visualizations such as treemaps illustrating the research landscape. The project also compares different counting methods—per paper, first-author only, and fractional—to assess robustness and identify potential artifacts. The pipeline’s accuracy is approximately 96%, with fallback to author profile data for the remaining 4% where PDF parsing fails.
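The three counting methods can be illustrated with a short sketch. The definitions below are assumptions, since the article does not spell them out: per-paper counting credits each institution once per paper, first-author counting credits only the first author's institution, and fractional counting splits one unit of credit per paper equally across author slots. The function names and the one-affiliation-per-author simplification are also hypothetical.

```python
from collections import Counter

# Each paper is a list of normalized institution names, one per author,
# in author order. (Assumes a single affiliation per author for simplicity.)


def count_per_paper(papers):
    """Credit each institution once for every paper it appears on."""
    counts = Counter()
    for authors in papers:
        for inst in set(authors):
            counts[inst] += 1
    return counts


def count_first_author(papers):
    """Credit only the first author's institution."""
    return Counter(authors[0] for authors in papers if authors)


def count_fractional(papers):
    """Split one unit of credit per paper equally across its authors."""
    counts = Counter()
    for authors in papers:
        if not authors:
            continue
        share = 1.0 / len(authors)
        for inst in authors:
            counts[inst] += share
    return counts
```

Under these definitions, fractional totals sum to the number of papers, while per-paper totals can exceed it when papers span several institutions. Comparing the resulting rankings is one way to see whether an institution's position depends on the counting choice.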

Why It Matters

The dataset provides a more precise and reliable view of institutional contributions to AI research at ICLR 2026. Researchers, institutions, and policymakers can use it to examine research trends, collaboration patterns, and the relative influence of industry and academia. It also improves transparency and supports more accurate bibliometric analyses, which can inform funding, partnerships, and strategic decisions.

Background

Previous analyses relied heavily on author profiles from platforms like OpenReview, which are prone to inaccuracies due to profile drift. The new pipeline addresses this by extracting affiliations directly from PDF documents, which are the official source. This approach aligns with ongoing efforts in the research community to improve data quality and transparency in bibliometric studies. The pipeline builds on prior work in PDF parsing and normalization, applying these techniques specifically to ICLR 2026, one of the major AI conferences.

“This pipeline provides a more accurate, PDF-derived institutional affiliation dataset, avoiding the common pitfalls of author profile drift and enabling detailed analysis of research trends.”

— Dmytro Lopushanskyy, project lead

“The new affiliation dataset offers valuable insights into the global distribution of AI research and the evolving landscape of institutional contributions.”

— ICLR 2026 organizing committee

What Remains Unclear

It is not yet known how widely the research community will adopt the dataset or whether it will be integrated into existing bibliometric tools. Parsing accuracy may also vary for non-standard or poorly formatted PDFs, and the normalization rules have not yet been tested on papers from future conferences.

What’s Next

Next steps include expanding the dataset to cover subsequent conferences, refining parsing algorithms for even higher accuracy, and encouraging community use to validate and improve the dataset. Researchers may also explore applying similar pipelines to other conferences or journals to build a comprehensive view of AI research trends.

Key Questions

How does this dataset improve over previous author profile-based analyses?

The dataset extracts affiliations directly from PDF title blocks, reducing errors caused by profile drift and providing a more accurate representation of institutional contributions.

Can this pipeline be used for other conferences or journals?

Yes, the pipeline is designed to be adaptable, and with some modifications, it can process PDFs from other conferences or journals to generate similar affiliation datasets.

What are the main limitations of this approach?

The primary limitations include potential parsing errors with non-standard papers and the need for ongoing normalization rule updates to maintain accuracy across diverse formats.

How can researchers access and use this dataset?

The dataset is publicly available via the project’s GitHub repository and can be integrated into bibliometric analyses or visualizations to better understand research trends.
