19.12.2025, Vienna
Imagine being commissioned to produce a regional economic analysis, specifically a municipal growth diagnostic, only to learn that the administrative microdata critical to your analysis is unavailable. This was the situation in which my team and I found ourselves a few months back. Had the year been 2015, our team of analysts would have been doomed to fail. Luckily, it was 2025.
Our objective was to conduct a municipal growth diagnostic for a mid-sized town in Albania. Pioneered by the Growth Lab at Harvard University, a municipal growth diagnostic builds on the framework of country-level growth diagnostics developed by Hausmann, Rodrik and Velasco (2008). It involves a principled analysis of the factors leading to low levels of economic investment in a place, and aims to help policymakers identify the most binding constraints to growth. Such an analysis typically studies a place’s growth potential and what stands in the way of that potential.
A municipal growth diagnostic is a data-hungry task. To diagnose the causes of underinvestment properly, we need data on a place’s demographic development, human capital, infrastructure and connectivity, local finance, production and innovation dynamics, and the quality of local government institutions, among other things. Every place is different, and working with microdata (individual-level and firm-level data) rather than aggregate data allows us to study patterns and dynamics creatively. Data that allows for the study of sub-national areas is far scarcer than country-level data. This is especially true in countries that have traditionally lagged in statistical data collection. These are precisely the places that can benefit most from a growth diagnostic.
Back to our task in central Albania. In the best-case scenario, we would have gained access to the full-count business register, the full-count Census of the Population, microdata from a Labor Force Survey or a Household Survey, and transaction-level data on firms' sales and purchases. Instead, we only gained access to aggregated data from these sources at the municipal and industry level.
What changed by 2025?
Providers of company-level data have proliferated in recent decades. The quality and coverage of such data have increased, making it possible to gain insights into places previously outside the scope of such providers. In our case, we worked with three business registers with information on Albania-based establishments: OpenCorporates, Dun & Bradstreet (D&B), and Crunchbase. Some 30,000 establishments were present in OpenCorporates, and 9,000 in D&B. Crunchbase, which mainly covers start-up firms, had fewer than 500 establishments. Gaining access to a fourth source, Moody’s Orbis, was difficult and expensive. Additionally, to study R&D and innovation, we accessed patent records, as well as scholarly work through OpenAlex.
Another perk of working in 2025 was working with young professionals who knew, or quickly learned, how to use APIs, geocode, and apply machine learning and natural language processing techniques to make the most of the data. This allowed us to turn free-text business descriptions and unconventional industrial classifications into standard classifications that make comparisons with aggregated official data possible. Had we had more time, we could also have analyzed satellite imagery to assess the state of infrastructure and to map certain industries such as mining and hydropower dams.
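To make the idea of mapping free-text descriptions to a standard classification concrete, here is a deliberately simple sketch. It is not the method we used: it swaps machine learning for plain keyword matching, it assumes NACE Rev. 2 sections as the target classification, and the keyword lists and the function names `tokenize` and `classify` are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword lists for a few NACE Rev. 2 sections (illustrative only;
# a real mapping would need far richer vocabularies or a trained model).
SECTION_KEYWORDS = {
    "A": {"farm", "farming", "agriculture", "livestock", "fishing", "forestry"},
    "B": {"mining", "quarry", "quarrying", "chromium", "extraction"},
    "C": {"manufacturing", "factory", "production", "textile", "footwear"},
    "F": {"construction", "building", "renovation"},
    "G": {"retail", "wholesale", "shop", "store", "trade"},
    "I": {"hotel", "restaurant", "bar", "accommodation", "catering"},
    "J": {"software", "telecommunications", "publishing", "programming"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(description):
    """Return the section letter with the most keyword hits, or None.

    Ties go to the section listed first in SECTION_KEYWORDS.
    """
    tokens = tokenize(description)
    scores = Counter()
    for section, keywords in SECTION_KEYWORDS.items():
        scores[section] = sum(1 for token in tokens if token in keywords)
    best_section, best_score = max(scores.items(), key=lambda item: item[1])
    return best_section if best_score > 0 else None


if __name__ == "__main__":
    for text in [
        "Family-run hotel and restaurant near the castle",
        "Wholesale trade of footwear",
        "Consulting services",
    ]:
        print(text, "->", classify(text))
```

Descriptions that match no keyword come back as `None`, which is a useful reminder that any such mapping leaves a residual that has to be checked by hand or handled with a more capable model.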
The above helped us understand the structure of the economy of Albanian municipalities (industry presence and specialization, economic geography) and their innovation dynamics. We also came to understand travel patterns, as well as access to and pricing of different travel modes. However, all this fell short of helping us understand demographic developments, the state of human capital and the municipal labor market. Nor could it tell us anything about safety, corruption and the functioning of local government institutions. Hence, we combined the above insights with those from official government reports, news articles, and reports by international organizations active in Albania. This helped us create a more complete picture of the municipal economy.
Where administrative data has no parallel
Despite the inroads we made, there is currently no match for data collected by governments, whether through administrative collections (tax, customs, social security, business registers) or through censuses and representative surveys (Labor Force Surveys, Household Surveys, Surveys of Businesses). This is particularly true for places that could benefit the most from analyses of their economic potential. When the US government cancelled the jobs report in October this year, private data providers such as LinkedIn, ADP and Revelio Labs filled the gap. This is not yet possible in less developed places, where job matching mainly takes place outside the digital domain and where the informal economy is sizable. Here, the government has a monopoly over business and labor market data.
Public business registers are incomplete, and proprietary ones are even less complete. The largest public register (OpenCorporates) covers less than a quarter of all active businesses in Albania. D&B, a widely used and pricey database, covers 7 percent of all businesses. Moreover, unlike government registry data, these databases seldom include longitudinal data on companies, which limits critical analysis of business dynamics. Various datasets provided by the World Bank, such as the Enterprise Surveys and the STEP survey, are excellent for national-level analysis, but are neither representative nor large enough for sub-national analysis.
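As a rough consistency check on these coverage figures, the short sketch below backs out the number of active businesses implied by the D&B numbers and applies it to OpenCorporates. It assumes that the rounded counts quoted earlier (some 9,000 records in D&B, some 30,000 in OpenCorporates) and the 7 percent share refer to the same population of businesses, which is an approximation; the function names are hypothetical.

```python
def implied_total(records, share):
    """Total number of businesses implied by a record count and a coverage share."""
    return records / share


def coverage(records, total):
    """Share of the total covered by a given number of records."""
    return records / total


# Rounded figures quoted in the text.
DNB_RECORDS = 9_000
DNB_SHARE = 0.07
OPENCORPORATES_RECORDS = 30_000

total = implied_total(DNB_RECORDS, DNB_SHARE)
oc_share = coverage(OPENCORPORATES_RECORDS, total)

if __name__ == "__main__":
    print(f"Implied active businesses: {total:,.0f}")
    print(f"Implied OpenCorporates coverage: {oc_share:.1%}")
```

Under these assumptions the implied OpenCorporates share comes out just under a quarter, in line with the figure stated above. Because all inputs are rounded, the result should be read as an order of magnitude rather than a precise count.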
Where do we go from here?
Governments lack the incentives and resources to make microdata available for research or economic analysis. The cost of preparing microdata for research is large. It involves efforts to anonymize, document, and clean the data, enable secure data access, track use and prevent abuse. Once the data is in use, users need technical support. Moreover, the process carries risks for the offices that hold the data, while the benefits go to the users and the beneficiaries of the analysis. In other words, the incentives for making government microdata available to outside research institutions are not there.
There are two ways to circumvent the incentive misalignment: build internal research capacity, or make external users pay for data access and for the risks that data use entails. Admittedly, not all countries can achieve either of these, but those where scale justifies the investment should aim to develop one of these capacities.
At the same time, the improving capabilities of LLMs and their availability at low cost hold the promise of breaking barriers to data access. They lower language frictions by providing instant translation, and they lower the frictions caused by different reporting standards by helping users compare apples to apples. They also help assign quantitative properties to qualitative data, allowing better measurement and comparison, which expands the scope for doing diagnostics.
It is likely that five or ten years from now significantly more data on individuals, jobs, and businesses in places like Albania will be digital and accessible to researchers and analysts. This, however, may not be a substitute for close collaboration with governments and statistical offices.