Discussion on Methods to Identify Institutional Investors from Messy Data

Author: Nicholas Polimeni

Affiliation: Urban Research Group, Georgia Institute of Technology

Published: April 12, 2024

Introduction

This is a written version of a presentation I gave to a working group of researchers studying the impact of institutional investors. The session was organized by Anthony Damiano, PhD and hosted by the Center for Urban and Regional Affairs at the University of Minnesota Twin Cities.

Session Recording and Presentation Slides

A need for accessible and robust methods to identify institutional investors, especially single-family rental (SFR) investors, motivates this discussion. Considering the lack of rental registries in the United States, researchers must bear significant technical burdens in identifying these investors from messy parcel and sales records. The challenges include, but are not limited to: inconsistent naming, corporations with multiple business addresses, and difficulty in acquiring data, especially historical data.

Note

I refer to algorithms that identify institutional investors as entity resolution (fuzzy matching) or clustering algorithms, depending on the approach in question. I use these terms to reflect the general desire to group same-owners (such as subsidiary corporations) within the data, even if the exact definitions of these algorithms differ. Classification algorithms may also be relevant, but we generally do not have a complete list of investors to supervise learning. I discuss machine learning (clustering, classification, and deep learning) algorithms in more detail later.

Tradeoffs

We should first analyze the fundamental challenges of this topic. All methods must make a trade-off between accuracy, runtime, and technical complexity.

Accuracy

For accuracy, we can prioritize an absolute number of matches, percent correct matches, percent true matches vs false matches, or some combination of these criteria.

In most situations, the aim is to maximize the number of correct matches without crossing a threshold for the percentage of false matches. Essentially, we want as many matches as possible, and most of them need to be correct.

Note

While accuracy is most important, an ideal algorithm should be precise or otherwise invariant to minor changes in data format. Some data may contain, for instance, street postfix abbreviations (e.g. “STREET” or “ST”) within the owner address whereas others will not. An ideal algorithm should retain its accuracy and precision regardless of these differences.

Runtime (Time Complexity)

For execution time or runtime, needs vary. For practitioners, waiting days for an algorithm to run is not feasible. Researchers are generally more willing to consider long-running algorithms. We must also understand the limitations of the hardware each party has access to.

In this context, there are three potential target categories of runtime:

  1. Good algorithms that can be run quickly and easily on any hardware,
  2. Better algorithms that experienced professionals can run if necessary,
  3. Optimal algorithms that may require offsite, vast infrastructure.

It is unlikely that the marginal improvement from category 2 to 3 is worth the additional effort. Training a deep learning model may be an exception, as it only requires extensive infrastructure and expertise once during training. We will discuss this further in the following sections.

Technical and Implementation Complexity

Technical complexity is often related to algorithmic efficiency; however, some fast algorithms might be very difficult to implement. For instance, concurrent processing may make an algorithm fast enough for practitioners to use, but they cannot be expected to implement this themselves.

This issue can be resolved with the creation of programmatic tools or applications that facilitate the use of these algorithms.

Theoretical Foundations: Time Complexity

Let us further analyze algorithm efficiency, otherwise known as time and space complexity. This will help us compare potential approaches later.

We are generally more concerned with time complexity. It is relatively cheap to buy RAM but impossible to buy time.

We can categorize algorithms with Big-O notation; we write time complexity as O(f(n)), where f(n) is an upper bound function for the runtime of the algorithm given input of size n.

Runtime is not how long an algorithm takes to run on any individual computer, but rather the number of operations in the worst case. We focus on how quickly the runtime of an algorithm grows with respect to its input size.

Graph showing the common Big-O notation functions for runtime complexity (“BigO Cheat Sheet” 2013).

With this understanding, let us analyze the runtime of algorithms related to entity resolution and clustering.

Time Complexity of String Comparisons

These comparisons are the building blocks of entity resolution algorithms.

Vectorized Comparison (Equals Operator): O(1)

This is an exact comparison between two encoded values (in this case, strings). When the values are stored as fixed-width encodings or hashes, a computer can take two of them from memory and compare them exactly in what is effectively a single operation.

Example

Both "A" =? "A" and "1 MAIN ST" =? "1 MAIN ST" take approximately the same time to execute, despite longer strings in the second group.

Similarity Metrics or Other Iterative Comparisons: O(n)

These metrics calculate a similarity score from the characters of two strings. The simplest ones compare the letters at corresponding positions in each string, which requires one comparison for each letter in the shorter string.

Example

Running a positional metric such as Hamming distance on "A" =? "A" and "1 MAIN ST" =? "1 MAIN ST" takes 1 and 9 comparisons, respectively. Levenshtein distance, a common string distance metric, also allows insertions and deletions; its standard dynamic-programming implementation compares every pair of letters, so for strings of lengths m and n it runs in O(m·n) rather than O(n).
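As a rough illustration of how the work grows with string length, the sketch below counts the letter comparisons made by a positional (Hamming-style) metric and the table cells filled by the textbook dynamic-programming form of Levenshtein distance. The function names and the operation counters are illustrative additions, not part of any particular library.

```python
def positional_distance(a, b):
    """Count mismatches at corresponding positions (Hamming-style).

    Returns (distance, comparisons). Extra letters in the longer
    string are counted as mismatches without a comparison.
    """
    comparisons = 0
    distance = abs(len(a) - len(b))
    for x, y in zip(a, b):
        comparisons += 1
        if x != y:
            distance += 1
    return distance, comparisons


def levenshtein_distance(a, b):
    """Textbook dynamic-programming edit distance.

    Returns (distance, cells), where cells is the number of table
    entries computed: len(a) * len(b).
    """
    previous = list(range(len(b) + 1))
    cells = 0
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cells += 1
            substitution_cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution
            ))
        previous = current
    return previous[-1], cells


print(positional_distance("1 MAIN ST", "1 MAIN ST"))       # (0, 9)
print(levenshtein_distance("1 MAIN ST", "1 MAIN ST"))      # (0, 81)
print(levenshtein_distance("1 MAIN ST", "1 MAIN STREET"))  # (4, 117)
```

The positional metric does one comparison per letter, while the edit distance fills a table whose size is the product of the two lengths.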

Time Complexity of Entity Resolution Algorithms

Aggregate / GroupBy: O(n)

These functions utilize the same efficiencies of vectorized comparison by hashing values and placing them into blocks/buckets.

Example (Python Pandas)

data.groupby("owner_address")
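To make the bucketing idea concrete, here is a minimal sketch of hash-based grouping using only the standard library. It is not how pandas implements groupby internally; it only shows why a single pass with hash lookups is enough. The record layout is an assumption, and the names and address keys are taken from the examples later in this document.

```python
from collections import defaultdict

# Illustrative records; the field names are assumptions for this sketch.
records = [
    {"owner_name": "ALTO ASSET COMPANY 2 LLC", "owner_address": "5001 PLAZA ON THE 78746"},
    {"owner_name": "FKH SFR PROPCO D L P", "owner_address": "1850 PARKWAY 30067"},
    {"owner_name": "EPH 2 ASSETS LLC", "owner_address": "5001 PLAZA ON THE 78746"},
    {"owner_name": "CERBERUS SFR HOLDINGS LP", "owner_address": "1850 PARKWAY 30067"},
]


def group_owner_names(rows, key="owner_address"):
    """Place each owner name into the bucket for its key value.

    Each row is visited once and each dictionary lookup hashes the key,
    so the work grows linearly with the number of rows.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row["owner_name"])
    return dict(buckets)


groups = group_owner_names(records)
for address, names in groups.items():
    print(address, names)
```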

Similarity Threshold Grouping: O(n^2)

These functions typically compare each string to every other string using a similarity metric, grouping strings that are above a certain threshold.

Runtime may be improved with blocking or concurrency; OpenRefine’s clustering function appears to use an optimized implementation.
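The pairwise pattern can be sketched as follows. This is not OpenRefine's implementation: the standard library's difflib.SequenceMatcher stands in for whichever similarity metric is used, and the 0.85 threshold is an arbitrary assumption. The point is the number of comparisons, which is n(n-1)/2 for n names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def threshold_groups(names, threshold=0.85):
    """Group names whose pairwise similarity meets the threshold.

    Returns (groups, comparisons). Every pair is compared once, so the
    number of comparisons grows quadratically with the number of names.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    comparisons = 0
    for i, j in combinations(range(len(names)), 2):
        comparisons += 1
        if SequenceMatcher(None, names[i], names[j]).ratio() >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values()), comparisons


names = [
    "STAR 2021-SFR2 BORROWER L P",
    "STAR 2021 SFR2 BORROWER LP",
    "ALTO ASSET COMPANY 2 LLC",
]
groups, comparisons = threshold_groups(names)
print(groups, comparisons)
```

With three names this makes three comparisons; with 100,000 names it would make roughly five billion, which is why blocking or concurrency matters.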

Example (OpenRefine)

Please refer to An et al. (2024).

Time Complexity of Machine Learning Algorithms

K-Nearest Neighbors (Clustering, Unsupervised): O(n^2)

This algorithm encodes strings as vectors in a multi-dimensional space. Then, it compares each vector to every other vector using a vector distance metric (like cosine similarity), finding the k closest vectors, where k is user-defined. Nearest-neighbor search of this kind underlies many clustering algorithms.

Some variants of this algorithm use index structures, such as metric trees, to avoid comparing every pair of vectors.
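A brute-force version of this search might look like the sketch below. Encoding each name as a vector of character-bigram counts is an assumption made for illustration; any other vector encoding could be substituted. Each name is compared with every other name, which gives the O(n^2) behavior.

```python
import math
from collections import Counter


def bigram_vector(text):
    """Encode a string as a sparse vector of character-bigram counts."""
    text = text.upper()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def k_nearest(names, k=1):
    """For each name, find the k most similar other names by brute force."""
    vectors = [bigram_vector(name) for name in names]
    neighbors = {}
    for i, u in enumerate(vectors):
        scored = [
            (cosine_similarity(u, v), names[j])
            for j, v in enumerate(vectors)
            if j != i
        ]
        scored.sort(reverse=True)
        neighbors[names[i]] = [name for _, name in scored[:k]]
    return neighbors


names = [
    "2018 3 IH BORROWER LP",
    "2018 2 IH BORROWER LP",
    "ALTO ASSET COMPANY 2 LLC",
    "EPH 2 ASSETS LLC",
]
print(k_nearest(names)["2018 3 IH BORROWER LP"])
```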

KNN visualization (S. H., n.d.)

Approximate K-Nearest Neighbors (Clustering, Unsupervised): generally at or below O(n)

This is a modified form of K-Nearest Neighbors that trades perfect accuracy for speed. It uses an efficient algorithm to split the vector space into subspaces. Rather than comparing a vector to every other vector, it utilizes the subspaces to quickly identify where the most similar vectors likely are.

Example ANN Algorithms and Libraries
  • FAISS (a library that implements several ANN indexes)
  • Locality-Sensitive Hashing (LSH)
  • Hierarchical Navigable Small World (HNSW)
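As one simplified illustration of splitting the vector space into subspaces, the sketch below uses random-hyperplane hashing, a basic form of LSH. Each vector receives one bit per hyperplane according to which side it falls on, and a query only looks at vectors sharing its bit signature. The number of hyperplanes, the seed, and the toy two-dimensional vectors are all assumptions; production libraries use more elaborate schemes.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random hyperplane normals that split the space into regions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector is on."""
    return tuple(
        sum(p * x for p, x in zip(plane, vector)) >= 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Bucket vector indices by signature in a single pass."""
    buckets = defaultdict(list)
    for idx, vector in enumerate(vectors):
        buckets[signature(vector, hyperplanes)].append(idx)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Return only the vectors in the query's region, not the whole set."""
    return buckets.get(signature(query, hyperplanes), [])


vectors = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
planes = make_hyperplanes(n_planes=4, dim=2)
index = build_index(vectors, planes)
print(candidates([3.0, 0.0], index, planes))
```

Vectors pointing in the same direction always share a signature, while a vector pointing the opposite way lands in a different bucket and is never compared. The speed comes at the cost of occasionally missing a true neighbor that falls just across a hyperplane.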

Visualization of subspaces in ANN (Bernhardsson, n.d.)

Deep Learning (Classification, Supervised): see the note on runtime complexity below

Deep learning passes data through a series of layers that transform it into a final output layer (here, categories). Each layer contains nodes that combine their weighted inputs and apply an activation function. The model is trained by feeding it input data and comparing its output to the desired output; the weights are adjusted as data points are processed to improve the result.

Training a deep learning model typically requires dedicated GPUs, but once the model is trained, it can be used much more efficiently.

Simplified deep learning architecture. Source.

*Note on Runtime Complexity

Runtime complexity depends on the implementation. However, the entire dataset will not be used during training, and training only needs to occur once, not every time the model is used.

Current and Potential Approaches

Simple Address Key

In my paper with Brian An, Uncovering Neighborhood-level Portfolios of Corporate Single-Family Rental Holdings and Equity Loss, our priority was developing a simple method to accurately quantify ownership scale, rather than to identify the portfolios of specific owners (Polimeni and An 2024). However, identifying specific portfolios is possible with an additional querying method. This section details both methods.

Motivation and Design

We rely on a modified version of the owner address to group the same-owners together. Owner address has a lower potential for false positive matches than owner name while also being easier to standardize and correctly match. Moreover, owner name has greater potential for inconsistency between subsidiary corporations, but many subsidiary or shell SFR corporations use the same address. In this way, we account for many cases of nested ownership structure, where different holding corporations may own properties, but report the same business address for records purposes. We favor this approach as it narrows focus to our research question without opaque or complex methodological underpinnings.

In particular, by using an encodable key for each address, this method takes advantage of vectorized comparison. Therefore, it can use an aggregate or groupby operation for a runtime complexity of O(n).

Method Details

For each parcel, we construct an owner address key by concatenating the address number, address string (street name or PO Box), and zip code. Because the address string excludes postfixes, we avoid some missed matches caused by inconsistent labeling, such as “STREET” vs. “ST”. Before creating the key, we eliminate various other inconsistencies by removing periods and commas, collapsing multiple spaces, and uppercasing all characters. Finally, to resolve inconsistencies for PO boxes, we take any address string with at least one number, extract the numbers, and append them to the string literal “PO BOX”.
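The procedure above can be sketched in a few lines. This is a minimal illustration rather than the code used in the paper: the function name and argument layout are assumptions, and it presumes the address has already been split into number, street string (without postfix), and zip code. Under those assumptions it reproduces the keys shown in Table 1.

```python
import re


def make_address_key(address_number, address_string, zip_code):
    """Build a simple owner address key from pre-split address fields."""
    # Missing address numbers are filled with 0.
    if address_number is None or str(address_number).strip() == "":
        number = "0"
    else:
        number = str(address_number).strip()

    # Uppercase, remove periods and commas, collapse repeated spaces.
    street = re.sub(r"[.,]", "", str(address_string).upper())
    street = re.sub(r"\s+", " ", street).strip()

    # PO boxes: any address string containing digits is rewritten as
    # "PO BOX <digits>", following the rule described in the text.
    digits = re.findall(r"\d+", street)
    if digits:
        street = "PO BOX " + "".join(digits)

    return f"{number} {street} {zip_code}"


print(make_address_key(52, "Creekside Park", "30022"))
print(make_address_key(None, "P.O. BOX 370049", "30037"))
print(make_address_key(None, "P O Box 370049", "30037"))
```

Because every record is reduced to a single string, the resulting key can be used directly in an aggregate or groupby operation.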

Table 1. Examples of Owner Address Key Procedure

| Address Number | Address String | Zip Code | Address Key |
| --- | --- | --- | --- |
| 52 | CREEKSIDE PARK | 30022 | 52 CREEKSIDE PARK 30022 |
| NA (filled with 0) | P.O. BOX 370049 | 30037 | 0 PO BOX 370049 30037 |
| NA (filled with 0) | P O Box 370049 | 30037 | 0 PO BOX 370049 30037 |

Table 2. Sample of Corporate Owners and Associated Subsidiaries

| Owner Address Key | Sample of Associated Names | Common Name |
| --- | --- | --- |
| 5001 PLAZA ON THE 78746 | ALTO ASSET COMPANY 2 LLC, EPH 2 ASSETS LLC, BAF 1 LLC | Amherst Residential |
| 1850 PARKWAY 30067 | FKH SFR PROPCO D L P, CERBERUS SFR HOLDINGS LP | FirstKey Homes (Cerberus Capital) |
| PO BOX 4090 85261 | HOME SFR BORROWER IV LLC, PROGRESS RESIDENTIAL BORROWER 15 LLC | Progress Residential |
| 1717 MAIN 75201 | 2018 3 IH BORROWER LP, 2018 2 IH BORROWER LP | Invitation Homes |
| 591 PUTNAM 6830 | STAR 2021-SFR2 BORROWER L P, STAR 2021 SFR2 BORROWER LP | Colony Starwood Homes |

Querying

Motivation

Due to the possibility that the same corporation may use multiple addresses, the address key method is not intended to robustly describe the portfolios of specific corporate owners. However, it is possible to query the resulting data (in this example, we label the table ownership scale) for this information.

Steps

  1. Produce an owner table by aggregating on the address key.
  2. Create a list of substrings associated with the desired owner (for instance, “AMHERST” and “ARVM” for “Amherst”).
  3. Identify rows where at least one associated owner name contains one of the substrings.
  4. Optional: apply a threshold* to avoid small, unrelated owners or individuals.

*A threshold may help prevent accidental matches. For instance, if a query produces three owners each with over 50 properties owned, and another with only one property owned, it is likely the latter is not associated with the former.

Note

In the case that a complete list of substrings for each owner is not known, it is possible to begin querying with the base name or other substrings that are known. From this result, add more potential substrings.

For instance, query only with “Amherst”, then look through the resulting associated names. This is how we identified “ARVM” as an additional substring. In our experience, most substrings or other types of patterns appear for multiple addresses, so it is possible to quickly identify the most common substrings.

Code Example (Python)

import re

# `owner_scale` is the owner table produced by aggregating on address key.
# Each row holds the list of owner names associated with one address key.

# Keywords to query for each owner.
# Short keywords such as "IH" could match unrelated names by accident,
# so a trailing space is added to reduce this possibility.
owner_keywords = {
    "Amherst": ["AMHERST", "ARVM"],
    "Cerberus": ["CERBERUS", "FKH", "RM1 ", "RMI "],
    "Progress": ["PROGRESS", "FYR"],
    "Invitation": ["INVITATION", "IH "],
    "Colony": ["COLONY", "STARWOOD", "CSH", "CAH "],
    "Sylvan": ["SYLVAN", "RNTR"],
    "Tricon": ["TRICON", "TAH"],
}

# Total parcels owned by each corporation, filled in by the loop below
total_owned_by_corp = {}

# Query for each owner described above
for owner, keywords in owner_keywords.items():
    # Escape each keyword so it is matched literally, then OR them together
    query_str = "|".join(re.escape(keyword) for keyword in keywords)

    # Find rows for TAXYR 2020 where at least one associated name
    # contains a keyword matching the owner
    owned_by_given_corp = owner_scale[
        (owner_scale["TAXYR"] == 2020)
        & owner_scale["assoc_owner_names"].apply(
            lambda names: any(re.search(query_str, name) for name in names)
        )
    ]

    # Only retain matched subsidiary owners over a threshold to prevent
    # false positive matches, for instance if a single parcel was owned
    # by someone with the last name "SYLVAN"
    owned_by_given_corp = owned_by_given_corp[
        owned_by_given_corp["count_owned_fulton_yr"] > 49
    ]

    # Sum total over all matched subsidiaries
    total_owned_by_corp[owner] = owned_by_given_corp["count_owned_fulton_yr"].sum()

Table 3. Example Output: SFR Ownership Scale of Top 5 Corporate Landlords in Fulton County in 2020

| Name | Parcels Owned in Fulton (Address Key Query) | OpenRefine (OR) method (An et al. 2024) | Net Diff. | OR method + manual review (An et al. 2024) | Net Diff. |
| --- | --- | --- | --- | --- | --- |
| Amherst Residential | 720 | 103 | +617 | 750 | -30 |
| Invitation Homes | 677 | 524 | +153 | 719 | -42 |
| Progress Residential | 619 | 457 | +159 | 760 | -141 |
| Sylvan Realty (RNTR) | 422 | 250 | +172 | 433 | -11 |
| Tricon Residential | 219 | 222 | -3 | 280 | -61 |
| Cerberus Capital | 256 | 340 | -86 | 349 | -93 |
| Starwood Capital | 122 | 361 | -239 | 450 | -328 |

Auto-Querying with NLP

Rather than manually building a set of substrings for each corporation, it may be possible to start with a single substring (say, “Amherst”) and find all associated owner names based on the address key method. Then, using an NLP technique, automatically identify the most unique substrings (“ARVM” should appear as this is an unusual combination of letters) and query for those.
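One possible way to surface such substrings is a TF-IDF-style score: tokens that occur often among the names matched to an owner but rarely across all owner names rank highest. The sketch below is only a starting point under that assumption. The sample names containing “AMHERST” and “ARVM” are hypothetical, as are the function names.

```python
import math
from collections import Counter


def name_tokens(name):
    """Split an owner name into a set of uppercase word tokens."""
    return set(name.upper().split())


def distinctive_tokens(matched_names, all_names, top_n=3):
    """Rank tokens that are frequent in the matched names but rare overall.

    Assumes every matched name also appears in all_names.
    """
    n_names = len(all_names)
    doc_freq = Counter(t for name in all_names for t in name_tokens(name))
    matched_freq = Counter(t for name in matched_names for t in name_tokens(name))
    scores = {
        token: count * math.log(n_names / max(doc_freq[token], 1))
        for token, count in matched_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda token: (-scores[token], token))[:top_n]


# Hypothetical names sharing an address key with an "AMHERST" name.
matched = ["AMHERST SFR LLC", "ARVM 5 LLC", "ARVM 6 LLC"]
all_names = matched + [
    "HOME SFR BORROWER IV LLC",
    "PROGRESS RESIDENTIAL BORROWER 15 LLC",
    "2018 3 IH BORROWER LP",
]
print(distinctive_tokens(matched, all_names, top_n=1))  # ['ARVM']
```

Common tokens such as “LLC” score low because they appear in most names, so they are unlikely to be proposed as query substrings.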

Application to Research

We utilized this method to create a straightforward framework quantifying equity loss from communities due to institutional investors (Polimeni and An 2024).

The analysis was conducted on an extensive dataset of parcel and sale records from 2010 to 2022, revealing $1.25B in lost financial equity in Atlanta’s neighborhoods, concentrated in predominantly African American neighborhoods.

Relative equity loss due to SFRs in Atlanta’s Neighborhood Statistical Areas (Polimeni and An 2024).

Advanced Address Key

The Simple Address Key has difficulty with precision; it may struggle on differently formatted data, for instance if “ST” or “STREET” is included in the address string. It also does not account for different suite numbers. Additionally, there may be other minor inconsistencies, such as spelling or word order, that are impractical to clean exhaustively.

A more complex address key can be formulated with this in mind, but both the technical complexity and runtime complexity increase.

The simplest form of an Advanced Address Key extracts the address number, the suite number, the zip code, and the last two letters of the longest word in the street address. Such features are almost always invariant to changes in spelling, word order, street postfixes, or other inconsistencies.

Address number and suite number can be identified as the first and last series of digits, separated by spaces. A regular expression can be used to extract these numbers. When an address contains only one series of digits, the suite number is set to 0, as in Table 4.
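A minimal sketch of this extraction follows. The function name, the hyphenated output layout (modeled on the first two rows of Table 4), and the list of unit designators to skip are assumptions for illustration. It does not convert number words such as “One” to digits.

```python
import re

# Hypothetical list of unit designators ignored when picking the longest word.
UNIT_WORDS = {"SUITE", "STE", "UNIT", "APT"}


def advanced_address_key(address, zip_code):
    """Build a key from address number, suite number, word suffix, and zip."""
    # Uppercase and treat periods and commas as spaces.
    text = re.sub(r"[.,]", " ", address.upper())

    # Series of digits separated by spaces: first is the address number,
    # last is the suite number (0 when only one series is present).
    numbers = re.findall(r"(?<!\S)\d+(?!\S)", text)
    address_number = numbers[0] if numbers else "0"
    suite_number = numbers[-1] if len(numbers) > 1 else "0"

    # Last two letters of the longest purely alphabetic word.
    # Ties are broken alphabetically so that word order does not matter.
    words = [w for w in text.split() if w.isalpha() and w not in UNIT_WORDS]
    longest = max(words, key=lambda w: (len(w), w)) if words else ""

    return f"{address_number}-{suite_number}-{longest[-2:]}{zip_code}"


print(advanced_address_key("PO BOX 490734", "30363"))
print(advanced_address_key("3505 KOGER BLVD., SUITE 400", "30315"))
```

Differently formatted versions of the same address, such as “3505 KOGER BLVD 400” and “3505 KOGER BLVD., SUITE 400”, produce the same key.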

Table 4. Example Advanced Address Key Matching

| Address 1 | Zip 1 | ADDR KEY 1 | Address 2 | Zip 2 | ADDR KEY 2 | Matched? |
| --- | --- | --- | --- | --- | --- | --- |
| PO BOX 490734 | 30363 | 490734-0-OX30363 | P O. BOX 490734 | 30363 | 490734-0-OX30363 | Yes |
| 3505 KOGER BLVD 400 | 30315 | 3505-400-ER30315 | 3505 KOGER BLVD., SUITE 400 | 30315 | 3505-400-ER30315 | Yes |
| One Buckhead PL STE 300 | 30305 | 1-300-AD-30305 | One Buckhead PL STE 325 | 30303 | 1-325-AD-30305 | No |
| 5 PEACHTREE ST | 30308 | 5-0-30308 | 5 PEACHTREE ST | 30354 | 5-0-30354 | No |

Improvements with Business Registry Data

Availability of usable state business registry data may vary, so I have not included it in the base algorithm. However, if available, it has the potential to aggregate same-owners that use multiple addresses. The exact implementation may vary with different data schemas.

For instance, an owner might record multiple owner addresses for different parcel records but have a single business address registered with the state. This offers the possibility of exactly matching the owner name to business registry data, although this has the same challenges of spelling and naming inconsistencies.

In some cases, business registry data may list owner, or beneficial, corporations and their addresses. These can be used to aggregate subsidiary addresses and names at the state level.

Regardless, considering data inconsistency, matching even good data from business registry records to parcel or sales records requires the use of fuzzy matching algorithms discussed here.

Crowdsourcing Rental Registries

Using an address key approach, we can construct a list of owners associated with a business address. To create a comprehensive list that identifies the portfolios of specific investors, we need to aggregate these lists with querying (see Simple Address Key querying).

If researchers complete this step within their own metro areas, we can combine this data to create a robust, crowdsourced ownership database. This database may consist of two tables, one linking address keys to an owner index, and another linking owner indexes to all associated corporate names. With this method, we can efficiently create a linkage between all of a corporate owner’s addresses and corporate names.

Table 5. Example Schema of Ownership Database (Addresses)

| Address Key | Owner Index |
| --- | --- |
| 5001 PLAZA ON THE 78746 | 1 |
| 9800 HILLWOOD 76177 | 2 |
| 1850 PARKWAY 30067 | 2 |

Table 6. Example Schema of Ownership Database (Associated Names)

| Owner Index | Associated Names | Common Name |
| --- | --- | --- |
| 1 | ALTO ASSET COMPANY 2 LLC, EPH 2 ASSETS LLC, BAF 1 LLC | Amherst Residential |
| 2 | FKH SFR PROPCO D L P, CERBERUS SFR HOLDINGS LP | FirstKey Homes |

Appendix C of Polimeni and An (2024) contains this data for Fulton County, GA.
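The two-table linkage can be sketched with plain dictionaries, using the rows from Tables 5 and 6. The variable and function names are illustrative assumptions; a real database would likely use relational tables joined on the owner index.

```python
# Table 5: address key -> owner index
address_to_owner = {
    "5001 PLAZA ON THE 78746": 1,
    "9800 HILLWOOD 76177": 2,
    "1850 PARKWAY 30067": 2,
}

# Table 6: owner index -> associated names and common name
owners = {
    1: {
        "common_name": "Amherst Residential",
        "associated_names": [
            "ALTO ASSET COMPANY 2 LLC",
            "EPH 2 ASSETS LLC",
            "BAF 1 LLC",
        ],
    },
    2: {
        "common_name": "FirstKey Homes",
        "associated_names": [
            "FKH SFR PROPCO D L P",
            "CERBERUS SFR HOLDINGS LP",
        ],
    },
}


def lookup_owner(address_key):
    """Return the common name for an address key, or None if unknown."""
    owner_index = address_to_owner.get(address_key)
    if owner_index is None:
        return None
    return owners[owner_index]["common_name"]


def addresses_for(owner_index):
    """Return every address key linked to one owner index."""
    return sorted(
        key for key, index in address_to_owner.items() if index == owner_index
    )


print(lookup_owner("9800 HILLWOOD 76177"))
print(addresses_for(2))
```

Keeping addresses and names in separate tables lets one owner index collect any number of business addresses and corporate names without duplicating rows.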

String Similarity Metrics

While key methods are ideal for applications that require fast computation or lack computational power, some use of string similarity metrics may be advantageous for those with the capability.

String similarity metrics are most useful for ambiguous cases, or for grouping owners with similar names but different addresses. Efficiency can also be increased by blocking: for instance, a vectorized aggregation can create buckets of owners that share at least one word, so that only names within the same bucket are compared. Doing so can reduce the number of necessary comparisons substantially.
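A minimal sketch of token-based blocking is shown below. Names are bucketed by each word they contain, and only names that share a bucket become candidate pairs for a later similarity comparison. The function name is an assumption, and the sample names come from Table 2.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names):
    """Return index pairs of names that share at least one word token."""
    blocks = defaultdict(set)
    for idx, name in enumerate(names):
        for token in set(name.upper().split()):
            blocks[token].add(idx)  # hash-based bucketing, one pass

    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


names = [
    "2018 3 IH BORROWER LP",
    "2018 2 IH BORROWER LP",
    "ALTO ASSET COMPANY 2 LLC",
    "EPH 2 ASSETS LLC",
    "FKH SFR PROPCO D L P",
]
pairs = candidate_pairs(names)
print(sorted(pairs))  # [(0, 1), (1, 2), (1, 3), (2, 3)]
```

Here four candidate pairs remain out of the ten possible pairs. Very common tokens such as “LLC” or a bare “2” create large buckets, so it may be worth excluding them when forming blocks.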

However, most string distance metrics are not optimized for address strings or corporate owner names. They are typically best for spelling inconsistencies. Therefore, developing a string distance metric optimized for corporate names has potential.

My experience indicates that such a string distance metric should prioritize token comparison over letter comparison.

For instance, “2018 3 IH BORROWER LP” and “HOME SFR BORROWER IV LLC” have many letters in common, but they are different entities. More weight should be placed on the substrings, like “IH” and “HOME”, since common terms like “SFR” and “BORROWER” appear frequently.
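One way to express this weighting is a Jaccard similarity over word tokens, where each token is weighted by its inverse document frequency so that rare tokens count for more than common ones. This is a sketch of the idea, not a tested metric; the function names and the small corpus (drawn from Table 2) are assumptions.

```python
import math
from collections import Counter


def make_idf(corpus):
    """Return a function giving each token's inverse-document-frequency weight."""
    n_names = len(corpus)
    doc_freq = Counter(t for name in corpus for t in set(name.upper().split()))

    def idf(token):
        # Unseen tokens are treated as appearing once (most distinctive).
        return math.log(n_names / doc_freq.get(token, 1))

    return idf


def weighted_token_similarity(a, b, idf):
    """Weighted Jaccard similarity over word tokens."""
    tokens_a = set(a.upper().split())
    tokens_b = set(b.upper().split())
    shared = sum(idf(t) for t in tokens_a & tokens_b)
    total = sum(idf(t) for t in tokens_a | tokens_b)
    return shared / total if total else 0.0


corpus = [
    "2018 3 IH BORROWER LP",
    "2018 2 IH BORROWER LP",
    "HOME SFR BORROWER IV LLC",
    "PROGRESS RESIDENTIAL BORROWER 15 LLC",
    "FKH SFR PROPCO D L P",
    "CERBERUS SFR HOLDINGS LP",
]
idf = make_idf(corpus)

same_owner = weighted_token_similarity(corpus[0], corpus[1], idf)
different_owner = weighted_token_similarity(corpus[0], corpus[2], idf)
print(round(same_owner, 3), round(different_owner, 3))
```

The two Invitation Homes names score far higher than the pair that shares only “BORROWER”, because that token appears in most of the corpus and carries little weight.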

This warrants further investigation as the most common string distance metrics have a high number of false positives in our use case.

Clustering (KNN or ANN)

Clustering requires that the data can be vectorized. We can encode owner name strings into vectors using a string embedding model. However, these models are generally intended to encode semantic meaning, so they are not well suited to identifying strings that look similar.

Address keys are also vectorized, but we do not gain anything from using a clustering technique versus a simple aggregation.

Clustering only has potential if we consider other dimensions in the data, such as property characteristics or location. For instance, the same owner might own many properties in the same geographic area. However, this method may cause false positives.

Clustering is likely not a solution to our problem.

Deep Learning

Deep learning requires a correct, labeled dataset to train on. This takes additional effort to create. Furthermore, there are many instances of subsidiary corporations with no discernable pattern (see more in the next section).

Deep learning is also likely not a solution.

Is This a Solvable Problem? If Not, What is the Best Solution?

Ultimately, the problem is solvable from a computational perspective (there exists a polynomial time algorithm), but it is not perfectly solvable from a data perspective.

The challenge comes from cases where both the owner names and addresses are different but they represent the same owner. For instance, how is it possible that any algorithm or model can correctly classify “Jeff 1 LLC” as a subsidiary of Amherst Capital if the owner address is different?

Alternatively, how can an algorithm distinguish “HOME BORROWER I LLC” from “HOME SFR BORROWER LLC” if these are owned by different corporations? Or, if they are owned by the same corporation, but use different addresses, we run into the same issue.

Table of all ownership possibilities (Polimeni 2023)

If we list out all possible combinations, we see that some are difficult or impossible to capture. There are no string distance metrics that will resolve this issue and no discernable pattern for deep learning.

Without rental registries, or some other complete ground truth, I argue that a perfect algorithm is not possible. It may not even be worth the time to aim for this. The best solution is likely an optimized fuzzy address key, coupled with additional data steps to aggregate specific portfolios. This solution finds an ideal balance between accuracy, runtime, and implementation complexity. Additionally, it generally avoids false positives, which is ideal for most purposes.

A Methods Paper to Reduce These Challenges

For each research question that necessitates the classification or identification of property ownership, researchers must spend valuable time wrangling with these challenges. Practitioners can also gain valuable insights from this analysis but lack the resources or knowledge to overcome these barriers.

To reduce the burden on researchers and increase the speed at which the impact of SFR investment can be studied, a methods paper should concretely compare all current methods.

The goal of this speculative project is to produce:

  • A sample of labeled data (ground truth) across multiple metro areas to train and/or test models against. This is related to the data crowdsourcing mentioned previously.
  • A paper comparing all current (and potential) methods, highlighting the tradeoffs of each and which methods are best under specific circumstances.
  • Optimized coding tools or an application to facilitate the best procedures.

Conclusion

Until rental registries are established, these challenges will continue to limit actionable and evidence-based policymaking for the housing market. It is essential that we quickly uncover significant evidence to urge policymakers to act before corporations find new ways to obfuscate ownership.

Q&A from the Live Session

Will be recorded here after the conclusion of the session.

References

An, Brian, Andrew Jakabovics, Anthony W. Orlando, Seva Rodnyansky, and Eunjee Son. 2024. “Who Owns America? A Methodology for Identifying Landlords’ Ownership Scale and the Implications for Targeted Code Enforcement.” Journal of the American Planning Association 0 (0): 1–15. https://doi.org/10.1080/01944363.2023.2292674.
Bernhardsson, Erik. n.d. “Nearest Neighbors and Vector Models.” https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html.
“BigO Cheat Sheet.” 2013. https://www.bigocheatsheet.com/.
Polimeni, Nicholas. 2023. “An Algorithmic Approach to Identify Parcel Ownership with Publicly Available but Messy Data.” https://www.nicholaspolimeni.com/resources/acsp_handout.pdf.
Polimeni, Nicholas, and Brian An. 2024. “Uncovering Neighborhood-Level Portfolios of Corporate Single-Family Rental Holdings and Equity Loss.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4748871.
S. H., Roshna. n.d. “K-Nearest Neighbors Algorithm.” Intuitive Tutorial. https://intuitivetutorial.com/2023/04/07/k-nearest-neighbors-algorithm/.