In most cases you start out with one or more sources of basic user identities. These are the canonical truths about who actually works for your company. In most cases it includes a human resources system of some kind. If this system is connected to payroll it tends to contain very good data as the employees tends to complain if they don't get payed and the company don't like to continue paying people that no longer works for the company. In some cases you will discover that the HR system is only linked to payroll in some parts of the world while other, i.e. the UK office, uses another system to feed payroll. This usually results in the data in the HR system being less well maintained which can cause serious issues.
In many cases there are entities that needs to go into the IDM system that aren't present in the HR system i.e. contractors. Getting hold of data about these users is often not easy but there may be a contractor database somewhere. Worst case you may have to settle with data from the security badging system or from the corporate active directory. Even when you find basic information about the contractors you will often discover that the data quality can be very bad. Information such as manager id, current status or end date may not necessarily be well maintained. If you for example are planning to send warning emails to the manager of the contractor it will not be good if all 200 contractors in the manufacturing division reports to the division VP.
Assuming that you managed to discover your base identities the next step is to identify what target system accounts belongs to the base identities. In a perfect world there should be a unique identifier in each target system account such as an employeeid that can be traced back to one and only one account in the trusted source (i.e. the HR system). In practice this is rarely true. In most cases some target system accounts contains the unique identifier while a large percentage will need to be linked using less exact methods such as email addresses or in worst case names. Name based linking can be very time consuming and there is also a substantial risk that you will end up with false matches.
There are many tools available that will make the linkage process and if your account volume is decently high you want to start by doing automated matching using some form of script or program. Once you have cleared out the obvious matches you may want to switch over to a manual process that utilizes a simple Excel sheet to match between trusted source accounts and target system accounts.
If you can use a divide and conquer approach and divide the trusted source accounts and the target system accounts into distinct buckets things gets much easier. Lets say you have 3000 unmatched trusted source accounts and 5000 unmatched target system accounts you want to investigate if you can divide the accounts into ten country buckets based on the country attribute in both the trusted source account and the target account. This will reduce the problem to ten instances of matching 200-400 trusted accounts to 300-700 target accounts which is a much easier problem to solve. This approach of course assumes that there is a suitable "bucketing" attribute available in the target system as well as in the trusted source.
To sum things up preparing for the initial load essentially consists of the following steps:
- Discovering the trusted sources that will provide the base identities
- Extracting the user data from the trusted sources
- Cleaning or at least evaluating the quality of the data contained in the trusted sources
- Discovering the target system accounts
- Linking the trusted identities to the target system accounts
If the data is in good shape this can be a quick process but if the data is bad it is not that uncommon that you need to spend 3-6 months on the data cleanup. It is therefor a good idea to include a data clean up and correlation thread in your IDM program that starts at the same time as you kick off the provisioning implementation project.