header1   header
header
header Register : : Login header
header
connector   connector
menuleft menuright
submenu   submenu
left
Querying a CSV file
Last Post 03 Mar 2010 10:38 AM by PaulB. 5 Replies.
Printer Friendly
  •  
  •  
  •  
  •  
  •  
Sort:
PrevPrev NextNext
You are not authorized to post a reply.
Author Messages
PaulBUser is Offline
New Member
New Member
Posts:19
Avatar

--
02 Mar 2010 01:05 PM
    Evening,

    I'm looking for some advice on the best / most efficient way of querying information in a CSV file. The file has around 30,000 rows and 7 columns, each row is already unique based on the first column value. I can use Import-CSV to bring the data in but not sure how to iterate through the data based on second CSV (200 rows, single column).

    So to expand, the first large file contains distinct domain records containing information on the connection status, i.e. success, failure, connections, messages, etc. The second smaller CSV is a list of domains. What I'm attempting to do is get the detailed information for the small list of domains based on the detailed infomration in the the all domains list.

    So far I've looked at iterating the smaller CSV and using the input as a constraint for where-object whilst moving over the larger, detailed CSV. But this seems very inefficient as I'm repeating this 200 times, re-reading all 30,000 objects). I've also looked at placing the 30,000 objects into an associative array where the domain is the key and the value is an array of the other details. However, building the associative array takes forever.

    Any suggestions?

    MT

    Paul
    glnsizeUser is Offline
    Basic Member
    Basic Member
    Posts:193

    --
    02 Mar 2010 05:11 PM

    You could try group-object I've used it in the past to collate large sets of data.  Never 30,000 that's just gonna take a long time to process!

    $domains = Import-Csv ./some.csv            
                
    Import-CVS ./ReallyBig.csv |             
        Group-Object Name |             
        Where-Object {$domains -like $_.Name} |             
        Select-Object -ExpandProperty Group

    You may want to look into just ingesting the data into a simple database.

    ~Glenn

    PaulBUser is Offline
    New Member
    New Member
    Posts:19
    Avatar

    --
    02 Mar 2010 11:15 PM
    I did consider whether I should use dotNet and datasets but was looking for other peoples opinion on the best way to do this task
    George HowarthUser is Offline
    Basic Member
    Basic Member
    Posts:360
    Avatar

    --
    03 Mar 2010 02:14 AM
    If you're optimizing for performance, the easiest way would be something like this (if my assumptions are correct):

    Use a StreamReader to read each line in the smaller CSV and populate an ArrayList with the domains
    Load the larger CSV with Import-CSV
    For each object in the larger CSV, compare the object's key with each element in the array of domains. If a match is found, select that object, remove the domain just found from the ArrayList to decrease the size of the array for the next comparison, then break from the current iteration to prevent the object's key being compared with the remaining domains in the array.

    Translated to code, that would look something like this:

    $domains = New-Object System.Collections.ArrayList
    $reader = New-Object System.IO.StreamReader -ArgumentList "myDomainsCSV.csv"

    while ($reader.Peek() -ge 0) # While not end of file
    {
    $domains.Add($reader.Readline())
    }

    $reader.Close()

    $domains.RemoveRange(0, 1) # Remove the first element because it is the column header

    $matchingRecords = New-Object System.Collections.ArrayList
    $records = Import-CSV -Path "myBigCSV.csv"

    foreach ($record in $records)
    {
    foreach ($domain in $domains)
    {
    if ($record.Key -eq $domain)
    {
    $matchingRecords.Add($record)
    $domains.Remove($record) # Remove the domain that has just been found from the ArrayList, as it is no longer needed and optimizes the next iteration
    break # Ignore remaining domains (if any) and process next record
    }
    }
    }

    That would remove a lot of unneccesary iterations from the process.
    PoSherLifeUser is Offline
    Basic Member
    Basic Member
    Posts:364
    Avatar

    --
    03 Mar 2010 06:40 AM
    Filter it as the data is being imported. As soon as it hits a record that is already in $domains it continues to the next line of the large file. No extra breaks or reads to be done. No re-iterations after the data is imported either. Also, notice that $domains is a simple array of strings for the domains csv file. This will help with speed and keep the script less complex.

    $domains = Get-Content Domains.csv
    $records = Import-Csv AllRecords.csv | Where {$domains -contains $_.Domain}
    $records
    When at first you don't succeed Step-Into

    http://theposherlife.blogspot.com
    http://www.jandctravels.com

    PaulBUser is Offline
    New Member
    New Member
    Posts:19
    Avatar

    --
    03 Mar 2010 10:38 AM
    Wow simple and fast. I'd not thought of using 'where' like that

    Thank you very much

    Paul
    You are not authorized to post a reply.


    Active Forums 4.3
    right
    footer   footer
    footer Sponsored by Quest Software • SAPIEN Technologies • Compellent • Microsoft Windows Server 2008 R2 footer
    footer   footer