Use the following sources to obtain text samples:

Data tasks:

  1. Select approximately 150-200 text samples from each of the above sites.
    1. Take text samples from the beginning of each article/webpage (the abstract or the first couple of paragraphs).
    2. Take samples of approximately 100-250 words.
    3. The kids’ sites may not have 150 separate articles available. If not, collect as many as you can and keep the samples from the other sources roughly balanced.
  2. Create a spreadsheet/database record for each text sample, including the following data:
    1. DocID (unique numerical ID for each sample: 1, 2, 3, …)
    2. Source ID
    3. Complete URL
    4. Text sample
  3. Save each text sample as a separate text file.
    1. Filename: <DocID>.txt
    2. Final directory should consist of a series of text files named 1.txt, 2.txt, 3.txt, …, 998.txt, 999.txt, …
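
Tasks 2 and 3 above could be scripted along the following lines. This is only a minimal sketch, not a prescribed workflow: the function names `trim_sample` and `save_samples`, the index filename `index.csv`, the CSV column names, and the choice to cut each sample at 250 words are all assumptions made for illustration. It assumes the samples have already been collected as (Source ID, URL, text) tuples.

```python
import csv
from pathlib import Path

MAX_WORDS = 250  # assumed cutoff, taken from the upper end of the 100-250 word guideline


def trim_sample(text, max_words=MAX_WORDS):
    """Keep at most max_words whitespace-separated words from the start of text."""
    words = text.split()
    return " ".join(words[:max_words])


def save_samples(samples, out_dir, index_name="index.csv"):
    """Write one <DocID>.txt file per sample plus a CSV index of all records.

    samples: iterable of (source_id, url, text) tuples (hypothetical input shape).
    Returns the list of DocIDs assigned, numbered 1, 2, 3, ...
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    doc_ids = []
    with open(out / index_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # One column per field listed in task 2.
        writer.writerow(["DocID", "SourceID", "URL", "Text"])
        for doc_id, (source_id, url, text) in enumerate(samples, start=1):
            sample = trim_sample(text)
            writer.writerow([doc_id, source_id, url, sample])
            # Task 3: each sample saved as its own file named <DocID>.txt.
            (out / f"{doc_id}.txt").write_text(sample, encoding="utf-8")
            doc_ids.append(doc_id)
    return doc_ids
```

Calling `save_samples(samples, "corpus")` would then produce `corpus/1.txt`, `corpus/2.txt`, and so on, alongside `corpus/index.csv`. Trimming on whitespace can cut a sample mid-sentence, so samples may still need a manual check.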

Schedule (meet 11:30 each day):

  • Thursday, March 5: first 50 samples

  • Tuesday, March 10: samples 51-250
  • Thursday, March 12: samples 251-500
  • Tuesday, March 16: samples 501-1000
