Friday, October 3, 2014

Fetching and cleaning text from the Web with 'nltk' (Python)

Let's look at an example of how to use Python's 'nltk' package to download documents from the web and clean them so that their content can be analyzed afterwards. Once the text is clean, we give a very brief example of a concordance analysis.
In [1]:
%pylab inline
import nltk
from nltk.corpus import PlaintextCorpusReader
Populating the interactive namespace from numpy and matplotlib

In [6]:
# import the packages needed to work with HTML
# (Python 2 import; in Python 3, urlopen lives in urllib.request)

from urllib import urlopen
In [14]:
# build the URL from which we will fetch the corpus
pop = "http://www.foreignaffairs.com/articles/141191/cynthia-j-arnson-and-carlos-de-la-torre/viva-el-populismo"
In [15]:
# download the page
populismo = urlopen(pop).read()
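The cell above relies on the Python 2 `urllib` module. As a minimal sketch for readers on Python 3, the same download step could look like the following. The helper name `fetch_html` and the UTF-8 fallback are assumptions made for illustration, not part of the original notebook.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Download `url` and return its body decoded to text.

    Hypothetical helper: in Python 3, read() returns bytes, so we decode
    them using the charset declared by the server, if there is one.
    """
    with urlopen(url) as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or default_encoding
    # errors="replace" keeps the sketch from failing on stray bytes
    return raw.decode(charset, errors="replace")


# Usage (requires network access; `pop` is the URL defined earlier):
# populismo = fetch_html(pop)
```

Decoding at download time means the rest of the pipeline works with text rather than raw bytes, which avoids escape sequences such as `\xe2\x80\xba` showing up in the output.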
In [16]:
# check the data type of, in this case, populismo
type(populismo)
Out[16]:
str
In [17]:
# inspect the length of the downloaded page
len(populismo)
Out[17]:
68394
In [29]:
# pick an arbitrary slice of the string
populismo[120:2380]
Out[29]:
'rs \n  \n \n \n \n \n \n \n\n \n \n \n \n\n \n \n \n \n \n \n \n\n\n\n\n \t \n \n \n\n  \n  \n  \n\n   \n  \n  \n  \n  \n  \n  \n \n    \n  \n\n \n\n \n \n  \n  \n   \n   Skip to Navigation \n\n     \n    \n \n \n\n \n \n  \n \n\n \n \n \n\n \n \n  \n \n\n \n  \n   \n  \n   \n\n      \n\n        \n      \n      \n     \n     \n    \n            \n     \n     Foreign Affairs     \n     \n    \n    \n      \n      \n    \n \n \n\n \n \n  \n  Home \n International Editions \n Digital Newsstand \n Job Board \n Account Management \n RSS \n Newsletters \n \n \n\n \n \n \n\n \n \n  \n \n \n \n \n \n \n\n \n \n\n \n \n \n\n \n \n  \n  Login \n  Register \n   My Cart \n  \n\n \n        \n      \n\n   \n\n      \n\n    \n\n    \n    \n    \n    \n \n \n\n \n \n  New Issue \n Archive \n Regions Africa \n Americas \n Asia \n Europe \n Middle East \n Russia & FSU \n Global Commons \n \n Topics Economics \n Environment \n Security \n Law & Institutions \n Politics & Society \n U.S. Policy \n \n Features Snapshots \n Letters From \n P.S. \n Reading Lists \n Comments \n Essays \n Responses \n \n Discussions Interviews \n Roundtables \n Letters to the Editor \n News & Events \n \n Video \n Books & Reviews Review Essays \n Capsule Reviews \n FA Books \n \n Classroom \n About Us Submissions \n Staff \n Employment \n Advertising \n Sponsored Sections \n Contact Us \n History \n \n Subscribe \n \n\n \n  \n       \n\n   \n\n   \n   \n       \n     Home \xe2\x80\xba Features \xe2\x80\xba Snapshots \n    \n         \n           Viva el Populismo? \n           The Tense Future of Latin American Politics \n          \n             \n                \n      By Cynthia J. Arnson and Carlos de la Torre      \n                \n       CYNTHIA J. ARNSON is director of the Latin American Program at the Woodrow Wilson International Center for Scholars. CARLOS DE LA TORRE is director of international studies and professor of sociology at the University of Kentucky, Lexington. 
They are the editors of Latin American Populism in the Twenty-First Century (Woodrow Wilson Center Press and The Johns Hopkins University Press, 2013), upon which this essay draws. \n See more by Cynthia J. Arnson See more by Carlos de la Torre      \n     \n           \n       April 16, 2014      \n          \n    \n         \n      Venezuelan President Nicolas Maduro waves to supporters during a campaign rally on April 6, 2013 (Courtesy Reuters)      \n    \n\n               \n   \n   \n    \n    \n         \n\n \n \n \n'

What we have is the raw page. We need to clean it in order to extract the information we need. The 'nltk' package (version 2.x, used here) has a function that does this cleanup quickly:

In [21]:
populismo = nltk.clean_html(populismo)
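In NLTK 3.0 and later, `nltk.clean_html` is no longer implemented: calling it raises `NotImplementedError` and points to BeautifulSoup instead. As a rough stand-in that needs only the standard library, the sketch below strips tags with `html.parser`. The names `TextExtractor` and `strip_html` are hypothetical, and the whitespace handling is a simplification that will not reproduce the output of `clean_html` exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, ignoring script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces and collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of `markup` as a single string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Joining chunks with a space keeps words from adjacent elements (menu items, for example) from running together, at the cost of an occasional extra space before punctuation. That is harmless for the tokenization step that follows.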
In [28]:
populismo[1:2380]
Out[28]:
'ynthia J. Arnson and Carlos de la Torre | The Tense Future of Latin American Politics | Foreign Affairs | Foreign Affairs \n  \n \n \n \n \n \n \n\n \n \n \n \n\n \n \n \n \n \n \n \n\n\n\n\n \t \n \n \n\n  \n  \n  \n\n   \n  \n  \n  \n  \n  \n  \n \n    \n  \n\n \n\n \n \n  \n  \n   \n   Skip to Navigation \n\n     \n    \n \n \n\n \n \n  \n \n\n \n \n \n\n \n \n  \n \n\n \n  \n   \n  \n   \n\n      \n\n        \n      \n      \n     \n     \n    \n            \n     \n     Foreign Affairs     \n     \n    \n    \n      \n      \n    \n \n \n\n \n \n  \n  Home \n International Editions \n Digital Newsstand \n Job Board \n Account Management \n RSS \n Newsletters \n \n \n\n \n \n \n\n \n \n  \n \n \n \n \n \n \n\n \n \n\n \n \n \n\n \n \n  \n  Login \n  Register \n   My Cart \n  \n\n \n        \n      \n\n   \n\n      \n\n    \n\n    \n    \n    \n    \n \n \n\n \n \n  New Issue \n Archive \n Regions Africa \n Americas \n Asia \n Europe \n Middle East \n Russia & FSU \n Global Commons \n \n Topics Economics \n Environment \n Security \n Law & Institutions \n Politics & Society \n U.S. Policy \n \n Features Snapshots \n Letters From \n P.S. \n Reading Lists \n Comments \n Essays \n Responses \n \n Discussions Interviews \n Roundtables \n Letters to the Editor \n News & Events \n \n Video \n Books & Reviews Review Essays \n Capsule Reviews \n FA Books \n \n Classroom \n About Us Submissions \n Staff \n Employment \n Advertising \n Sponsored Sections \n Contact Us \n History \n \n Subscribe \n \n\n \n  \n       \n\n   \n\n   \n   \n       \n     Home \xe2\x80\xba Features \xe2\x80\xba Snapshots \n    \n         \n           Viva el Populismo? \n           The Tense Future of Latin American Politics \n          \n             \n                \n      By Cynthia J. Arnson and Carlos de la Torre      \n                \n       CYNTHIA J. ARNSON is director of the Latin American Program at the Woodrow Wilson International Center for Scholars. 
CARLOS DE LA TORRE is director of international studies and professor of sociology at the University of Kentucky, Lexington. They are the editors of Latin American Populism in the Twenty-First Century (Woodrow Wilson Center Press and The Johns Hopkins University Press, 2013), upon which this essay draws. \n See more by Cynthia J. Arnson See more by Carlos de la Torre      \n     \n           \n       April 16, 2014      \n          \n    \n         \n      Venezuelan President Nicolas Maduro waves to supporters during a campaign rally on April 6, 2013 (Courtesy Reuters)      \n    \n\n               \n   \n   \n    \n    \n         \n\n \n \n \n'
In [43]:
popTokens = nltk.word_tokenize(populismo)
In [45]:
len(popTokens)
Out[45]:
1274

By trial and error we can locate where the article itself begins among the tokens:

In [65]:
poTokens = popTokens[131:550]
In [68]:
poTokens[1:20]
Out[68]:
['Tense',
 'Future',
 'of',
 'Latin',
 'American',
 'Politics',
 'By',
 'Cynthia',
 'J.',
 'Arnson',
 'and',
 'Carlos',
 'de',
 'la',
 'Torre',
 'CYNTHIA',
 'J.',
 'ARNSON',
 'is']
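Instead of guessing slice boundaries by hand, one could search the token list for a phrase known to open the article. The helper below is a hypothetical sketch (the name `find_phrase` is not from the notebook). It returns every position where the phrase occurs, because a title may appear more than once on a page, for instance in the page header as well as above the article body.

```python
def find_phrase(tokens, phrase):
    """Return all indices at which the token sequence `phrase` starts."""
    n = len(phrase)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


# Usage with the notebook's variables (assumes popTokens already exists):
# starts = find_phrase(popTokens, ['The', 'Tense', 'Future'])
# poTokens = popTokens[starts[-1]:]   # keep everything from the last match on
```

Choosing which match to slice from, and where the article ends, still takes a look at the data, but the search removes most of the guesswork.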
In [69]:
popTokenText = nltk.Text(poTokens)
In [70]:
popTokenText.concordance('populism')
Building index...
Displaying 1 of 1 matches:
 are the editors of Latin American Populism in the Twenty-First Century ( Wood

In [72]:
popTokenText.concordance('venezuela')
Displaying 1 of 1 matches:
hich killed hundreds of civilians. Venezuela is still suffering the consequence

In [73]:
popTokenText.concordance('maduro')
Displaying 2 of 2 matches:
, 2014 Venezuelan President Nicolas Maduro waves to supporters during a campai
g Chávez’s death in March 2013 , Maduro won a special election by a mere 1.

In [74]:
popTokenText.concordance('chávez')
Displaying 4 of 4 matches:
launch the political career of Hugo Chávez , one of the officers involved. In 
of the officers involved. In 1998 , Chávez made a successful bid for the presi
e vote. He remains president today. Chávez redistributed wealth and created ne
er rate has more than doubled since Chávez first took office in 1998. Today , 
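To make clear what a concordance does, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `kwic` and the token-based window are assumptions for illustration. Like the searches above, it matches case-insensitively, which is why a lowercase query such as 'maduro' finds 'Maduro'.

```python
def kwic(tokens, word, width=3):
    """Return (left context, match, right context) for each hit of `word`.

    `width` is the number of tokens kept on each side of the match.
    """
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Usage with the notebook's variables (assumes poTokens already exists):
# for left, match, right in kwic(poTokens, 'chávez', width=6):
#     print(left, match, right)
```

Keep in mind that the counts shown above come from `poTokens`, a slice of the full token list, so a term may occur more often in the complete article.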
