Primer on Natural Language Processing (NLP)
Author: Susanne Lomatch
Natural language processing (NLP) is an outgrowth of the field of
computational linguistics, the statistical and/or rule-based modeling of
natural language from a computational (or algorithmic) perspective. Natural
language is any language that arises as an innate facility for language
possessed by the human intellect; it may be spoken, signed or written. Machine
learning (ML) algorithms are used in conjunction with language models to
recognize text in NLP systems, which may also employ speech models and
hardware/software specialized to process and recognize speech or even signed
(gesture-based) language.
Natural speech processing
(NSP) may be considered a separate discipline, since it involves speech models
(derived from phonetics as opposed to text linguistics) and speech signal processing methods
(acoustics). However, natural language started as spoken language, evolving
into written language. Evolution has rendered both a “synchronous” human skill,
and it makes sense to integrate language and speech models and processing for
systems that handle both. In general I follow the industry convention and use
NLP to include NSP, though NSP is not strictly an adjunct. Many in the field
refer to NSP practically as automatic speech recognition (ASR), which deals with the analysis of the linguistic content of
speech to automatically turn spoken words into text, though this is really a
subset of NSP. The opposite problem, natural speech synthesis (NSS), deals with
text-to-speech; it is also a subset of NSP, and is a difficult problem in its
own right.
Natural language understanding (NLU) is the comprehensive usage of NLP to
enable machines or AI systems the ability to “comprehend” machine-recognized
text and/or speech. It is a grand challenge of AI, simply because
“understanding” and “comprehension” differ from recognition/translation, even
if that recognition/translation is experienced and proficient.
There are plenty of
AI applications of NLP that do not require NLU, though NLU will revolutionize
AI systems when and if it is achievable to an acceptable level (i.e., the
system that employs it can reliably pass a suitable Turing Test, which may be
an upper limit of NLU, granted). It is an active area of debate as to what
precisely defines NLU, as the question of “what is thought” or “what is
consciousness” enters into the equation.
I think three key
aspects are required for NLU: the ability to draw inferences from
recognized text/speech, the ability to effortlessly disambiguate words or
phrases within context (objective semantics), and the
ability to handle abstraction (cross-modal or spatial), which ties to
understanding metaphors (subjective semantics). All three of these involve knowledge-based
reasoning. If we are to believe Searle [1], something even more than these
three aspects is required, which is why NLU is a grand challenge, much like cognitive
thought and consciousness.
Major applications of NLP/NSP include:
• Machine translation: automatic translation from one natural language to another
• Information retrieval and extraction: extraction is the additional recognition and tagging of semantic information, or of particular information into a structured representation (e.g., relationships, sentiment); both IR and IE are used for data mining, and as enablers for question-answering, expert systems, VPAs, etc.; may also include multilingual or cross-language information retrieval/extraction; IE will be a crucial component of the Semantic Web
• Sentiment analysis/opinion mining: application of IE and NLP to analyze and extract subjective content in text/speech information, specifically overall contextual polarity or writer/speaker attitudes (as is done with Twitter and other social media feeds)
• Summarization: automatic summarization of large quantities of text into a concise, abbreviated version
• Question-answering: provide a relevant answer to a user query
• User interfaces: natural language capable interfaces to specialized user systems
• Expert systems: e.g. Watson
• Virtual personal assistants: e.g. Siri, Evi, Google Majel
• Intelligent gaming
• Intelligent databases: e.g. the Semantic Web
• Dialogue systems: a natural language capable chatterbot with a good handle on human dialogue
In a separate review
(AI Review, Part 2) I included a short
description on how Watson and Siri use NLP/ASR
to accomplish tasks.
Key components of
NLP/NSP systems are:
• Natural language and speech models
• Modular software and hardware engines for
text and/or speech processing and recognition; may also include optical character recognition (OCR) capability
• ML algorithms that are applied to train
modules using training data (e.g., text corpora and speech corpora) to
recognize text/speech “off-line”, and that are integral in processing
text/speech “on-line”
Classification of
NLP systems in terms of functional capability is useful, and is effectively
accomplished through identifying the representation of “knowledge levels” that
humans use to extract meaning from text or spoken language, adapted from [2]:
NLP/NSP
“knowledge levels” representation:
• Acoustic/phonetic level: handles the physical properties of speech sounds (phones), and how to form phonemes (a phoneme is a set of phones that are cognitively equivalent, and the smallest segmental unit of sound employed to form meaningful contrasts between utterances) [acoustic/audio signal processing, phonetic parsing, phonetic segmentation, phonetic transcription, acoustic/phonetic speech recognition]
• Phonological level: handles the abstract characterization, meaning and interpretation of speech sounds (phonemes) within and across words, including phonotactic, alternant and prosodic content and context, syllables, the application of phonological rules, and how to form morphemes (the smallest semantically meaningful units in a language) [phonological parsing, speech processing and speech recognition]
• Morphological level: handles the analysis and interpretation of the smallest parts of words that carry a meaning (morphemes), including suffixes and prefixes, and how to form words [morphological parsing]
• Lexical level: handles the lexical meaning of words and parts of speech analyses, and how to derive units of meaning [lexical parsing, part-of-speech tagging]
• Syntactic level: handles the structural rules and roles of words and sentences (grammar), and how to analyze, interpret and form sentences [grammar parsing]
• Semantic level: handles how to analyze and interpret possible meanings of a sentence by focusing on the interactions among word-level meanings in the sentence, including the semantic disambiguation of words with multiple senses (word-sense disambiguation), and how to express meaning in a semantic representation [semantic parsing, semantic role labeling/tagging, semantic corpora]
• Discourse level: handles the structural rules and roles of different kinds of text using document structures, and how to analyze, interpret and form dialogues [discourse analysis]
• Pragmatic level: handles knowledge and meaning assigned to text or dialogue as a result of outside world knowledge, i.e., from outside the contents of the document, and how to use outside knowledge, skill and reasoning/inferencing (among other methods and tools) to analyze, interpret, express or apply contextual meaning [pragmatic analysis, AI expert system, knowledge base]
• Cognitive level: handles the understanding and usage of natural language and dialogue in some defined and testable capacity and fluency, e.g. via a Turing Test [NLU system, Artificial Cognitive System]
I have added key
machine processing functions covering the text/speech analysis, interpretation
and generation at each level in brackets. This processing integrally includes
ML techniques and algorithms in modern NLP systems. The graphic in Fig. 1 was
taken from [2], and is a simple but good example of how an NLP system might be
structured according to these levels to process a voice command that gets
turned into an executable UNIX command.
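To ground the lower levels, here is a minimal sketch using the NLTK toolkit (listed near the end of this primer), covering tokenization and lexical-level part-of-speech tagging; it assumes NLTK and its punkt and averaged_perceptron_tagger data packages are installed:

    import nltk
    # one-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    sentence = "Remove the old log files from the home directory."
    tokens = nltk.word_tokenize(sentence)   # lexical segmentation
    print(nltk.pos_tag(tokens))             # part-of-speech tags, e.g. ('files', 'NNS')

The higher syntactic and semantic levels are illustrated with parser sketches later in this primer.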
A few notes about
the above language knowledge levels.
Semantics contrasts with syntax, the study of the combinatorics of units of a language (without
reference to their meaning), and pragmatics, the
study of the relationships between the symbols of a language, their meaning,
and the users of the language. In some representations, meaning at the semantic
level is context-independent or context-free, with context-dependence or
context-sensitive analyses pushed to the pragmatic level.
However, word-sense disambiguation is context-sensitive, and is commonly dealt
with at the semantic level (though not very well by some NLP/AI systems –
how about these rather simple examples: “can you go get the file I need to
sharpen this tool, as this file recommends” and “I turned up the bass on the
radio while cleaning the bass I caught for dinner”). In the English language
alone, the most commonly occurring verbs each have eleven meanings (or senses)
and the most frequently used nouns have nine senses, but humans are able to
unambiguously understand and select the one sense or meaning that is intended
by the author or speaker. Disambiguation in an NLP system may rely on local
context and a corpus containing the frequency with which each sense occurs at the semantic
level through semantic parsing and tagging, or pragmatic knowledge outside of a
document or speech, such as a common sense ontology or knowledge base.
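As a toy illustration of word-sense disambiguation, NLTK implements the simplified Lesk algorithm, which picks the WordNet sense whose dictionary gloss best overlaps the surrounding words; it is a crude stand-in for the corpus- and knowledge-based approaches just described, and it often errs on exactly the hard cases above, which rather proves the point (assumes the wordnet and punkt data packages are installed):

    import nltk
    from nltk.wsd import lesk
    # one-time setup: nltk.download('wordnet'); nltk.download('punkt')

    ctx = nltk.word_tokenize("go get the file I need to sharpen this tool")
    sense = lesk(ctx, 'file')   # the WordNet Synset with the best gloss overlap
    print(sense)                # prints the chosen sense; no guarantee it is the tool sense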
When I talked about
natural language understanding (NLU), I mentioned objective vs. subjective
semantics. Word-sense disambiguation is an objective process, as word sense is
usually a static quantity in a language, even if common sense is applied. I
contrast this with metaphor or other figures of speech, which are subjective in natural language.
From a neuroscience point of view, metaphor is a language abstraction that is
manifested as the cognitive linkage of seemingly unrelated concepts in the
brain. These linkages are driven by genetics and learning, and may be the
result of greater hyperconnectivity between language centers (most notably the angular gyrus) and
other modes of processing (such as visual or auditory) in the brain [3]. This
cross-modal abstraction generates some simple common metaphors (“loud shirt”
“sharp cheese” “bright sound”), as do spatial abstractions (“life is a climb on
a tall mountain”). It is clear (at least to me) that such abstractions are the
hallmark of the human cognitive process, and a result of the highest levels of
language and cognitive processing in the brain, relying on the super-nested,
hyperconnected neural architecture to make it all happen. This is at the root
of why NLU and cognitive architectures are grand challenges in AI. As such, I
have added a “cognitive level,”
which describes NLU systems that are able to understand, use and apply natural
language to a level of competency and fluency that might include metaphor and
other nontrivial figures of speech.
Not only are the
levels of the above language knowledge representation interdependent, they can
interact in a dynamic sense, in a variety of orders (i.e., nonsequentially and
synchronously). These levels are structured hierarchically, though there exist
bidirectional interdependencies (see Fig. 2) – information gained at a
higher level of processing can be used to assist in a lower level of analysis,
e.g. pragmatic knowledge may be used to learn, classify and contextually use or
speak a new word, disambiguate word or phrase meanings, or to draw inferences
from a body of text. Cognitive level processing would potentially make this
synchronous understanding and expression of speech and language seem
effortless, intelligent and fluent. Machine learning (ML) is a key part of
processing at every level. Ultimately, an NLU system might resemble a neural
map of how the brain processes speech and language in a unified sense.
Key architectural
and algorithmic approaches to NLP systems are driven by linguistic models and
theories, and are typically classified as symbolic,
statistical or stochastic, connectionist or hybrid [2]. I have expanded upon each of these approaches below,
including basic concepts and Wiki links to concepts that deserve more depth, so
that readers can dig deeper to understand those concepts; I have also provided
links to recent research work. As a complement, I recommend the other two
primers I have prepared, on machine learning (ML) and knowledge representations & acquisition.
In recent years, NLP
research and systems have been based primarily on statistical and connectionist
approaches employing ML, extending older symbolic paradigms targeting the tasks
of part-of-speech tagging, chunking and parsing to statistical or connectionist language modeling and knowledge
representations. A comprehensive survey
of NLP research work incorporating more recent developments in deep learning,
and other worthy approaches such as recursive distributed networks, self-organizing
maps and Bayesian belief networks, is not readily available, and so the
embedded references below are meant to serve as the basis of an ongoing survey
that I intend to build upon. The challenge is to use newer methods to
effectively address language structure beyond local sequences, such as
long-term dependencies, nested/recursive structure, and hetero-associative
and/or cross-modal phenomena in language.
Recent NSP/ASR
research and systems have followed a similar progression, having embraced
successful statistical approaches early. A decent 2009 review by industry and
academic leads [4] highlighted the status of the field and what the grand challenges
are going forward. Some of the points made on speech cognition follow what I
discussed above regarding NLU. As the review also indicates, state-of-the-art
ASR systems are designed to transcribe spoken utterances, and do not tackle
more complex gradations of speech comprehension. As an intermediate solution
short of a cognitive systems approach, they recommend a comprehension “mimicry”
system that is trained on acquiring speech and language knowledge in much the
same way that a human child progresses through the process. Such an approach
indicates the importance of the learning process; to make such a solution
efficient and practical it seems to me that the cognitive systems approach is
still required in some form, otherwise we are either looking at just another
variant of Watson, which fills a large air conditioned room, or a cheaper
system that employs many tricks and kluges to feign simpler comprehension (the
equivalent of a chatbot).
Key architectural
approaches to NLP/NSP:
• Symbolic
o Based on explicit representations of facts about language through well-understood knowledge representation schemes and associated algorithms
o Dominated by the Chomskyan theories of linguistics (Chomsky hierarchy) and automata theory, which seek to unify natural languages and machine languages
o Enlivened by newer approaches that focus on functional knowledge engineering, including ontologies and embedded reasoning
o Examples:
§ Formal rule-based systems
• Focused on formal grammar, a set of formation rules for strings in a formal language: the rules describe how to form strings from the language's alphabet that are valid according to the language's syntax (compare natural language syntax to programming language syntax)
• A grammar does not describe the meaning of the strings or what can be done with them in whatever context (semantics, pragmatics), only their form
• Chomsky hierarchy and its mapping to (automata theory equivalents):
o Regular grammars (finite state machines or finite state automata)
o Context-free grammars (pushdown automata)
o Context-sensitive grammars (linear bounded automata)
o Unrestricted grammars / recursively enumerable languages (Turing machines)
• Cross-classes of grammars exist within the Chomsky hierarchy:
o Fillmore’s case grammar
o Head-driven phrase structure grammar
§ Logic-based systems
• Focused on formal semantics, the understanding of linguistic meaning by constructing precise mathematical models of the principles that speakers use to define relations between expressions in a natural language and the world which supports meaningful discourse
• Models use first-order logic as a semantic representation, or categorical logic in some cases; formal logic is generally applied
• Gödel’s completeness theorem ties formal semantics to formal syntax/grammar in first-order logic
• Examples:
o Categorial grammar and combinatory categorial grammar
§ Functional-based systems
• Models that consider languages to have evolved under the pressure of the particular functions that the language system has to serve
• Driven by semantics, discourse and pragmatics (and NLU and human cognition) with a focus on context in a dynamic sense
• Examples:
o Cognitive grammar and cognitive semantics
o Computational semantics (including the Semantic Web)
o Natural language knowledge levels (shown above in this primer)
o Advantages:
§ Well understood in terms of formal descriptive/generative power and practical applications
§ Can be used for modeling phenomena at various linguistic knowledge levels (multiple dimensions of patterning)
§ Computationally efficient algorithms for analysis and generation
§ Work well when the linguistic domain is small and well-defined
o Disadvantages:
§ Tend to be fragile, leading to parsing failures – cannot easily handle minor, yet non-essential deviations of the input from the modeled linguistic knowledge, unless robust parsing techniques are added
§ Don’t scale very well
§ Require use of experts such as linguists, phonologists, and domain experts, since such models cannot be instructed to generalize (learn from example)
o Algorithms:
§ Natural language parsers (speech/text) can be designed to target phonological, morphological, lexical, syntactic or semantic information, but in general focus on the syntactic through the definition of a grammar (a toy example is sketched just below this section)
§ Examples:
• Phonological and morphological parsers: finite state transducer
• Syntactic parsers: a list focusing on context-free grammars is HERE
• Semantic parsers: shallow parsers, deep parsers, contextual parsers
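To make the symbolic approach concrete, here is a minimal sketch using NLTK (one of the toolkits listed at the end of this primer): a tiny hand-written context-free grammar and a chart parser. The grammar and sentence are my own toy assumptions, not a recommended design.

    import nltk

    # a toy CFG; real grammars run to thousands of rules
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Det N
        VP -> V NP
        Det -> 'the'
        N -> 'dog' | 'cat'
        V -> 'chased'
    """)
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the dog chased the cat".split()):
        print(tree)
    # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))

Note the fragility discussed above: any input outside the rules ("the dog chased a cat") simply fails to parse.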
• Statistical/Stochastic
o Based on various probabilistic techniques to develop approximate generalized models of linguistic phenomena, derived from actual examples/samples of these phenomena (e.g., linguistic corpora)
o Extends formal automata algorithms (or other symbolic algorithms) to include probabilistic states
o Employs machine learning (ML) to train models (estimate probabilistic model parameters) against sample data; models are then used in turn to recognize language patterns or sequences to some level of accuracy, efficiency, etc.
o Examples:
§ Hidden Markov models (HMMs)
• Extends finite-state machines (regular grammars) to include states that have two sets of probabilities associated with them: one determines which symbol to emit from this state (emission or output probabilities); the second set determines which state to visit next (transition probabilities)
• Parameter learning/estimation algorithms
(e.g. Baum-Welch or forward-backward) are applied to find a set of state
transition and output probabilities via some defined criterion, given an output
sequence or set of sequences (e.g. training data)
• Search optimization algorithms (e.g. Viterbi) are applied to find the hidden state sequence that is most likely to have generated the observed output sequence, using the trained HMM
• Also classified as sequence learners, dynamic
Bayesian networks, or a type of Markov network (see below)
• Especially useful in recognizing
temporal-based sequences in ASR (a good tutorial on HMMs in ASR is located HERE [Jua04] and HERE [Rab04])
• For a review of their use in part-of-speech tagging see [Jur08]
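A minimal sketch of Viterbi decoding over a two-state HMM, in plain Python; the start, transition and emission probabilities here are hand-set assumptions, standing in for what Baum-Welch training on a corpus would produce:

    states = ['Noun', 'Verb']
    start = {'Noun': 0.6, 'Verb': 0.4}
    trans = {'Noun': {'Noun': 0.3, 'Verb': 0.7},   # transition probabilities
             'Verb': {'Noun': 0.8, 'Verb': 0.2}}
    emit = {'Noun': {'fish': 0.6, 'swim': 0.4},    # emission probabilities
            'Verb': {'fish': 0.3, 'swim': 0.7}}

    def viterbi(words):
        # V[t][s]: probability of the best state path ending in state s at time t
        V = [{s: start[s] * emit[s][words[0]] for s in states}]
        back = [{}]
        for t in range(1, len(words)):
            V.append({}); back.append({})
            for s in states:
                prev = max(states, key=lambda p: V[t-1][p] * trans[p][s])
                V[t][s] = V[t-1][prev] * trans[prev][s] * emit[s][words[t]]
                back[t][s] = prev
        path = [max(states, key=lambda s: V[-1][s])]
        for t in range(len(words) - 1, 0, -1):
            path.insert(0, back[t][path[0]])
        return path

    print(viterbi(['fish', 'swim']))   # ['Noun', 'Verb'] with these numbers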
§ N-gram language models
• Probabilistic language models that predict the next item in a contiguous sequence of n items from a given sequence of text or speech, in the form of an (n-1)th order Markov model
• Items can be phonemes, syllables, letters, words or base pairs according to the application
• N-grams are collected from a text or speech corpus
• Effective at
modeling language data; not meant to model more complex, long-range
dependencies in language
• Hybrid models incorporate Bayesian inference (maximum likelihood and maximum a posteriori estimates)
• For a survey of these models see [Jur08]: as they state, recent research applying n-gram models focuses on very large n-grams, e.g., in 2006, Google publicly released a very large set of n-grams that is a useful research resource, consisting of all the five-word sequences that appear at least 40 times in 1,024,908,267,229 words of running text; there are 1,176,470,663 five-word sequences using over 13 million unique word types; large language models generally need to be pruned to be practical, using techniques found HERE [Chu07] and elsewhere
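A toy bigram (n = 2) model in plain Python, with add-one (Laplace) smoothing so unseen bigrams get nonzero probability; the three-sentence "corpus" is my own stand-in for the massive corpora described above:

    from collections import Counter

    corpus = "the dog barks . the cat meows . the dog runs .".split()
    vocab = set(corpus)
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_next(word, prev):
        # add-one smoothed estimate of P(word | prev)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

    print(p_next('dog', 'the'))     # 0.3 ('the dog' occurs twice)
    print(p_next('meows', 'dog'))   # ~0.11 (unseen; smoothing only)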
§ Probabilistic context-free grammars (PCFG)
• Generally extends context-free grammars to
include probabilistic states
• Probabilities can be assigned based on
rule-use as exemplified by training data - the probability of each rule’s
“significance” can be determined based on the frequency of the rule’s
contribution to successful parses of training sentences
• As with the HMM, parameter learning
algorithms (e.g. inside-outside) and parsing optimization algorithms (e.g. CYK-WCFG) apply
• Recent implementation examples: HERE [Pet06]
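A minimal PCFG sketch in NLTK; the rule probabilities are hand-set assumptions (in practice they would come from treebank rule counts or inside-outside training), and the Viterbi parser returns the most probable parse:

    import nltk

    pcfg = nltk.PCFG.fromstring("""
        S -> NP VP [1.0]
        NP -> Det N [0.7] | 'fish' [0.3]
        VP -> V NP [0.5] | V [0.5]
        Det -> 'the' [1.0]
        N -> 'men' [0.5] | 'fish' [0.5]
        V -> 'fish' [0.4] | 'like' [0.6]
    """)
    parser = nltk.ViterbiParser(pcfg)
    for tree in parser.parse("the men fish".split()):
        print(tree)   # the most probable tree, annotated with its probability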
§ Statistical semantic parsing
• Extends semantically-annotated grammars to
include probabilistic states
• Various ML techniques and statistical parsers
are applied for parameter learning and parsing optimization, with the
approaches dependent on the level of semantic information (e.g. shallow vs.
deep)
• This area has received much research focus in
the last decade, given the motivation to accurately recognize semantic content
in speech and text, following initial work by Gildea and Jurafsky on semantic role labeling
§ Hybrid or semantic HMMs
• Each HMM hidden state can generate a sequence
of words or correspond to multiple observations; in the case of semantics,
hidden states are semantic slot labels, while the observed words are the
fillers of the slots - usage is defining how a sequence of hidden states,
corresponding to slot names, could be decoded from (or could generate) a
sequence of observed words
• Generative models with two components: the P(C) component represents the choice of
what meaning to express; it assigns a prior over sequences of semantic slots,
computed by a concept n-gram; P(W|C) represents the choice of what words
to use to express that meaning; the likelihood of a particular string of words
being generated from a given slot - it is computed by a word n-gram conditioned on the semantic slot
• Applied to semantic “understanding”
processing in dialogue systems; see [Jur08] for more detail
• These models are very similar to the HMM
models for named entity recognition
o Advantages:
§ Effective in modeling language performance
through training based on most frequent language use
§ Especially useful in modeling linguistic phenomena that are not well understood from a competence perspective, e.g. speech; many modern ASR systems are based on HMMs applied to phoneme recognition (see the HMM tutorials cited above [Jua04, Rab04])
§ Effectiveness is highly dependent on the
volume of training data available; generally more training data translates to
better performance
§ Statistical models (esp. HMM) are more robust
at handling variations and noise, and can be used to model nuances and
imprecise language concepts
o Disadvantages:
§ Run-time performance is generally linearly
proportional to the number of distinct classes (symbols) modeled, and thus can
degrade considerably as classes increase; this holds for both training and
pattern classification
§ Effectiveness is tightly bound to extensive,
representative, error-free text corpora and speech corpora, the
production of which may be a time-consuming and error-prone process, depending
on the application; to answer this deficiency, recent research focuses on
systems trained on unannotated data (unsupervised learning)
§ For the task of predicting the probabilities
of sentences for a given language using n-gram
models, n-gram counts become unreliable with large n, as the
number of possible n-grams grows with the number of distinct words to
the power of n
o Algorithms: See above examples
• Connectionist
o Based on complex, massively interconnected sets of simple, nonlinear components that operate in parallel
o Trained networks are generally applied to recognize language and speech patterns (sequences and higher level structures)
o Recent NLP research has focused on these approaches (especially multilayer networks) to address linguistic deep structure (in the semantic, pragmatic and cognitive meaning senses), where deep learning and cortical learning are applied to learn such deep structure features, avoiding time consuming parse trees
o Recent NSP/ASR research has also focused on these architectures, especially stochastic networks
o Examples:
§ Artificial neural networks (ANNs)
• Network nodes are artificial neurons
• Network connections are (synaptic) weights
encoding the strength of the connection
• Each network node is associated with an
activation or transfer function F(w(x)) that
takes as input a set of activation weights for the node's parent variables (dendrites)
and outputs a threshold value (axon) that propagates to the input of the next
layer through a (synaptic) weight function
• Acquired knowledge through training using ML
techniques is stored in the pattern of interconnection weights among components
– weights are updated/changed through learning
• Performance is gauged by number and type of
inputs/outputs, number of nodes and layers, connectivity, choice of activation
threshold/function, choice of training process and update function, among other
metrics
• Stochastic artificial neural networks (SANN) possess stochastic neuron transfer functions
and/or stochastic weights to allow for random fluctuations, rendering them
generally more robust
• Classes of ANNs applied to NLP/ASR:
o Nonrandom ANNs (NANNs) or stochastic ANNs (SANNs); note applications can include either, in various geometries described below (taken from [Hend10] and a variety of more recent sources; some of the geometries listed may not have been specifically applied to NLP/ASR, but may have potential use)
o Multilayer feedforward neural networks
§ Information flows in one direction, with no
intralayer connections between hidden states (acyclically directed)
• Used for function approximation,
categorization or classification, and sequence modeling
• Computes a function from a fixed-length
vector of input values to a fixed-length vector of output values
• Usually trained via backpropagation, an
iterative gradient descent process with potentially slow, nonoptimal
convergence
• Generally insufficient for NLP tasks where
inputs/outputs are arbitrarily long sequences (wordy sentences)
§ Auto-associative MLP or autoencoder
• Target output pattern/layer is identical to
the input pattern/layer, with the number of hidden layer nodes considerably
less than the input/output layer nodes
• Hidden layers are used as encoders, enabling
the learning of compressed representations while reducing the dimension of the
feature space
• Generally more tractable than standard MLPs
in determining optimal parameter values during weight training: optimal weight
values can be derived using linear techniques, such as SVD [Bou88]
• For many hidden layers, as problem becomes
increasingly intractable (computationally intensive nonlinear optimization),
special pre-training schemes have been developed HERE [Hin06(1)] by treating each bi-layer as a
restricted Boltzmann machine (RBM, see below): learned feature activations of
one RBM are used as the ‘data’ for training the next RBM in the stack; after
the pre-training, the RBMs are ‘unrolled’ to create a deep autoencoder, which
is then fine-tuned using backpropagation of error derivatives
• Successfully applied to ASR and image
processing, and to deep structure NLP using deep networks/learning (see below)
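A bare-bones numpy autoencoder sketch: an 8-3-8 network trained by plain backpropagation to reproduce its input through a 3-unit bottleneck; the data, layer sizes and learning rate are toy assumptions, and biases are omitted for brevity:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((20, 8))               # 20 samples, 8-dim inputs
    W1 = rng.normal(0, 0.1, (8, 3))       # encoder: 8 -> 3 compressed code
    W2 = rng.normal(0, 0.1, (3, 8))       # decoder: 3 -> 8 reconstruction
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for epoch in range(2000):
        H = sigmoid(X @ W1)               # hidden code
        Y = sigmoid(H @ W2)               # reconstruction
        dY = (Y - X) * Y * (1 - Y)        # backprop: the target output is the input itself
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= 0.5 * H.T @ dY
        W1 -= 0.5 * X.T @ dH

    print(np.mean((sigmoid(sigmoid(X @ W1) @ W2) - X) ** 2))  # small reconstruction error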
o Multilayer recurrent neural networks (RNN)
§ Information can generally flow in arbitrary
directions (cyclically directed or undirected), with intralayer
connections between hidden layers, forming internal addressable memory states
and allowing for dynamic feedback
§ Recurrent connections between hidden layers
allow the network to compute a compressed representation that includes
information from previous compressed representations; by performing this
compression repeatedly, at each step adding new input features, a recurrent
network can compress an unbounded sequence into a finite vector of hidden
features
§ RNNs can be divided into two useful classes: autonomous RNNs with fixed temporal
inputs (Hopfield nets, Boltzmann machines, RAAMs, BAMs), and non-autonomous RNNs with time-varying
inputs (recurrent MLPs, recursive ANNs, LSTMs, MRNNs, BRNNs, FRNNs)
§ Recurrent MLP
• An MLP (feedforward) whose hidden layers
include internal links that loop back towards the input (forming internal
addressable memory states), allowing for arbitrary sequence length inputs
• Also known as a simple recurrent network
(SRN) or Elman network for the special case of three layers
• Generally insufficient for NLP tasks where
there are nonlocal sequence correlations (structure more complex than
sequences), as the pattern of interconnections between hidden layers imposes an
inductive bias in learning
§ Recursive neural networks (RvNN)
• A recurrent MLP applied to input structures more complex than sequences, such as trees, graphs or functional logic (semantic structures such as predicate-argument): a copy of the network is made for each node of the tree or graph, and recurrent connections are placed between any two copies which have an edge in the tree or graph [Fra98]
• Pattern of interconnection between hidden
layers better reflects locality in the structure being modeled
• RvNNs have been successfully used to learn
distributed representations of structured objects such as logical terms, see
RAAMs below
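A minimal numpy sketch of the recurrent compression just described: an Elman-style hidden state that folds a sequence of arbitrary length into one fixed-size vector (weights are random and untrained here; the sizes are my own assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_h = 5, 4
    Wxh = rng.normal(0, 0.5, (d_in, d_h))   # input -> hidden
    Whh = rng.normal(0, 0.5, (d_h, d_h))    # hidden -> hidden (the recurrence)

    def encode(sequence):
        h = np.zeros(d_h)                   # initial hidden state
        for x in sequence:                  # works for any sequence length
            h = np.tanh(x @ Wxh + h @ Whh)  # fold the new input into the state
        return h                            # fixed-size summary of the sequence

    print(encode(rng.random((7, d_in))))    # e.g. 7 "word vectors" -> 4 numbers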
§ Recursive auto-associative memory/network/autoencoder (RAAM)
• A recurrent auto-associative network/encoder
applied to input structures more complex than sequences, such as trees, graphs
or functional logic [Pol90, Gol96]
• Labeling RAAMs (LRAAMs) have found the most use in NLP [Gol96]
• Also classified as a type of RvNN
§ Simple Synchrony ANNs (SSNN)
• Similar to an RvNN, but hidden layer
connectivity is tailored to minimize inductive bias [Hend10]
§ Multilayer Hopfield network
• A multilayer recurrent ANN with symmetric (undirected)
connections between nodes, including hidden layers, allowing for undirected
information flow and for internal addressable memory states; each neural node
has a binary activation function
• Also known as an auto-associative memory network
§ Boltzmann machines (BM)
• Stochastic version of a multilayer Hopfield network, or more generally, a recurrent SANN with stochastic transfer functions (threshold values are activation probabilities)
• Also a type of Markov random field (MRF) or Markov network
• Become generally intractable as learning is
applied to sufficiently complex multilayer networks: e.g., exact maximum
likelihood learning is intractable, as exact computation of both the data-dependent
expectations and the model’s expectations takes a time that is exponential in
the number of hidden units
• Good review HERE
§ Restricted Boltzmann machines (RBM)
• BMs that do not allow intralayer connections
between hidden units, and are tractable as learning is applied to complex
multilayer networks (therefore these are not technically recurrent networks,
but the stochastic binary equivalent of an autoencoder)
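A toy numpy sketch of one-step contrastive divergence (CD-1) for a small binary RBM, the update at the heart of the layer-wise pre-training schemes cited above; sizes, data and learning rate are assumptions, and biases are omitted for brevity:

    import numpy as np

    rng = np.random.default_rng(2)
    V = (rng.random((10, 6)) > 0.5).astype(float)  # 10 binary visible vectors
    W = rng.normal(0, 0.1, (6, 4))                 # 6 visible x 4 hidden units
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for step in range(100):
        ph = sigmoid(V @ W)                          # positive phase: P(h|v) on the data
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T)                        # negative phase: reconstruct visibles
        ph2 = sigmoid(pv @ W)                        # and re-infer hiddens
        W += 0.1 * (V.T @ ph - pv.T @ ph2) / len(V)  # CD-1: data minus model statistics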
§ Long short-term memory network (LSTM)
• A multilayer RNN in which the hidden layers
are replaced by a “memory block” containing one or more memory cells and a pair
of adaptive, multiplicative gating units which gate input and output to all
cells in the block
• Each memory cell state is associated with a
recurrent self-connected linear unit that allows for the “regulation” of local
error back flow, enforcing non-decaying error flow “back into time”
• LSTMs solve tasks that general RNNs cannot,
due to their failure to learn in the presence of long time lags between
relevant input and target events [Schm12]
§ Multiplicative RNNs (MRNN)
• An RNN variant that uses multiplicative (or
“gated”) connections which allow the current input character to determine the
transition matrix from one hidden state vector to the next
• Trained using Hessian-free optimization techniques,
MRNNs overcome the difficulties associated with training traditional RNNs,
making it possible to apply them successfully to challenging sequence problems;
see detail and NLP applications HERE [Sut11]
• LSTM (above) makes it possible to handle
datasets which require long-term memorization and recall but even on these
datasets it is outperformed by using a standard RNN trained with the HF
optimizer; see [Sut11]
§ Multilayer bidirectional recurrent ANN (BRNN)
• A multilayer recurrent ANN with bidirectional
connections between nodes, including hidden layers, allowing for information
flow in both directions (feedback, feedforward), and for internal addressable memory
states
• Successfully applied to ASR [Schu97]
• A bidirectional associative memory network
(BAM) is a type of BRNN
(bi-layer feedback network), and the hetero-associative counterpart to an
auto-associative Hopfield network; also classified as a hetero-associative
memory network or a “resonance network”; see [Kos88]
§ Fully recurrent neural networks (FRNN)
• Each node in the network has a directed or
undirected connection to every other node (input, output or hidden), and a
time-varying activation/transfer function, allowing for internal memory states
and dynamic temporal behavior (feedback and feedforward)
• Practical use is limited due to complexity
and intractability of training
o Convolutional neural networks (CNN)
§ Variants of MLPs incorporating biologically-inspired features
§ Introduces layers that apply convolutions on
their input which take into account locality information in the data, i.e. they
learn features from image patches or windows within a sequence
§ Exploit spatially local correlation by
enforcing a local connectivity pattern between neurons of adjacent layers; the
input hidden units in the n-th layer
are connected to a local subset of units in the (n-1)-th layer, which have spatially contiguous receptive fields
§ Architecture confines the learnt “filters”
(corresponding to the input producing the strongest response) to be a spatially
local pattern (since each unit is unresponsive to variations outside of its
receptive field with respect to the retina, in the case of vision); stacking
many such layers leads to “filters” (not anymore linear) which become
increasingly “global” (i.e spanning a larger region of pixel space)
§ Operationally, this architectural approach specifies a feature extractor (that “sparsely” filters raw input) which outputs a feature vector to a trainable classifier (an MLP or other ANN, such as a SANN or DNN), and these are trained jointly to optimize the class scores or probability estimates
§ CNNs are easier to train than other networks,
including DNNs and DBNs that employ layer-by-layer deep learning techniques
(see below and [Ben09]); multitask learning and direct optimization of a joint
objective function can be accomplished with good results
§ Applied to NLP: HERE [Ben03], HERE [Mor05], HERE [Col08], HERE [Wes08], HERE [Mni07]
and HERE
[Mni09]
• For the NLP case, the CNN results in word
feature vectors which are trained to reflect exactly the word similarities
which are needed by the probability estimation model, and they work better than
finding similarities based on some independent criteria (such as latent
semantic indexing), or trying to specify them by hand
• Implementations outperform competitive n-gram models; as the size of the word
window n was increased, the CNN
continued to improve with larger n
while the n-gram models had stopped
improving, indicating that the representation of word similarity feature
vectors succeeded in overcoming the unreliable statistics of large n-grams
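A numpy sketch of the core convolution-plus-pooling step described above, applied to text: one filter slides over windows of n consecutive word vectors, and max-pooling collapses the variable sentence length into a fixed feature (the embeddings and filter are random stand-ins for trained values):

    import numpy as np

    rng = np.random.default_rng(3)
    sent = rng.random((9, 5))                 # 9 words, 5-dim word feature vectors
    n = 3                                     # word window size
    filt = rng.normal(0, 0.5, n * 5)          # one convolutional filter

    # one response per window of n consecutive words (weights shared across positions)
    responses = np.array([sent[i:i+n].ravel() @ filt
                          for i in range(len(sent) - n + 1)])
    print(responses.max())                    # max-pooled feature: length-independent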
o Deep neural networks (DNN)
§ Generally multilayer ANNs (feedforward or
recurrent) whose hidden layers are successively trained layer-by-layer (“deep learning”
– e.g., unsupervised learning/training starts with the base MLP, RNN,
autoencoder or RBM, and its output is used as training data input for the next
higher layer structure, with the process repeated until the entire network is
initialized, and can seed a final step of supervised training or fine-tuning),
rendering the overall DNN an improved generative model, especially on deep
language or speech structure; this process differs from the usual single-pass
training; see [Ben09] HERE and refs therein; see also a good tutorial HERE
§ Approach is to exploit layer-local
unsupervised criteria, i.e., the idea that injecting an unsupervised training
signal at each layer may help to guide the parameters of that layer towards
better regions in parameter space
§ CNNs, including the “Neocognitron,” are
considered types of DNNs (class of machines/models that can learn a hierarchy
of features by building high-level features from low-level ones, thereby
automating the process of feature construction)
§ RNNs can also be viewed as DNNs (an RNN can
be “unfolded in time” by considering the output of each neuron at different
time steps as different variables, making the unfolded network over a long
input sequence a very deep architecture)
§ Alternative deep learning approaches involving the constraint of feature vectors at each layer in a deep (convolutional) net to be sparse and overcomplete for unsupervised pre-training can be found HERE [Lec07], and refs therein
§ Alternative deep learning approaches
employing Hessian-free optimization techniques may be useful for ML in
difficult to optimize deep architectures, such as RNNs, see HERE [Mar09]
o Topological maps (self-organizing maps, SOM)
• Biologically-inspired model; maps
resemble topographically organized maps found in cortices of mammalian brains
• Defines a topology-preserving mapping between an often highly dimensional input space and a low dimensional, most typically 2-D, space; self-organization is introduced by having the notion of neighboring units, whose weights are adjusted in proportion to their distance from the winning unit
• Self-organization process can
discover semantic relationships in sentences; SOMs have also been used in
practical speech recognition [Koh90]
• See a review HERE
• NLP advantages: unsupervised
learning, self-organization, emergent structure from representations,
plasticity modeling, Hebbian learning
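A minimal numpy SOM sketch: for each input, find the best-matching unit on the grid and pull it and its neighbors toward the input, with the pull decaying with grid distance (grid size, data and rates are toy assumptions; a real SOM also shrinks the neighborhood and learning rate over time):

    import numpy as np

    rng = np.random.default_rng(4)
    grid = rng.random((6, 6, 3))              # 6x6 map of 3-dim weight vectors
    for x in rng.random((200, 3)):            # 200 input samples
        d = np.linalg.norm(grid - x, axis=2)
        wi, wj = np.unravel_index(d.argmin(), d.shape)   # winning unit
        for i in range(6):
            for j in range(6):
                nb = np.exp(-((i - wi) ** 2 + (j - wj) ** 2) / 2.0)  # neighborhood falloff
                grid[i, j] += 0.1 * nb * (x - grid[i, j])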
• Choice of ANN/training algorithm(s) depends on the application and the type of ML to be applied: supervised, reinforcement, unsupervised, deep
o Supervised learning:
§ Multilayer perceptron / backpropagation algorithm
• Learns by classifying patterns
• Recent NLP research examples: HERE [Col07]
o Specific task: semantic role labeling via shallow semantic parsing
• Applied to ASR: HERE [Hos99]
o Reinforcement learning:
§ ANN / dynamic programming (“neurodynamic programming”)
• Learns by optimizing using observations and
feedback
• For SANNs, stochastic states are Markov decision processes, which might be a useful approach for
dialogue architectures and conversational agents (i.e., applied to NLP using
SANNs; such an approach might be based on [Lev00] and might even include hybrid
deep belief networks, CNNs, RNNs)
o Unsupervised and semi-supervised learning:
§ ANN / hidden Markov model (HMM)
• Learns by unsupervised induction –
weights are updated through a stochastic HMM process (HMM acoustic models are
used to train the ANN recognizer; semi-supervised learning and joint
optimization can be performed as well)
• Applied specifically to ASR: HERE [Bea01] and HERE [Scha00]
§ Self-organizing map (SOM)
• Learns by building a map using input examples (unsupervised clustering)
• Recent NLP research examples: HERE [Bur11], HERE [Pov06], HERE [Li02] and HERE [Hon97]
o Deep learning:
§ Deep ANN (DNN) / semi-supervised learning, multitask learning
• Learns features relevant to the tasks at hand
given very limited prior knowledge
• Tasks are integrated into a single system,
which is trained jointly; all the tasks except the language model are
supervised tasks with labeled training data; the language model is trained in
an unsupervised fashion, jointly with the other tasks
• Recent NLP research examples: HERE [Col08] and HERE [Wes08]; see also a good review HERE
§ DNN / hidden Markov model (HMM)
• Applied to context-dependent, large
vocabulary ASR specifically, but can be generalized to NLP
• Learns phones/phonemes using semi-supervised
learning, through pre-training of a DNN
• Outperform conventional context-dependent HMMs by 5-10%
• Recent research examples: HERE [Dah12] and HERE [Jai12]
§ Recursive autoencoder (RAAM) /
semi-supervised learning
• General tool for predicting parse tree
structures for NLP
• Captures recursive features of natural
language
• Learned feature representations capture
syntactic and compositional-semantic information
• Outperform PCFGs on functionality and
accuracy
• Recent research examples: HERE [Soc11]
§ Markov random fields (MRFs)
• Network nodes are random variables having a Markov property, arranged in an undirected graph (a.k.a. Markov network)
• Network connections or nodal edges represent a probabilistic dependency between variables
• Markov networks can represent certain dependencies that a Bayesian network cannot (such as cyclic dependencies); they cannot represent certain dependencies that a Bayesian network can (such as induced dependencies)
• Markov models are (generally) noncausal, though hybrid approaches can incorporate some causality
• Classes of MRFs applied to NLP/ASR:
o Hidden recurrent/recursive networks (Boltzmann machines, deep BMs)
• As explained above, generally intractable unless a deep learning technique using pre-training of an RBM is applied, or other deep learning strategies; see [Ben09], [Sal09] (HERE) and [Sal10] (HERE) and refs therein; see also [Myl99] for approximate variational techniques
• Unlike DBNs (see below), the approximate inference procedure, in addition to an initial bottom-up pass, can incorporate top-down feedback, allowing deep BMs to better propagate uncertainty about, and hence deal more robustly with, ambiguous inputs [Sal09]
• Recent research applications for ASR focus on using the pre-trained RBM to seed a DNN; see [Dah12] and [Jai12]
• Recent research applications for NLP focus on using a pre-trained RBM to seed a BM; see [Sal09] and [Sal10]
o Hybrid random fields
• As described above, extensively applied in ASR; for a review of recent approaches, see refs in [Dah12] HERE
o Conditional random fields (CRFs)
• Used for structured sequence prediction, where context is important
• A supervised (discriminant) ML technique
• CRF models define a conditional probability p(Y|x) over label sequences Y given a particular observation sequence x, rather than a joint distribution over both label Y and observation x sequences; as such, these models support tractable inference and represent the data without making unwarranted independence assumptions
• CRFs outperform both HMMs and MEMMs on a number of real-world sequence labeling tasks; see refs in HERE [Wal04]
• Recent ASR research examples: HERE [Hif09] and HERE [Yu10]
o Deep belief networks (DBNs)
• Also known as sigmoid belief networks (SBNs): hybrid generative models where the model construction module uses Bayesian network techniques, while the probabilistic reasoning module is implemented as a massively parallel Boltzmann machine; for good early reviews, see HERE [Nea90], HERE [Hin95], HERE [Sau96] and HERE [Myl99]
• A recent DBN architecture example often cited for its efficient deep learning algorithm is found HERE [Hin06(2)] and a short review HERE; this work introduced the concept of using pre-training of RBMs for tractable, efficient overall training (fine-tuning) of a DBN or DNN, whereby the first two layers of the architecture form an undirected associative memory and the remaining layers form a directed acyclic graph that converts the representation in the associative memory into observable variables such as the pixels of an image or the probability of words in a sequence
• Hybrid CNN-DBN architectures have also been formulated that show improved performance for vision tasks; see HERE [Lee09]
• Recent ASR research examples: HERE [Moh12] and HERE [Sar11]
• Recent NLP research examples: HERE [Tit07], HERE [Hen08], HERE [Des09] and HERE [Zho10]
§ Bayesian networks (BNs)
• Network nodes are Bayesian random variables, arranged in a directed acyclic graph
• Network connections represent probabilistic
dependence between variables, with conditional probabilities encoding the
strength of the dependencies
• Each network node is associated with a conditional
probability distribution function P(x|Parents(x)) that
takes as input a particular set of values for the node's parent variables and
gives the probability of the variable represented by the node
• Acquired knowledge is stored in the pattern
of conditional probabilities, set by a priori knowledge (nodal evidence) and
changed through learning or inferencing
• Networks calculate posterior probabilities of
an event as output, given a priori nodal evidence – Bayesian nets
generate a probabilistic output of event occurrences
• Bayesian networks/models are causal, and are
often referred to as belief networks that employ evidentiary reasoning and
inferencing
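A toy Bayesian network in plain Python, with hand-set conditional probability tables (my own illustrative numbers): Rain and Sprinkler are parents of WetGrass, and the posterior P(Rain | WetGrass) is computed by brute-force enumeration over the joint, the kind of exact inference that becomes NP-hard for general network structures, as noted below:

    from itertools import product

    P_rain = {True: 0.2, False: 0.8}
    P_sprk = {True: 0.1, False: 0.9}
    P_wet = {(True, True): 0.99, (True, False): 0.9,   # P(wet | rain, sprinkler)
             (False, True): 0.8, (False, False): 0.05}

    def joint(rain, sprk, wet):
        p = P_rain[rain] * P_sprk[sprk]
        return p * (P_wet[(rain, sprk)] if wet else 1 - P_wet[(rain, sprk)])

    # P(Rain=true | Wet=true): sum out the sprinkler, normalize over rain
    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    print(num / den)   # ~0.65 with these tables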
• Classes of BNs applied to NLP/ASR:
o Hybrid random fields
§ These include dynamic BNs which are a hybrid
MRF (HMMs, hybrid HMMs), CRFs, SBNs and DeepBNs, see above
§ Notable is the burgeoning field of “statistical relational learning,” which features hybrids of MRF/BN models,
and ML and probabilistic reasoning techniques to deal with them (specifically
probabilistic inductive logic programming), as applied to designing database
systems; see HERE [Get07], HERE, HERE and HERE
§ Hierarchical
temporal memory (HTM), a hybrid
BN-MRF model inspired by the visual cortex and applied to model artificial vision
among other applications, may be useful for NLP/ASR modeling as well; the power
of this model is its predictive capability, while it is limited in its
application to time-dependent sequences over more complex data structures (this
limitation may be overcome with some theoretical effort)
• HTM assumes a hierarchy of nodes where each
node learns spatial coincidences and then learns a mixture of Markov models
over the set of coincidences; the hierarchy of the model corresponds to the
hierarchy of cortical regions in the brain; the nodes in the model correspond to
small regions of cortex
• HTM networks use Bayesian belief propagation for inference
• See review HERE [Geo09]
• HTM has been applied to ASR: HERE [Dor06]
o Fully Bayesian networks
§ No hybrid approach, these networks fully
implement the directed acyclic and causal features of a full BN; see HERE [Pea87] and [Pea88]; for a decent review with comparisons to SBNs and
BMs, see [Nea90]
§ These networks found early application in
expert systems, specifically medical diagnosis systems
§ Advantages of these networks include that
they can be used as a compact representation for many naturally occurring distributions,
where the dependencies between variables arise from a relatively sparse network
of connections, resulting in relatively small conditional probability tables,
where representation of the problem domain probability distribution can be
constructed efficiently and reliably, assuming that appropriate high-level
expert domain knowledge is available; such BNs offer a framework for
constructing algorithms for different probabilistic reasoning tasks
§ As discussed in the early literature, BNs suffer from finding appropriate and tractable learning procedures; approaches like gradient ascent must be constrained to avoid invalid solutions or getting stuck at local maxima; likewise, for a general BN structure, probabilistic reasoning problems, such as the combinatorial optimization problem of finding the maximum a posteriori probability (MAP), are NP-hard (see [Myl99] and refs therein, and also HERE [Abd98])
§ Useful references:
• Approaches for constructing BNs from sample
data, combining domain expert knowledge with ML: HERE [Hec95]
• Survey of algos for real-time BN inference: HERE [Guo02]
• Finding MAPs using modified BNs (high-order
RNNs): HERE [And09] and HERE [And12]
§ Recent NLP research examples: HERE [Wei06], HERE [Mey05], and undoubtedly many others, as I have not done an exhaustive
search
o Advantages:
§ Architectures are self-organizing, in that
they can be made to generalize from training data even though they have not
been explicitly “instructed” on what to learn; this can be very useful when
dealing with linguistic phenomena that are not well-understood – when it
is not clear what needs to be learned by a system in order for it to effectively
handle such a phenomenon (unsupervised learning and deep learning)
§ Successful in discovering useful features of
words and joint models of multiple tasks; exploit similarities between words by
training feature-based representations of them
§ Architectures are fault tolerant, due to the
distributed nature of knowledge representation/(memory) storage; as increasing
numbers of their components become inoperable, their performance degrades
gradually
§ Weights or probabilities can be adapted in real-time
to improve performance
§ Effective in modeling nonlinear
transformations between inputs and outputs, due to the nonlinearity within each
computational element
§ Can improve computational efficiency and
accuracy over traditional tagging and parsing techniques; specifically, successful
at improving accuracy over n-gram models by exploiting similarities
between words, and thereby estimating reliable statistics even for large n-grams
§ Convolutional nets (CNNs), recursive nets
(RvANNs) and self-organizing maps (SOMs) are increasingly showing improved
utility for NLP tasks and modeling over other approaches
§ Bayesian nets/hybrid random fields/stochastic
nets vs. deterministic neural nets:
• How do you deal with missing data in a neural
network?
• How do you find out how sure a neural network
is of its answer?
• How did the neural network derive its answer,
what was its logic process?
o Disadvantages:
§ Possible for a system to be over-trained and
thus diminish its capability to generalize – only the training data can
be recognized
§ Due to their massive parallelism, and their
usual implementation on non-parallel architectures (as modular components),
such systems may be ineffective from a runtime complexity perspective for many
real-time tasks in human-computer interaction
§ General concern is for either intractable training of sufficiently complex networks that might represent language tasks and modeling beyond simple sequences, or the relatively long training times that may exist even in tractable approaches such as deep learning and DBNs; there are also the issues of inductive bias and intractable (or only marginally tractable) probabilistic reasoning/inferencing
§ Specific criticism of deep belief networks
(DBNs) and associated layer-by-layer deep learning techniques: as classifiers,
DBNs can underperform other learning algos/classifiers; one reason cited
[Mca08] is due to the fact that DBNs iteratively learn “features-of-features”
in each level’s RBM; if the network has an appropriate implementation for the
task at hand, this will potentially lead to a very high accuracy classifier; however,
if the network and data do not work perfectly well together, this
feature-of-feature learning can potentially lead to recursively learning
features that do not appropriately model the training data (however this
property may be useful for language data, which has inherent recursive
structure); one solution cited is to use appropriate continuous-valued neuron
representations
o Algorithms: See above examples
• Hybrid
o Based on different variations of compound architectures and linguistic models attempting to use the best approach (symbolic, statistical or connectionist) for a given modeling subproblem in an application
o Recent research focuses on hybrid approaches, which are outlined above under statistical and connectionist
An example list of
open or commercial toolkits, standards and research groups (by no means
complete, and will be revised periodically):
NLP software
toolkits: GATE, OpenNLP, NLTK, CMUSLM, SENNA; see also Torch5, a collection of ML algos
ASR software
toolkits: HTK (HMM), Sphinx, CSLU, ATT-GRM, SRILM, RWTH, Shout
ASR standards: VoiceXML (for a dated review, see HERE), NIST (benchmarks HERE, tools HERE, history HERE)
NLP/ASR research
groups: Stanford NLPG, R. Collobert/IDIAP-RI, OHSU/CSLU, Microsoft Research, UofToronto-CSAI, CMU-SpeechLM
(Disclaimer: This primer is
meant to inform. I encourage readers who find factual errors or deficits to
contact me (contact link below). I also welcome constructive and
friendly comments, suggestions and dialogue.)
References
and Endnotes:
[1] “Watson Doesn’t Know it Won on Jeopardy!” J. Searle, Wall Street Journal, Feb. 2011.
[2] “Natural Language Processing: A Human–Computer Interaction Perspective,” B. Manaris, Advances in Computers (Marvin V. Zelkowitz, ed.), vol. 47, pp. 1-66, Academic Press, New York, 1998.
[3] “A Brief Tour of Human Consciousness,” V.S. Ramachandran, Pi Press, 2004.
[4] “Research Developments and Directions in Speech Recognition and Understanding,” J.M. Baker et al., IEEE Signal Processing Magazine, vol. 26 (3), p.75, May 2009. A link to this review can be found HERE.
[Pet06] “Learning Accurate, Compact, and Interpretable Tree Annotation,” S. Petrov, Proceedings of the 21st International Conference on Computational Linguistics, 2006.
[Jua04] “Automatic Speech Recognition – A Brief History of the Technology Development,” B.H. Juang and L.R. Rabiner, 2004.
[Rab04] “Speech Recognition: Statistical Models,” L.R. Rabiner and B.H. Juang, 2004.
[Jur08] “Speech and Language Processing,” D. Jurafsky and J.H. Martin, Prentice-Hall, 2008.
[Chu07] “Compressing Trigram Language Models With Golomb Coding,” K. Church et al., Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.
[Hend10] “Artificial Neural Networks,” J.B. Henderson, in “The Handbook of Computational Linguistics and Natural Language Processing,” ed. A. Clark et al., 2010.
[Bou88] “Auto-Association by Multilayer Perceptrons and Singular Value Decomposition,” H. Bourlard and Y. Kamp, Biological Cybernetics vol. 59, p.291 (1988).
[Hin06(1)] “Reducing the Dimensionality of Data with Neural Networks,” G. E. Hinton and R. R. Salakhutdinov, Science, vol. 313 (5786), p.504, Jul. 2006.
[Fra98] “A general framework for adaptive processing of data structures,” P. Frasconi et al., IEEE Trans. Neural Networks, vol. 9 (5), p.768, Sep. 1998.
[Pol90] “Recursive distributed representations,” J.B. Pollack, Artificial Intelligence, vol. 46 (1), p.77, Nov. 1990.
[Gol96] “Learning task-dependent distributed representations by backpropagation through structure,” C. Goller and A. Kuchler, IEEE Conf. on Neural Networks, Jun. 1996.
[Schm12] See J. Schmidhuber’s excellent website on recurrent neural nets, with an emphasis on LSTMs and their application to NLP/ASR, handwriting recognition, etc., linked HERE.
[Sut11] “Generating Text with Recurrent Neural Networks,” I. Sutskever et al., Proceedings of the 28th International Conference on Machine Learning, 2011.
[Schu97] “Bi-directional Recurrent Neural Networks [for Speech Recognition],” M. Schuster and K.K. Paliwal, IEEE Trans. Signal Processing, vol. 45 (11), p.2673, Nov. 1997.
[Kos88] “Bidirectional Associative Memory,” B. Kosko, IEEE Trans. Systems, Man, and Cybernetics, vol. 18 (1), Jan. 1988.
[Ben03] “A Neural Probabilistic Language Model,” Y. Bengio et al., Journal of Machine Learning Research, vol.3, p.1137, 2003.
[Mor05] “Hierarchical Probabilistic Neural Network Language Model,” F. Morin and Y. Bengio, AISTATS, 2005.
[Col08] “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” R. Collobert and J. Weston, 2008.
[Mni07] “Three New Graphical Models for Statistical Language Modeling,” A. Mnih and G.E. Hinton, Proceedings of the 24th International Conference on Machine Learning, 2007.
[Mni09] “A Scalable Hierarchical Distributed Language Model,” A. Mnih and G.E. Hinton, Advances in Neural Information Processing Systems 21, 2009.
[Ben09] “Learning Deep Architectures for AI,” Y. Bengio, Foundations and Trends in Machine Learning, vol. 2 (1), 2009.
[Lec07] “Energy-Based Models in Document Recognition and Computer Vision,” Y. LeCun et al., Ninth Intl. Conf. on Document Analysis and Recognition, 2007.
[Mar09] “Deep learning via Hessian-free optimization,” J. Martens, Proceedings of the 27th International Conference on Machine Learning, 2010.
[Koh90] “The Self-Organizing Map,” T. Kohonen, Proc. IEEE, vol. 78 (9), Sept. 1990.
[Col07] “Fast Semantic Extraction Using a Novel Neural Network Architecture,” R. Collobert and J. Weston, 2007.
[Hos99] “Speech Recognition Using Neural Networks,” J.P. Hosom et al., Center for Spoken Language Understanding, OGI, 1999.
[Lev00] “A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies,” E. Levin et al., IEEE Transactions on Speech and Audio Processing, vol.8, p.11, Jan. 2000.
[Bea01] “Neural Networks in Automatic Speech Recognition,” F. Beaufays et al., Published in Handbook of Brain Theory and Neural Networks, 2000.
[Scha00] “CLSU-HMM: The CLSU Hidden Markov Modeling Environment,” J. Schalkwyk et al., Center for Spoken Language Understanding, OGI, 2000.
[Bur11] “Self organizing maps in NLP: exploration of coreference feature space,” A. Burkovski et al., Proceedings of the 8th international conference on Advances in self-organizing maps, 2011.
[Pov06] “Neural Network Models for Language Acquisition: A Brief Survey,” J. Poveda and A. Vellido, Lecture Notes in Computer Science, vol. 4224, p. 1346, 2006.
[Li02] “A Self-Organizing Connectionist Model of Bilingual Processing,” Bilingual Sentence Processing, P. Li and I. Farkas, vol.59, p.85, 2002
[Hon97] “Self-Organizing Maps in Natural Language Processing,” T. Honkela, Ph.D. Thesis, 1997.
[Wes08] “Deep Learning via Semi-Supervised Embedding,” Weston et al., Proceedings of the 25th International Conference on Machine Learning, 2008.
[Dah12] “Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition,” G.E. Dahl et al., IEEE Trans. Audio, Speech and Lang. Processing, Jan. 2012.
[Jai12] “Application of Pretrained Deep Neural Networks to Large Vocabulary Conversational Speech Recognition,” N. Jaitly et al., U. Toronto, Mar. 2012.
[Soc11] “Parsing Natural Scenes and Natural Language with Recursive Neural Networks,” R. Socher et al., Proceedings of the 28th International Conference on Machine Learning, 2011.
[Sal09] “Deep Boltzmann Machines,” R. Salakhutdinov and G.E. Hinton, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009.
[Sal10] “Efficient Learning of Deep Boltzmann Machines,” R. Salakhutdinov and H. Larochelle, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
[Wal04] “Conditional Random Fields: An Introduction,” H.M. Wallach, 2004.
[Hif09] “Speech Recognition Using Augmented Conditional Random Fields,” Y. Hifny and S. Renals, IEEE Trans. Audio, Speech and Lang. Processing, Feb. 2009.
[Yu10] “Deep-Structured Hidden Conditional Random Fields for Phonetic Recognition,” D. Yu and Li Deng, Proc. Interspeech, 2010.
[Nea90] “Learning Stochastic Feedforward Networks,” R.M. Neal, UofToronto, Nov. 1990.
[Hin95] “The wake-sleep algorithm for unsupervised neural networks,” G.E. Hinton et al., Science, vol. 268, p.1558, 1995.
[Sau96] “Mean Field Theory for Sigmoid Belief Networks,” L.K. Saul et al., Journal of Artificial Intelligence Research, vol. 4, p. 61, 1996.
[Myl99] “Massively Parallel Probabilistic Reasoning with Boltzmann Machines,” P. Myllymaki, Applied Intelligence, vol. 11, p.31, 1999.
[Hin06(2)] “A fast learning algorithm for deep belief nets,” G.E. Hinton et al., Neural Computation, vol. 18, p.1527, 2006.
[Lee09] “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations,” H. Lee et al., Proceedings of the Twenty-sixth International Conference on Machine Learning, 2009.
[Moh12] “Acoustic Modeling using Deep Belief Networks,” A. Mohamed et al., submitted to IEEE Trans. Audio, Speech and Lang. Processing, 2012.
[Sar11] “Deep Belief Nets for Natural Language Call-Routing,” R. Sarikaya et al., ICASSP, 2011.
[Tit07] “Constituent Parsing with Incremental Sigmoid Belief Networks,” I. Titov and J. Henderson, Proc. 45th Meeting of Association for Computational Linguistics, 2007.
[Hen08] “A Latent Variable Model of Synchronous Parsing for Syntactic and Semantic Dependencies,” J. Henderson et al., Proceedings of the CoNLL-2008 Shared Task, 2008.
[Des09] “A Deep Learning Approach to Machine Transliteration,” T. Deselaers et al., Proceedings of the Fourth Workshop on Statistical Machine Translation, 2009.
[Zho10] “Active Deep Networks for Semi-Supervised Sentiment Classification,” S. Zhou et al., Harbin Inst. Of Tech., 2010.
[Mca08] “Document Classification using Deep Belief Nets,” L. McAfee, Stanford CS, 2008.
[Get07] “Introduction to statistical relational learning,” L. Getoor and B. Taskar, MIT Press, 2007.
[Geo09] “Towards a Mathematical Theory of Cortical Micro-circuits,” D. George and J. Hawkins, PLoS Computational Biology, vol. 5, 2009.
[Dor06] “Hierarchical Temporal Memory Networks for Spoken Digit Recognition,” J. van Doremalen, Ph.D. thesis, Radboud University, 2006.
[Pea87] “Evidential Reasoning Using Stochastic Simulation of Causal Models,” J. Pearl, Artificial Intelligence, vol. 32, p.245, 1987.
[Pea88] “Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,” J. Pearl, Morgan Kaufmann Publishers, San Mateo, CA, 1988.
[Abd98] “Approximating MAPs for belief networks is NP-hard and other theorems,” A.M. Abdelbar and S.M. Hedetniemi, Artificial Intelligence vol.102, p.21, 1998.
[Hec95] “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data,” D. Heckerman et al., Machine Learning, vol. 20(3), p.197, Sept. 1995.
[Guo02] “A Survey of Algorithms for Real-Time Bayesian Network Inference,” H. Guo and W. Hsu, American Association for Artificial Intelligence Technical Report, 2002.
[And09] “Finding MAPs Using High Order Recurrent Networks,” E.A.M. Andrews and A.J. Bonner, Proceedings of the 16th international conference on neural information processing: Part I, 2009.
[And12] “Finding MAPs using strongly equivalent high order recurrent symmetric connectionist networks,” E.A.M. Andrews and A.J. Bonner, Cognitive Systems Research vol.14, p.50, 2012.
[Mey05] “Comparing Natural Language Processing Tools to Extract Medical Problems from Narrative Text,” S.M. Meystre and P.J. Haug, AMIA Symposium Proceedings, 2005.
[Wei06] “Bayesian Network, a model for NLP?” D. Weissenbacher, Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 2006.